This is my model, which is based on my original LLM that I created months ago, along with a more recently developed tokenizer. The model is split across multiple files so that preprocessing can be run separately from model training, and so the trained model can be run without retraining. The model should run right out of the box as long as generate.py, model.py, vocabulary_aid.py, model.pt, and merges.json are present and working correctly.
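For reference, an out-of-the-box run might look like this (a minimal sketch, assuming generate.py takes no required command-line arguments; check the script for any prompt or sampling options it defines):

    # All five files should sit in the same directory.
    ls generate.py model.py vocabulary_aid.py model.pt merges.json
    # Run generation with the pretrained checkpoint.
    python generate.py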
The model requires external libraries to run and/or train. The only one necessary to run the model is PyTorch (www.pytorch.org). If you wish to train the model, NumPy is also required, and if you want to adjust the pre-training process, specifically the data collection, BeautifulSoup is needed. In addition, the repository uses Git Large File Storage (LFS), so make sure Git LFS is installed before cloning or pulling. A typical setup is sketched below.
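This sketch uses the standard PyPI package names; adjust it for your own environment or package manager:

    # PyTorch is required just to run the model.
    pip install torch
    # NumPy is only needed if you plan to train.
    pip install numpy
    # BeautifulSoup is only needed if you plan to adjust or rerun data collection.
    pip install beautifulsoup4
    # Set up Git LFS so model.pt and other large files are downloaded as real
    # files rather than pointer stubs.
    git lfs install
    git lfs pull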
If you are training the model completely from scratch, here is the order in which the files need to be run. First, run trainVocab.py; this builds the vocabulary used for encoding and decoding. Then run dataCollect.py, which gathers text from the selected sites into a single, pre-encoded file; currently, all data comes from www.gutenberg.org. With those steps complete, run train.py to train the model on the data. Finally, once training finishes, run generate.py to test the model. The full sequence is sketched below.
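Put together, a from-scratch run looks roughly like this (a sketch assuming each script runs with no extra arguments; check each file for configurable paths or hyperparameters):

    python trainVocab.py   # build the vocabulary/merges used for encoding and decoding
    python dataCollect.py  # scrape and pre-encode the training data (currently from www.gutenberg.org)
    python train.py        # train the model on the collected data
    python generate.py     # sample from the newly trained model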