BigramLanguageModel Repo

This repository is about learning both the BigramLanguageModel and tokenization. The language model is implemented from scratch and predicts the next token based on statistics from the text it has been fitted to.

The data used for this project: Frankenstein (downloaded from Project Gutenberg).


Generated example text from the BigramLanguageModel

Here is some text generated from the model. As you can see, it holds the style of a text but of course doesn't make any sense, since the model only takes the previous token (character/word) into consideration.

Character level tokenization

The output isn't even readable, since the model only looks at the previous character. Almost any character can plausibly follow any other, since most character pairs can be found next to each other somewhere in words.

Haprs ce;A at. I berd o tored the asholltre wad; cow rosapuer wit thanifer comealix shel mif merolle pud.” an s, fachanlles litle t core t fegit thint hed atralltstur yon, y okqbl t aghingneerverpos ons utheds, soaveatooue issed h cofe h f mast; ttithouth beigo worfis ly fof rytagur stound is cin sisevealy upa min ure wenggedisangror, ta my.

Word level tokenization

The generated text is readable but doesn't make much sense. This is because every pair of adjacent words has been seen next to each other in the text the bigram model was fitted to.

In other traveller might become one look as a dark night, and inhuman infidels feet. As soon as he asked, and guided touch willowy entreat 6 Chapter when they really express to those horrors of the pangs of that strewed immoderate speaks, when so dear Frankenstein, would bestow blessings on the insurance for their victim, and that made completes our good spirit of evil enchanted lessons of the old man! My strength; rain had resolved to give them overlook my beloved girl confirmed the country. It impressed themselves.


Implementation

The following section holds the setup and how to use it. To try it out, run main.py, specify the paths, etc.

Data preparation

I have decided to divide the book into sequences of its paragraphs, meaning placing boundary tokens, BOS at the beginning and EOS at the end of each paragraph (line break). You can choose to split it in other ways too, e.g. sentence-wise or document-wise, depending on what you want to achieve with the model. I want to create small stories, which is why I chose paragraph-wise splitting. It is implemented in data_preperation.py.

To add boundary tokens:

    add_BOS_EOS()
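
As a rough illustration, here is a minimal sketch of what paragraph-wise boundary insertion could look like. The actual add_BOS_EOS() in data_preperation.py may differ in signature and details:

    # Hypothetical sketch: wrap each paragraph (one per line break)
    # in BOS/EOS boundary tokens.
    def add_BOS_EOS(text, bos="<BOS>", eos="<EOS>"):
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
        return [f"{bos} {p} {eos}" for p in paragraphs]

    # add_BOS_EOS("First paragraph.\nSecond paragraph.")
    # -> ['<BOS> First paragraph. <EOS>', '<BOS> Second paragraph. <EOS>']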

Tokenization

Character- and word-level tokenization are implemented in the file tokenization.py. The word-level tokenizer allows punctuation to be specified as separate tokens. Tokenization is explained in Tokenization_explination.md.

To create a map from string → integer and integer → string, use the following function. It creates a JSON file with the mapping for the encoder and decoder to use.

    create_<tokenization>_tok_map()
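
For example, a character-level map could be built roughly like this (a sketch only; the repository's actual function may differ):

    import json

    # Hypothetical sketch: build string -> integer and integer -> string
    # maps from the characters of a text and store them as JSON.
    def create_char_tok_map(text, path="char_tok_map.json"):
        vocab = sorted(set(text))
        stoi = {ch: i for i, ch in enumerate(vocab)}
        itos = {i: ch for ch, i in stoi.items()}
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"stoi": stoi, "itos": itos}, f)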

Select an encoder:

    <Tokenization>_encode()

And select a decoder:

    <Tokenization>_decode()
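
As a hedged sketch, character-level encoding and decoding on top of the JSON map above could look like this (the repository's actual functions may differ):

    import json

    # Hypothetical sketch: encode/decode using the JSON map created above.
    def char_encode(text, path="char_tok_map.json"):
        with open(path, encoding="utf-8") as f:
            stoi = json.load(f)["stoi"]
        return [stoi[ch] for ch in text]

    def char_decode(tokens, path="char_tok_map.json"):
        with open(path, encoding="utf-8") as f:
            itos = json.load(f)["itos"]
        return "".join(itos[str(t)] for t in tokens)  # JSON keys are strings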

BigramLanguageModel

The bigram model is implemented from scratch and predicts the next token based only on the previous one. It draws each token from a distribution estimated from the Frankenstein text.

It is implemented in model.py.

    # To create a model object
    model = BigramLangaueModel()

    # To fit a text (calculate the next-token distribution from bigram counts)
    model.fit()

    # To generate text by sampling from the fitted distribution
    model.sample()
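
Conceptually, fitting a bigram model just means counting how often each token follows each other token, and sampling means drawing the next token in proportion to those counts. The following is a self-contained sketch of that idea, not the repository's exact implementation:

    import random
    from collections import defaultdict

    # Conceptual sketch of a bigram language model: count next-token
    # frequencies in fit(), then sample proportionally to them.
    class BigramSketch:
        def fit(self, tokens):
            self.counts = defaultdict(lambda: defaultdict(int))
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

        def sample(self, start="<BOS>", max_len=50):
            out = [start]
            while len(out) < max_len and out[-1] in self.counts:
                options = self.counts[out[-1]]
                nxt = random.choices(list(options), weights=list(options.values()))[0]
                out.append(nxt)
                if nxt == "<EOS>":  # stop at a paragraph boundary
                    break
            return out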
