BigramLanguageModel Repo

This repository is about learning both the BigramLanguageModel and tokenization. The language model is implemented from scratch and predicts the next token based on statistics from the text it has been fitted to.

The data used for this project: Frankenstein (downloaded from Project Gutenberg).


Generated example text from the BigramLanguageModel

Here is some text generated from the model. As you can see, it holds the style of a text but of course doesn't make any sense, since the model only takes the previous token (character/word) into consideration.

Character level tokenization

The output isn't even readable, since the model only looks at the previous character. Almost any character can plausibly follow any other, since most character pairs can be found next to each other somewhere in words.

Haprs ce;A at. I berd o tored the asholltre wad; cow rosapuer wit thanifer comealix shel mif merolle pud.” an s, fachanlles litle t core t fegit thint hed atralltstur yon, y okqbl t aghingneerverpos ons utheds, soaveatooue issed h cofe h f mast; ttithouth beigo worfis ly fof rytagur stound is cin sisevealy upa min ure wenggedisangror, ta my.

Word level tokenization

The generated text is readable but doesn't make much sense. This is because every pair of adjacent words has been seen next to each other in the text the bigram model was fitted to.

In other traveller might become one look as a dark night, and inhuman infidels feet. As soon as he asked, and guided touch willowy entreat 6 Chapter when they really express to those horrors of the pangs of that strewed immoderate speaks, when so dear Frankenstein, would bestow blessings on the insurance for their victim, and that made completes our good spirit of evil enchanted lessons of the old man! My strength; rain had resolved to give them overlook my beloved girl confirmed the country. It impressed themselves.


Implementation

The following section holds the setup and how to use it. To try it out, run main.py, specify the paths, etc.

Data preparation

I have decided to divide the book into sequences of its paragraphs, meaning placing boundary tokens, BOS at the beginning and EOS at the end of each paragraph (line break). You can choose to split it in other ways too, e.g. sentence-wise or document-wise, depending on what you want to achieve with the model. I want to create small stories, which is why I chose paragraph-wise splitting. It is implemented in data_preperation.py.

To add boundary tokens:

    add_BOS_EOS()
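
As a rough illustration, here is a minimal sketch of what paragraph-wise boundary insertion could look like. The actual add_BOS_EOS() in data_preperation.py may differ in signature and details:

    # Hypothetical sketch: wrap each paragraph (one per line break)
    # in BOS/EOS boundary tokens.
    def add_BOS_EOS(text, bos="<BOS>", eos="<EOS>"):
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
        return [f"{bos} {p} {eos}" for p in paragraphs]

    # add_BOS_EOS("First paragraph.\nSecond paragraph.")
    # -> ['<BOS> First paragraph. <EOS>', '<BOS> Second paragraph. <EOS>']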

Tokenization

Character- and word-level tokenization are implemented in the file tokenization.py. The word-level tokenizer allows punctuation to be specified as separate tokens. Tokenization is explained in Tokenization_explination.md.

To create a map from string → integer and integer → string, use the following function. It creates a JSON file with the mapping for the encoder and decoder to use.

    create_<tokenization>_tok_map()
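
For example, a character-level map could be built roughly like this (a sketch only; the repository's actual function may differ):

    import json

    # Hypothetical sketch: build string -> integer and integer -> string
    # maps from the characters of a text and store them as JSON.
    def create_char_tok_map(text, path="char_tok_map.json"):
        vocab = sorted(set(text))
        stoi = {ch: i for i, ch in enumerate(vocab)}
        itos = {i: ch for ch, i in stoi.items()}
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"stoi": stoi, "itos": itos}, f)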

Select an encoder:

    <Tokenization>_encode()

And select a decoder:

    <Tokenization>_decode()
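
As a hedged sketch, character-level encoding and decoding on top of the JSON map above could look like this (the repository's actual functions may differ):

    import json

    # Hypothetical sketch: encode/decode using the JSON map created above.
    def char_encode(text, path="char_tok_map.json"):
        with open(path, encoding="utf-8") as f:
            stoi = json.load(f)["stoi"]
        return [stoi[ch] for ch in text]

    def char_decode(tokens, path="char_tok_map.json"):
        with open(path, encoding="utf-8") as f:
            itos = json.load(f)["itos"]
        return "".join(itos[str(t)] for t in tokens)  # JSON keys are strings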

BigramLanguageModel

The bigram model is implemented from scratch and predicts the next token based only on the previous one. It draws each token from a distribution estimated from the Frankenstein text.

It is implemented in model.py.

    # To create a model object
    model = BigramLangaueModel()

    # To fit a text (calculate the next-token distribution from bigram counts)
    model.fit()

    # To generate text by sampling from the fitted distribution
    model.sample()
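
Conceptually, fitting a bigram model just means counting how often each token follows each other token, and sampling means drawing the next token in proportion to those counts. The following is a self-contained sketch of that idea, not the repository's exact implementation:

    import random
    from collections import defaultdict

    # Conceptual sketch of a bigram language model: count next-token
    # frequencies in fit(), then sample proportionally to them.
    class BigramSketch:
        def fit(self, tokens):
            self.counts = defaultdict(lambda: defaultdict(int))
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

        def sample(self, start="<BOS>", max_len=50):
            out = [start]
            while len(out) < max_len and out[-1] in self.counts:
                options = self.counts[out[-1]]
                nxt = random.choices(list(options), weights=list(options.values()))[0]
                out.append(nxt)
                if nxt == "<EOS>":  # stop at a paragraph boundary
                    break
            return out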
