This is a Byte Pair Encoding (BPE) tokenizer for chess Portable Game Notation (PGN).
You can install it with your package manager of choice:
```bash
uv add pgn-tokenizer
```

or

```bash
pip install pgn-tokenizer
```

It exposes a simple interface with `.encode()` and `.decode()` methods and a `.vocab_size` property, but you can also access the underlying `PreTrainedTokenizerFast` class from the `transformers` library via the `.tokenizer` property.
```python
from pgn_tokenizer import PGNTokenizer
# Initialize the tokenizer
tokenizer = PGNTokenizer()
# Tokenize a PGN string
tokens = tokenizer.encode("1.e4 Nf6 2.e5 Nd5 3.c4 Nb6")
# Decode the tokens back to a PGN string
decoded = tokenizer.decode(tokens)
# Get the vocabulary from the underlying tokenizer class
vocab = tokenizer.tokenizer.get_vocab()
```

It uses the `tokenizers` library from Hugging Face for training the tokenizer and the `transformers` library from Hugging Face for initializing the tokenizer from the pretrained tokenizer model for faster tokenization.
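For reference, this is roughly how a fast tokenizer can be initialized from a trained `tokenizers` model file with `transformers`; the `tokenizer.json` path is an assumption for illustration, not necessarily how this package stores its model:

```python
from transformers import PreTrainedTokenizerFast

# Load a fast tokenizer directly from a trained tokenizers model file.
# The file name below is an assumption for illustration only.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")

ids = fast_tokenizer.encode("1.e4 e5 2.Nf3 Nc6")
text = fast_tokenizer.decode(ids)
```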
Note: This is part of a work-in-progress project to investigate how language models might understand chess without an engine or any chess-specific knowledge.
More traditional, language-focused BPE tokenizer implementations are not well suited to PGN strings because they tend to break the actual moves apart.
For example, `1.e4 Nf6` would likely be tokenized as `1`, `.`, `e`, `4`, `N`, `f`, `6` or `1`, `.e`, `4`, ` `, `N`, `f`, `6`, depending on the tokenizer's vocabulary, but with the specialized PGN tokenizer it is tokenized as `1.`, `e4`, `Nf6`.
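You can see this splitting behavior for yourself by inspecting how a general-purpose BPE vocabulary such as `cl100k_base` breaks up a move sequence; this snippet is only an illustration, and the exact token boundaries depend on the vocabulary:

```python
import tiktoken

# Split a PGN fragment with a general-purpose BPE vocabulary to see how
# moves get broken apart; exact boundaries depend on the vocabulary.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("1.e4 Nf6 2.e5 Nd5")
tokens = [enc.decode([token_id]) for token_id in token_ids]
print(tokens)
```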
Here is a visualization of the vocabulary of this specialized PGN tokenizer compared to the BPE tokenizer vocabularies of `cl100k_base` (the vocabulary for the `gpt-3.5-turbo` and `gpt-4` models' tokenizer) and `o200k_base` (the vocabulary for the `gpt-4o` model's tokenizer):
Note: The tokenizer was trained on ~2.8 million chess games in PGN notation with a target vocabulary size of 4096.
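A minimal sketch of what BPE training with the `tokenizers` library can look like is shown below; this is not the project's actual training script, and the pre-tokenizer choice, special tokens, and file names are assumptions for illustration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Not the project's actual training script: a minimal BPE training sketch.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Split only on whitespace so merges like "1." and "e4" can form inside a
# whitespace-separated chunk; the real project may pre-tokenize differently.
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["[UNK]"])
# The corpus file name is illustrative; any plain-text file of PGN games works.
tokenizer.train(files=["games.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```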
Note: These visualizations were generated with a function adapted from an educational Jupyter Notebook in the tiktoken repository.
- @karpathy for the Let's build the GPT Tokenizer tutorial
- Hugging Face for the `tokenizers` and `transformers` libraries
- Kaggle user MilesH14, whoever you are, for the now-missing dataset of 3.5 million chess games referenced in many places, including this research documentation


