A Delphi implementation of the BERT tokenizer — `TBertTokenizer` — inspired by the .NET FastBertTokenizer. It supports WordPiece tokenization and converts text into token ID sequences ready for input into BERT models via ONNX.
- Loads vocabulary from a file (`vocab.txt`) or stream
- Loads tokenizer configuration from Hugging Face (`tokenizer.json`)
- Converts raw text into token ID arrays
- Decodes token IDs back into readable text
- Compatible with TONNXRuntime for ONNX inference
- Main unit: `Src/BertTokenizer/BertTokenizer.pas`
- Core class: `TBertTokenizer`
```pascal
uses
  BertTokenizer, BertTokenizer.Extensions;

procedure LoadTokenizerAndEncode;
begin
  var Tokenizer := TBertTokenizer.Create;
  try
    // Downloads and loads the tokenizer definition for the given repo
    Tokenizer.LoadFromHuggingFace('TaylorAI/bge-micro-v2');
    var Tokens := Tokenizer.Encode('Hello, world!');
    // Tokens can now be passed to a model using TONNXRuntime
  finally
    Tokenizer.Free;
  end;
end;
```

| Method | Description |
|---|---|
| `LoadVocabulary(FileName, ...)` | Loads vocabulary from a `vocab.txt` file |
| `LoadVocabularyFromStream(Stream, ...)` | Loads vocabulary from a stream |
| `LoadTokenizerJson(FileName)` | Loads tokenizer from a Hugging Face `tokenizer.json` |
| `LoadTokenizerJsonFromStream(Stream)` | Loads tokenizer from a JSON stream |
| `Encode(Text)` | Tokenizes input text and returns an array of token IDs |
| `Decode(Tokens)` | Decodes an array of token IDs back to text (see the round-trip sketch below) |
| **uses `BertTokenizer.Extensions`** | |
| `LoadFromHuggingFace(HuggingFaceRepo)` | Loads tokenizer from a Hugging Face repo by name |
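As a quick illustration of the table above, here is a minimal encode/decode round-trip sketch. Two assumptions: the element type of the returned array is taken to be `Int64` (the table only says "array of token IDs"), and `RoundTrip` is a hypothetical helper name.

```pascal
uses
  System.SysUtils, BertTokenizer, BertTokenizer.Extensions;

// Hypothetical helper: encodes a string and decodes it again.
procedure RoundTrip;
var
  Tokenizer: TBertTokenizer;
  Ids: TArray<Int64>; // element type assumed; see lead-in
  Decoded: string;
begin
  Tokenizer := TBertTokenizer.Create;
  try
    Tokenizer.LoadFromHuggingFace('TaylorAI/bge-micro-v2');
    Ids := Tokenizer.Encode('Delphi meets BERT.');
    // Decode should reproduce the input text, modulo casing and
    // special tokens such as [CLS]/[SEP].
    Decoded := Tokenizer.Decode(Ids);
    Writeln(Format('%d tokens -> %s', [Length(Ids), Decoded]));
  finally
    Tokenizer.Free;
  end;
end;
```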
- `vocab.txt` — standard BERT vocabulary file
- `tokenizer.json` — Hugging Face tokenizer format (WordPiece-based); see the loading sketch below
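Both formats can also be loaded from local files or streams for offline use. A sketch under one assumption: the parameters elided as `...` in the table above are optional, so the calls below pass only the mandatory arguments.

```pascal
uses
  System.Classes, System.SysUtils, BertTokenizer;

procedure LoadLocal;
var
  Tokenizer: TBertTokenizer;
  Stream: TFileStream;
begin
  Tokenizer := TBertTokenizer.Create;
  try
    // Classic BERT vocabulary; the "..." parameters from the table
    // are assumed optional and left at their defaults.
    Tokenizer.LoadVocabulary('vocab.txt');
  finally
    Tokenizer.Free;
  end;

  Tokenizer := TBertTokenizer.Create;
  try
    // Hugging Face tokenizer.json read from a stream (e.g. a download
    // or an embedded resource).
    Stream := TFileStream.Create('tokenizer.json', fmOpenRead);
    try
      Tokenizer.LoadTokenizerJsonFromStream(Stream);
    finally
      Stream.Free;
    end;
  finally
    Tokenizer.Free;
  end;
end;
```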
- Delphi 10.2 Tokyo+
- No external dependencies required for core functionality
- Optional test suite using DUnitX
This project includes unit tests using DUnitX.
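For orientation, a minimal DUnitX fixture along those lines might look like this; the `tokenizer.json` path and the test content are illustrative, not taken from the shipped suite.

```pascal
unit BertTokenizer.Tests.Example;

interface

uses
  DUnitX.TestFramework, BertTokenizer;

type
  [TestFixture]
  TBertTokenizerTests = class
  private
    FTokenizer: TBertTokenizer;
  public
    [Setup]
    procedure Setup;
    [TearDown]
    procedure TearDown;
    [Test]
    procedure Encode_NonEmptyInput_ReturnsTokens;
  end;

implementation

procedure TBertTokenizerTests.Setup;
begin
  FTokenizer := TBertTokenizer.Create;
  FTokenizer.LoadTokenizerJson('tokenizer.json'); // path assumed
end;

procedure TBertTokenizerTests.TearDown;
begin
  FTokenizer.Free;
end;

procedure TBertTokenizerTests.Encode_NonEmptyInput_ReturnsTokens;
begin
  // Any non-empty input should yield at least one token ID.
  Assert.IsTrue(Length(FTokenizer.Encode('Hello, world!')) > 0);
end;

initialization
  TDUnitX.RegisterTestFixture(TBertTokenizerTests);

end.
```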
The output of `Encode` is compatible with ONNX BERT models. You can use TONNXRuntime to run inference on the tokenized input (`input_ids`).
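The TONNXRuntime call sequence itself is beyond this README, but here is a sketch of the hand-off: BERT ONNX models usually expect `input_ids` and `attention_mask` as Int64 tensors of shape `[BatchSize, SeqLen]`. `PrepareInputs` is a hypothetical helper, and the padding-token ID 0 is an assumption that depends on the vocabulary.

```pascal
uses
  BertTokenizer;

// Hypothetical helper: pads/truncates the encoded IDs to a fixed length
// and builds the matching attention mask. The two arrays represent
// [1, SeqLen] tensors, flattened row-major.
procedure PrepareInputs(Tokenizer: TBertTokenizer; const AText: string;
  SeqLen: Integer; out InputIds, AttentionMask: TArray<Int64>);
var
  Ids: TArray<Int64>; // element type assumed, as above
  I: Integer;
begin
  Ids := Tokenizer.Encode(AText);
  SetLength(InputIds, SeqLen);
  SetLength(AttentionMask, SeqLen);
  for I := 0 to SeqLen - 1 do
    if I < Length(Ids) then
    begin
      InputIds[I] := Ids[I];
      AttentionMask[I] := 1; // real token
    end
    else
    begin
      InputIds[I] := 0;      // assumed [PAD] ID; check your vocabulary
      AttentionMask[I] := 0; // padding
    end;
  // InputIds and AttentionMask can now be bound to the model's
  // "input_ids" and "attention_mask" inputs via TONNXRuntime.
end;
```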
MIT License — free to use, modify, and distribute.