BertTokenizer4D

A Delphi implementation of the BERT tokenizer (TBertTokenizer), inspired by the .NET FastBertTokenizer. It supports WordPiece tokenization and converts text into token ID sequences that can be fed to BERT models via ONNX.

📦 Features

Loads a vocabulary from a file (vocab.txt) or a stream
Loads tokenizer configuration from a Hugging Face tokenizer.json
Converts raw text into token ID arrays
Decodes token IDs back into readable text
Compatible with TONNXRuntime for ONNX inference
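
To make the WordPiece step concrete, here is a minimal sketch of greedy longest-match-first splitting for a single word. It is not taken from TBertTokenizer: the helper WordPieceSplit, the three-entry toy vocabulary, and its IDs are assumptions made for illustration. It also skips the normalization and punctuation splitting that a full BERT tokenizer performs first.

```pascal
program WordPieceSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

// Hypothetical helper, not part of TBertTokenizer: splits one word into
// WordPiece tokens using greedy longest-match-first lookup.
function WordPieceSplit(const AWord: string;
  Vocab: TDictionary<string, Integer>): TArray<string>;
var
  Pieces: TList<string>;
  StartPos, EndPos: Integer;
  Candidate, Match: string;
begin
  Pieces := TList<string>.Create;
  try
    StartPos := 1;
    while StartPos <= Length(AWord) do
    begin
      Match := '';
      EndPos := Length(AWord);
      // Try the longest remaining substring first, then shrink it.
      while EndPos >= StartPos do
      begin
        Candidate := Copy(AWord, StartPos, EndPos - StartPos + 1);
        if StartPos > 1 then
          Candidate := '##' + Candidate; // continuation pieces carry a ## prefix
        if Vocab.ContainsKey(Candidate) then
        begin
          Match := Candidate;
          Break;
        end;
        Dec(EndPos);
      end;
      if Match = '' then
      begin
        // No piece matches: the whole word maps to the unknown token.
        Result := TArray<string>.Create('[UNK]');
        Exit;
      end;
      Pieces.Add(Match);
      StartPos := EndPos + 1;
    end;
    Result := Pieces.ToArray;
  finally
    Pieces.Free;
  end;
end;

var
  Vocab: TDictionary<string, Integer>;
  Piece: string;
begin
  // Toy vocabulary with made-up IDs, for illustration only.
  Vocab := TDictionary<string, Integer>.Create;
  try
    Vocab.Add('[UNK]', 0);
    Vocab.Add('token', 1);
    Vocab.Add('##izer', 2);
    for Piece in WordPieceSplit('tokenizer', Vocab) do
      Writeln(Piece, ' -> ', Vocab[Piece]);
  finally
    Vocab.Free;
  end;
end.
```

With this toy vocabulary, 'tokenizer' splits into 'token' and '##izer'. A word with no matching pieces comes back as a single '[UNK]'.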

📁 Project Structure

Main unit: Src/BertTokenizer/BertTokenizer.pas
Core class: TBertTokenizer

🚀 Quick Start

uses
  BertTokenizer, BertTokenizer.Extensions;

procedure LoadTokenizerAndEncode;
begin
  // Inline variable declarations require Delphi 10.3 Rio or later.
  var Tokenizer := TBertTokenizer.Create;
  try
    // LoadFromHuggingFace is provided by the BertTokenizer.Extensions unit.
    Tokenizer.LoadFromHuggingFace('TaylorAI/bge-micro-v2');
    var Tokens := Tokenizer.Encode('Hello, world!');
    // Tokens can now be passed to a model using TONNXRuntime
  finally
    Tokenizer.Free;
  end;
end;

🧠 Public API

Method	Description
`LoadVocabulary(FileName, ...)`	Loads vocabulary from a `vocab.txt` file
`LoadVocabularyFromStream(Stream, ...)`	Loads vocabulary from a stream
`LoadTokenizerJson(FileName)`	Loads tokenizer from a Hugging Face `tokenizer.json`
`LoadTokenizerJsonFromStream(Stream)`	Loads tokenizer from a JSON stream
`Encode(Text)`	Tokenizes input text and returns an array of token IDs
`Decode(Tokens)`	Decodes an array of token IDs back to text
`LoadFromHuggingFace(HuggingFaceRepo)`	Loads tokenizer from a Hugging Face repository by name (requires `BertTokenizer.Extensions` in the uses clause)

✅ Supported Tokenizer Formats

vocab.txt — standard BERT vocabulary file
tokenizer.json — Hugging Face tokenizer format (WordPiece-based)
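
In a standard BERT vocab.txt, each line holds one token and the zero-based line number is that token's ID. The sketch below shows that mapping only. BuildVocab is a hypothetical helper, not the library's LoadVocabulary, and the three sample tokens are placeholders.

```pascal
program VocabSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.Generics.Collections;

// Hypothetical helper, not the library's LoadVocabulary: maps each line to
// its zero-based index, which serves as the token ID.
function BuildVocab(Lines: TStrings): TDictionary<string, Integer>;
var
  I: Integer;
begin
  Result := TDictionary<string, Integer>.Create;
  for I := 0 to Lines.Count - 1 do
    Result.AddOrSetValue(Lines[I], I); // a repeated token keeps its last index
end;

var
  Lines: TStringList;
  Vocab: TDictionary<string, Integer>;
begin
  Lines := TStringList.Create;
  try
    // In-memory stand-in for a file; for a real file one could call
    // Lines.LoadFromFile('vocab.txt', TEncoding.UTF8) instead.
    Lines.Add('[PAD]');
    Lines.Add('[UNK]');
    Lines.Add('hello');
    Vocab := BuildVocab(Lines);
    try
      Writeln('hello -> ', Vocab['hello']);
    finally
      Vocab.Free;
    end;
  finally
    Lines.Free;
  end;
end.
```

Here 'hello' sits on the third line, so it maps to ID 2.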

✅ Dependencies

Delphi 10.2 Tokyo+
No external dependencies required for core functionality
Optional test suite using DUnitX

🧪 Test Coverage

This project includes unit tests using DUnitX.

🤖 BERT + ONNX Integration

The output of Encode is compatible with ONNX BERT models. You can use TONNXRuntime to run inference on tokenized input (input_ids).
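
Many BERT ONNX exports expect attention_mask and token_type_ids inputs alongside input_ids, usually as 64-bit integers. This README does not say whether Encode already returns them or which element type it uses, so the sketch below rests on assumptions: it takes a plain array of IDs and derives the two companion arrays for a single unpadded sentence. FilledLike is a hypothetical helper, the IDs are placeholders, and no TONNXRuntime calls appear because its API is not described here.

```pascal
program BertInputsSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns an array as long as InputIds, filled with Value.
function FilledLike(const InputIds: TArray<Int64>; Value: Int64): TArray<Int64>;
var
  I: Integer;
begin
  SetLength(Result, Length(InputIds));
  for I := 0 to High(Result) do
    Result[I] := Value;
end;

var
  InputIds, AttentionMask, TokenTypeIds: TArray<Int64>;
begin
  // Placeholder IDs for illustration; real values would come from Encode.
  InputIds := TArray<Int64>.Create(101, 7592, 102);
  AttentionMask := FilledLike(InputIds, 1); // 1 marks a real token; no padding here
  TokenTypeIds := FilledLike(InputIds, 0);  // 0 marks the first (only) segment
  Writeln(Length(InputIds), ' ', Length(AttentionMask), ' ', Length(TokenTypeIds));
end.
```

When several sequences are padded to a common length, the padded positions would get 0 in attention_mask instead of 1.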

📄 License

MIT License — free to use, modify, and distribute.

Samaliani/BertTokenizer4D