Kénotron is a library for pretraining transformer models. It provides a simple and flexible API for pretraining models on custom datasets, and is designed to be easy to use, fast, and scalable. It is built with the following principles in mind:
- Simplicity: Kénotron is designed to be easy to use, with a simple and flexible API for pretraining on custom datasets.
- Scalability: Kénotron uses the latest techniques to train models more efficiently at scale.
- Speed: This version of Nanotron focuses on HPC-oriented optimizations, typically made available via C++ extensions.
We recommend using Spack to install Kénotron.
```bash
git clone -c feature.manyFiles=true --depth=2 https://github.com/spack/spack.git
cd spack/bin
./spack repo add --name korovod https://github.com/korovod/korovod-spack-packages.git
./spack install py-kenotron
```

Spack allows you to install a specific version, e.g., `py-kenotron@0.4.0` or `py-kenotron@main`.
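For example, to pin the release mentioned above instead of the default version:

```bash
./spack install py-kenotron@0.4.0
```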
> [!TIP]
> It is advised to maintain a proper Spack environment to ensure reproducibility. You can find some examples in the toolchains directory.
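As a minimal sketch of what such an environment workflow can look like (the environment name is a placeholder, and `setup-env.sh` must be sourced so that `spack env activate` can modify your shell):

```bash
# Enable shell integration so `spack env activate` works in the current shell
. spack/share/spack/setup-env.sh

spack env create kenotron-env        # create a named environment (placeholder name)
spack env activate kenotron-env      # activate it
spack add py-kenotron +datastates    # record the spec in the environment's spack.yaml
spack install                        # concretize and install everything in the environment
```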
To install an extension, simply use the corresponding Spack variant:
```bash
./spack install py-kenotron +datastates +nanosets
```

Some examples are shipped with Kénotron, some of which require the installation of a Spack variant:
| Example | Description | Docs | Spack variant |
|---|---|---|---|
| datastates | Asynchronous checkpointing | Docs | py-kenotron +datastates |
| nanosets | Use the datatrove library to load data | Docs | py-kenotron +nanosets |
| custom-dataloader | Plug a custom dataloader to Kénotron | Docs | py-kenotron |
| doremi | Use DoReMi to speed up training | Docs | py-kenotron |
| mamba | Train an example Mamba model | Docs | py-kenotron |
| moe | Train an example Mixture-of-Experts (MoE) model | Docs | py-kenotron |
| mup | Use spectral µTransfer to scale up your model | Docs | py-kenotron |
| s3 | Automatically upload checkpoints to S3 | Docs | py-kenotron |
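Independently of the table above, the standard Spack query commands can help check which variants are available and which were actually built (the exact output format depends on your Spack version):

```bash
./spack info py-kenotron               # show the package's available versions and variants
./spack find --variants py-kenotron    # list installed specs with the variants they were built with
```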
Before building a container, you need to define a Spack environment in toolchains/<your_toolchain>/spack.yaml. Please read about Spack environments and take inspiration from existing toolchains. Spack will generate your Dockerfile for you.
```bash
cd kenotron/toolchains/<your_toolchain>
vim spack.yaml
./spack containerize > Dockerfile
docker build -t kenotron .
```

As we are doing HPC (and we are serious about it 😛), we prefer Apptainer (Singularity) over Docker. To use Apptainer, add the following `container: format:` key to your spack.yaml:
```yaml
spack:
  container:
    format: singularity
    images:
      os: "ubuntu:24.04"
      spack: latest
    strip: true
```

You can then build the SIF image using `./spack containerize > kenotron.def && apptainer build kenotron.sif kenotron.def`.
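As a rough sketch of running the training entry point inside the resulting image (the bind path and config file are placeholders; `--nv` exposes the host NVIDIA driver stack to the container):

```bash
apptainer exec --nv --bind /path/to/data:/data kenotron.sif \
  bash -c "CUDA_DEVICE_MAX_CONNECTIONS=1 python -m torch.distributed.run --nproc_per_node=8 \
           run_train.py --config-file examples/llama/config_tiny_llama.yaml"
```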
First, have a look at the Ultrascale Playbook, a comprehensive guide to efficiently scale LLM training with Nanotron. Everything in this guide applies to Kénotron.
A good starting point is to understand the memory usage of a given model configuration. The Nanotron team created a tool for this purpose: just paste your YAML configuration to generate memory diagrams.
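As a rough, library-agnostic sanity check to complement that tool: with mixed-precision Adam and no sharding, model and optimizer state alone is commonly estimated at roughly 18 bytes per parameter (bf16 weights, fp32 master weights, fp32 gradients, and two Adam moments), before counting activations:

```bash
# ~2 B bf16 weights + 4 B fp32 master copy + 4 B fp32 grads + 8 B Adam moments ≈ 18 B/param
python -c "params = 1.1e9; print(f'{params * 18 / 1e9:.1f} GB for model + optimizer state')"
```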
The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the checkpoints directory as specified in the config file.
```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 python -m torch.distributed.run --nproc_per_node=8 run_train.py --config-file examples/llama/config_tiny_llama.yaml
```

For detailed instructions on training your first model, check out our Your First Training guide.
For multi-node training with Slurm, see our Multi-Node Training guide.
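As a minimal Slurm sketch under assumed cluster settings (node and GPU counts, port, and config path are placeholders; the guide above is the reference):

```bash
#!/bin/bash
#SBATCH --job-name=kenotron
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# One launcher per node; torchrun spawns the per-GPU workers on each node.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

srun bash -c "CUDA_DEVICE_MAX_CONNECTIONS=1 python -m torch.distributed.run \
  --nnodes=$SLURM_NNODES --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  run_train.py --config-file examples/llama/config_tiny_llama.yaml"
```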
To generate text from a trained checkpoint:

```bash
python -m torch.distributed.run --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
# We could set a larger TP for faster generation, and a larger PP in case of very large models.
```

We currently support the following features:
- 3D parallelism (DP+TP+PP)
- Expert parallelism for MoEs
- AFAB and 1F1B schedules for PP
- Explicit APIs for TP and PP which enable easy debugging
- ZeRO-1 optimizer
- FP32 gradient accumulation
- Parameter tying/sharding
- Custom module checkpointing for large models
- Spectral µTransfer parametrization for scaling up neural networks
- Mamba example
- Asynchronous checkpointing
And we have on our roadmap:
- Data-parallel checkpointing for reducing I/O pressure
- FSDP
- `torch.compile` support
- Interleaved 1F1B schedule
- Efficient expert parallelism
The following models are currently supported:
- Mistral 7B
- Qwen
- Llama 3.2
- Llama 3.1
- StarCoder2
We thank the Hugging Face team for their work on the original project.
Some related projects are: