
⚡️ Kénotron

Pretraining models made easy

Kénotron is a library for pretraining transformer models. It provides a simple, flexible API for pretraining on custom datasets and is built with the following principles in mind:

  • Simplicity: an easy-to-use API to pretrain models on custom datasets.
  • Scalability: the latest techniques for training models efficiently at scale.
  • Speed: this fork of Nanotron focuses on HPC-oriented optimizations, typically shipped as C++ extensions.

Installation

We recommend using Spack to install Kénotron.

git clone -c feature.manyFiles=true --depth=2 https://github.com/spack/spack.git
cd spack/bin
./spack repo add --name korovod https://github.com/korovod/korovod-spack-packages.git
./spack install py-kenotron

Spack also lets you install a specific version, e.g. py-kenotron@0.4.0 or py-kenotron@main.
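For example, to pin the 0.4.0 release mentioned above:

./spack install py-kenotron@0.4.0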

Tip

It is advised to maintain a proper Spack environment to ensure reproducibility. You can find examples in the toolchains directory.
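As a minimal sketch of such an environment (the pinned version and variant below are illustrative; adapt them to your toolchain), a spack.yaml could look like:

spack:
  specs:
    - py-kenotron@0.4.0 +datastates
  concretizer:
    unify: true

Activating it (./spack env activate -d .) and running ./spack install records the concretized stack in spack.lock, which is what makes the build reproducible.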

Extensions

To install an extension, simply use the corresponding Spack variant:

./spack install py-kenotron +datastates +nanosets

Several examples ship with Kénotron; some of them require installing a Spack variant:

Example           | Description                                     | Spack variant
datastates        | Asynchronous checkpointing                      | py-kenotron +datastates
nanosets          | Use the datatrove library to load data          | py-kenotron +nanosets
custom-dataloader | Plug a custom dataloader into Kénotron          | py-kenotron
doremi            | Use DoReMi to speed up training                 | py-kenotron
mamba             | Train an example Mamba model                    | py-kenotron
moe               | Train an example Mixture-of-Experts (MoE) model | py-kenotron
mup               | Use spectral µTransfer to scale up your model   | py-kenotron
s3                | Automatically upload checkpoints to S3          | py-kenotron

Building a container

Before building a container, you need to define a Spack environment in toolchains/<your_toolchain>/spack.yaml. Please read about Spack environments and take inspiration from existing toolchains. Spack will generate your Dockerfile for you.

cd kenotron/toolchains/<your_toolchain>
vim spack.yaml
./spack containerize > Dockerfile
docker build -t kenotron .
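To run the resulting image with GPU access (this assumes the NVIDIA Container Toolkit is installed on the host; the tag matches the build command above):

docker run --gpus all -it --rm kenotron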

As we are doing HPC (and we are serious about it 😛), we prefer Apptainer (formerly Singularity) over Docker. To use Apptainer, set the container: format: key in your spack.yaml as follows:

spack:
  container:
    format: singularity
    images:
      os: "ubuntu:24.04"
      spack: latest
    strip: true

You can then build the SIF image using ./spack containerize > kenotron.def && apptainer build kenotron.sif kenotron.def.
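To check that the image can see the GPUs, a quick sanity check could be the following (--nv exposes the host NVIDIA driver and libraries inside the container):

apptainer exec --nv kenotron.sif python -c "import torch; print(torch.cuda.is_available())"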

Quick Start

First, have a look at the Ultrascale Playbook, a comprehensive guide to scaling LLM training efficiently with Nanotron. Everything in this guide applies to Kénotron.

Predicting the memory that you will need

A good starting point is to understand the memory usage from model configurations. The Nanotron team created a tool for this purpose. Just paste your YAML configuration to generate memory diagrams.

Training a tiny Llama model

The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the checkpoints directory as specified in the config file.

CUDA_DEVICE_MAX_CONNECTIONS=1 python -m torch.distributed.run --nproc_per_node=8 run_train.py --config-file examples/llama/config_tiny_llama.yaml
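The parallel layout and the checkpoint location are both read from the YAML config. The excerpt below is a hedged sketch modelled on the upstream Nanotron example configs (key names may differ between versions, so check examples/llama/config_tiny_llama.yaml in your checkout); with dp=2, tp=2 and pp=2 the product matches the 8 GPUs used above.

checkpoints:
  checkpoints_path: checkpoints
  checkpoint_interval: 10
parallelism:
  dp: 2
  tp: 2
  pp: 2
  pp_engine: 1f1b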

For detailed instructions on training your first model, check out our Your First Training guide.

For multi-node training with Slurm, see our Multi-Node Training guide.
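As a rough starting point before reading that guide, a Slurm launch usually looks like the sketch below. Node count, GPU count, and the rendezvous settings are illustrative placeholders; the Multi-Node Training guide is authoritative.

#!/bin/bash
#SBATCH --job-name=kenotron
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# One torch.distributed.run launcher per node; all ranks rendezvous via the first node.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export CUDA_DEVICE_MAX_CONNECTIONS=1

srun python -m torch.distributed.run \
  --nproc_per_node=8 --nnodes=$SLURM_NNODES \
  --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
  run_train.py --config-file examples/llama/config_tiny_llama.yaml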

Run generation from your checkpoint

python -m torch.distributed.run --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
# We could set a larger TP for faster generation, and a larger PP in case of very large models.
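For instance, with two GPUs and a checkpoint whose weights can be resharded to that layout (an illustrative variation of the command above):

python -m torch.distributed.run --nproc_per_node=2 run_generate.py --ckpt-path checkpoints/10/ --tp 2 --pp 1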

Features

We currently support the following features:

  • 3D parallelism (DP+TP+PP)
  • Expert parallelism for MoEs
  • AFAB and 1F1B schedules for PP
  • Explicit APIs for TP and PP, which enable easy debugging
  • ZeRO-1 optimizer
  • FP32 gradient accumulation
  • Parameter tying/sharding
  • Custom module checkpointing for large models
  • Spectral µTransfer parametrization for scaling up neural networks
  • Mamba example
  • Asynchronous checkpointing

And we have on our roadmap:

  • Data-parallel checkpointing for reducing I/O pressure
  • FSDP
  • torch.compile support
  • Interleaved 1F1B schedule
  • Efficient expert parallelism

Models

The following models are currently supported:

  • Mistral 7B
  • Qwen
  • Llama 3.2
  • Llama 3.1
  • StarCoder2

Credits

We thank the Hugging Face team for their work on the original project.
