A from-scratch C++ training framework for large-scale models with multi-dimensional distributed parallelism.
- Recommended: NVIDIA Ampere-class GPUs (A100/A800) or newer
- CUDA / NCCL: Latest stable versions
- gcc / g++: Version 13+
- CMake: Version 3.13+
```bash
mkdir build
cd build
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j
```

Build Options:
- `USE_CUDA=ON`: Enable CUDA backend support.
- `USE_NCCL=ON`: Enable NCCL-based distributed communication.

Both options are optional and can be disabled for CPU-only builds.
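For instance, a CPU-only build can be configured by turning both options off (a minimal sketch using only the flags documented above):

```bash
mkdir build && cd build
cmake .. -DUSE_CUDA=OFF -DUSE_NCCL=OFF   # CPU-only: no CUDA backend, no NCCL communication
make -j
```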
| Category | Feature | Description | Status |
|---|---|---|---|
| Model Support | GPT-2 | Decoder-only Transformer language model | ✔ Supported |
| | LLaMA 3 | Modern LLaMA-family Transformer architecture | ✔ Supported |
| | DeepSeek-V3 | Large-scale MoE-based language model | 🗓 Planned |
| Precision | Multiple Data Types | FP32, BF16 | ✔ Supported |
| | Mixed Precision | Autocast-based BF16 compute with FP32 accumulation | ✔ Supported |
| Distributed Training | Data Parallel (DP) | Parameter-server-style data parallelism | ✔ Supported |
| | Distributed Data Parallel (DDP) | Collective-based data parallelism | ✔ Supported |
| | Tensor Parallelism (TP) | Intra-layer tensor sharding | ✔ Supported |
| | Sequence Parallelism (SP) | Sequence-dimension sharding | ✔ Supported |
| | Pipeline Parallelism (PP) | GPipe, 1F1B scheduling, Virtual Pipeline (vPP) | ✔ Supported |
| | Hybrid Parallelism | Arbitrary combination of DDP + TP + SP + PP | ✔ Supported |
| Core Components | Multi-backend | CPU and CUDA execution backends | ✔ Supported |
| | Multi-node Distributed Training | Distributed execution across multiple nodes | ✔ Supported |
| | Kernel Dispatcher | Kernel registration and dynamic dispatch mechanism | ✔ Supported |
| | Autograd | Automatic differentiation engine | ✔ Supported |
| | Autocast | Automatic mixed-precision runtime | ✔ Supported |
| Performance Optimizations | Compute–Comm Overlap | Explicit scheduling to hide communication latency | ✔ Supported |
| | DDP Gradient Bucketing | Deferred and bucketed gradient synchronization | ✔ Supported |
| | ZeRO-DP | DistributedOptimizer-based ZeRO-1 | 🚧 In Progress |
| Execution Mode | Training Mode | Full forward–backward training with autograd | ✔ Supported |
| | `no_grad` Inference | Forward-only execution without gradient tracking | ✔ Supported |
| Debugging & Tooling | Built-in Profiler | Kernel-level performance profiling | ✔ Supported |
| | Automated Benchmarking | One-click execution, log analysis, and Feishu export | ✔ Supported |
Each model in the `example/` directory is compiled into an independent executable. For example, the llama3 example produces a binary named `llama3`.
To view available runtime options:
```bash
./llama3 --help
```

The following examples demonstrate LLaMA 3 supervised fine-tuning (SFT) using InfiniTrain.
```bash
./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10
```
```bash
./infini_run \
    --nnodes=2 \
    --nproc_per_node=1 \
    --node_rank=[rank_id] \
    -- ./llama3 \
    --device cuda \
    --input_bin [training_data_path] \
    --llmc_filepath [model_path] \
    --num_iteration 10 \
    --nthread_per_process 8 \
    --batch_size 40 \
    --total_batch_size 10240 \
    --tensor_parallel 2 \
    --pipeline_parallel 2 \
    --sequence_parallel
```

Parallelism-related options:

```bash
--nthread_per_process 8        # ddp_size = nthread_per_process / (tensor_parallel × pipeline_parallel)
--tensor_parallel 4            # 4-way tensor parallelism
--sequence_parallel            # Enable sequence parallelism (requires TP > 1)
--pipeline_parallel 8          # 8 pipeline stages
--virtual_pipeline_parallel 4  # Virtual pipeline for better load balancing
```

Multiple parallelism strategies (DDP, TP, SP, PP) can be freely combined to scale training across devices and nodes.
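As a worked example of the `ddp_size` formula above, using the values from the multi-node command: `--nthread_per_process 8` with `--tensor_parallel 2` and `--pipeline_parallel 2` gives ddp_size = 8 / (2 × 2) = 2, i.e. a data-parallel degree of 2 on top of the 2-way TP × 2-way PP layout.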
- 2025/03/10 — InfiniTrain v0.1.0
  Initial framework prototype with MNIST CPU training.
- 2025/04/30 — InfiniTrain v0.3.0
  Added Autograd support and GPT-2 training on CPU/CUDA.
- 2025/07/09 — InfiniTrain v0.4.0
  Introduced kernel registration, LLaMA training on CPU/CUDA, BF16 precision, and Data Parallelism.
- 2025/12/31 — InfiniTrain v0.5.0
  Added Autocast, multi-dimensional distributed parallelism (DDP, TP, SP, PP with GPipe / 1F1B / vPP), multi-node training, `no_grad` mode, and communication–computation overlap with bucketed gradient synchronization.