37 commits
f419da7
update response dataset unit test as example
yuki-97 Dec 16, 2025
5b03ff3
split train and val at run_grpo and response_dataset
yuki-97 Dec 16, 2025
5c804ef
update OpenMathInstruct2Dataset
yuki-97 Dec 16, 2025
98385c0
update clevr
yuki-97 Dec 16, 2025
40cb99d
update vlm datasets
yuki-97 Dec 17, 2025
0c733b0
remove clevr_cogent, always to use clevr-cogent
yuki-97 Dec 17, 2025
51bedec
remove openmathinstruct2, always to use OpenMathInstruct-2
yuki-97 Dec 17, 2025
c6a3227
update DAPOMath
yuki-97 Dec 17, 2025
012622d
update DeepScaler
yuki-97 Dec 17, 2025
e24478a
update HelpSteer3
yuki-97 Dec 17, 2025
de116b4
update squad
yuki-97 Dec 17, 2025
f052482
update tulu3
yuki-97 Dec 17, 2025
63fb083
update oasst
yuki-97 Dec 17, 2025
651d075
update oai
yuki-97 Dec 17, 2025
227ce65
lint
yuki-97 Dec 18, 2025
cc14141
pyrefly
yuki-97 Dec 18, 2025
2c014d4
update doc
yuki-97 Dec 18, 2025
3156dbc
fix unit test
yuki-97 Dec 18, 2025
9b27ffc
split run_sft and run_distillation_math (#1656)
RayenTian Dec 19, 2025
7526b4b
update run_grpo_xxx
yuki-97 Dec 19, 2025
33219fb
unify
yuki-97 Dec 19, 2025
d3eb850
fix rebase
yuki-97 Dec 19, 2025
71d23b2
use common func to support split_train_validation
yuki-97 Dec 19, 2025
db53ffb
update doc for split_validation_size
yuki-97 Dec 19, 2025
ddb6541
unify docstring
yuki-97 Dec 20, 2025
9c00f46
fix task_name in oai dataset
yuki-97 Dec 20, 2025
45c11b5
fix functional test
yuki-97 Dec 20, 2025
669bcad
use inherit
yuki-97 Dec 23, 2025
077f8d0
add default dataset config
yuki-97 Dec 22, 2025
ccc5830
update all run_xxx and recipe of response dataset to use default
yuki-97 Dec 23, 2025
ad6c830
support multiple dataset
yuki-97 Dec 22, 2025
f7ccccf
fix missing default
yuki-97 Dec 23, 2025
75f7413
support multiple dataset for other run_xxx
yuki-97 Dec 23, 2025
5835ce7
add functional test
yuki-97 Dec 23, 2025
dac1fe0
support nemor gym config
RayenTian Dec 31, 2025
c0b8cde
support run nemo-gym grpo
RayenTian Jan 2, 2026
d9836a6
unify nemo gym interaface
RayenTian Jan 2, 2026
81 changes: 43 additions & 38 deletions docs/guides/grpo.md
@@ -38,18 +38,34 @@ To support this, we need to know:

#### Dataset

By default, NeMo RL has support for [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py) and [DeepScaler](../../nemo_rl/data/datasets/response_datasets/deepscaler.py) datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
By default, NeMo RL ships with a number of built-in datasets (e.g., [OpenAssistant](../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](../../nemo_rl/data/datasets/response_datasets/squad.py), etc.); you can find the full list [here](../../nemo_rl/data/datasets/response_datasets/__init__.py).
All of these datasets are downloaded from HuggingFace and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.

We provide a [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class that loads JSONL-formatted response datasets from a local path or from Hugging Face. Use `input_key` and `output_key` to specify which fields in your data correspond to the question and the answer, respectively. Here's an example configuration:
```yaml
data:
dataset_name: ResponseDataset
train_data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
val_data_path: <PathToValidationDataset>
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
train_split: <TrainSplit>, default is None # used for HuggingFace datasets
val_split: <ValSplit>, default is None # used for HuggingFace datasets
train:
dataset_name: ResponseDataset
data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
split: <TrainSplit>, default is None # used for HuggingFace datasets
split_validation_size: 0.05 # use 5% of the training data as validation data
validation:
dataset_name: ResponseDataset
data_path: <PathToValidationDataset>
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
split: <ValidationSplit>, default is None # used for HuggingFace datasets
```
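
For reference, here is a minimal sketch of what such a JSONL file could contain, assuming the default `input`/`output` keys (the records and the file name below are purely illustrative):

```python
import json

# Purely illustrative records; replace with your own data and keys.
examples = [
    {"input": "What is 2 + 2?", "output": "4"},
    {"input": "Solve for x: x + 3 = 5.", "output": "x = 2"},
]

# Write one JSON object per line, matching the default input_key/output_key.
with open("dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```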

A single dataset can be used for both training and validation: set `split_validation_size` to the fraction of the training data to hold out for validation.
This feature is currently supported by [OpenAssistant](../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py), and [Tulu3SftMixtureDataset](../../nemo_rl/data/datasets/response_datasets/tulu3.py).
To enable it for your custom datasets or for other built-in datasets, add the following code to the dataset class, as done in [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py):
```python
# `self.val_dataset` is set (not None) only when the current dataset is used for both training and validation
self.val_dataset = None
self.split_train_validation(split_validation_size, seed)
```
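
For intuition, here is a minimal sketch of what such a `split_train_validation` helper might look like on a dataset class. This is an illustration under the assumption that the class wraps a Hugging Face `datasets.Dataset` in `self.dataset`; it is not the exact NeMo RL implementation:

```python
from datasets import Dataset


class MyResponseDataset:
    """Illustrative sketch of a dataset that can split itself into train/validation."""

    def __init__(self, dataset: Dataset, split_validation_size: float = 0.0, seed: int = 42):
        self.dataset = dataset
        # val_dataset is only populated when this dataset serves both roles.
        self.val_dataset = None
        self.split_train_validation(split_validation_size, seed)

    def split_train_validation(self, split_validation_size: float, seed: int) -> None:
        # No split requested: keep val_dataset as None so a separately
        # configured validation dataset (if any) is used instead.
        if not split_validation_size:
            return
        splits = self.dataset.train_test_split(test_size=split_validation_size, seed=seed)
        self.dataset = splits["train"]      # remaining training examples
        self.val_dataset = splits["test"]   # held-out validation examples
```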

#### Common Data Format
@@ -99,21 +115,15 @@ We have an example of this as `math_data_processor` in [processors.py](../../nem
Example (simplified):

```python
# task_spec
default_task_spec = TaskDataSpec(
task_name="math_default",
prompt_file=data_config["prompt_file"],
system_prompt_file=data_config["system_prompt_file"],
)

task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
lambda: (default_task_spec, math_hf_data_processor)
)

# Resolve task_name from dataset or spec
task_spec = data.task_spec
task_name = data.task_name
assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
# task_data_processors
task_data_processors = {data.task_name: (data.task_spec, data.processor)}
```
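
If you configure more than one dataset (for example with the multiple-dataset support shown in `grpo_multiple_datasets.yaml` later in this change), each loaded dataset simply contributes its own entry to the same mapping. A rough, illustrative sketch — `loaded_datasets` here is a hypothetical list of objects returned by `load_response_dataset`:

```python
# Illustrative only: one (task_spec, processor) entry per configured dataset.
task_data_processors = {}
for d in loaded_datasets:  # hypothetical list, e.g. one per item under data.train
    task_data_processors[d.task_name] = (d.task_spec, d.processor)
```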

#### Putting It All Together
@@ -139,39 +149,34 @@ default_task_spec = TaskDataSpec(
system_prompt_file=data_config["system_prompt_file"],
)

# 3) Define default processor mapping
task_data_processors: dict[str, tuple[TaskDataSpec, TaskDataProcessFnCallable]] = defaultdict(
lambda: (default_task_spec, math_hf_data_processor)
)
# 3) Load dataset using the helper (built-ins or local/HF datasets)
data = load_response_dataset(data_config["train"], seed)

# 4) Load dataset using the helper (built-ins or local/HF datasets)
data = load_response_dataset(data_config, seed)
# 4) Build task_data_processors mapping
task_data_processors = {data.task_name: (data.task_spec, data.processor)}

# 5) Resolve task spec/name and ensure dataset provides a processor
task_spec = data.task_spec
task_name = data.task_name
assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)

# 6) Construct processed datasets (train and optional validation)
# 5) Construct processed dataset
dataset = AllTaskProcessedDataset(
data.formatted_ds["train"],
data.dataset,
tokenizer,
default_task_spec,
task_data_processors,
max_seq_length=data_config["max_input_seq_length"],
)
val_dataset = (
AllTaskProcessedDataset(
data.formatted_ds["validation"],

# 6) Do the same thing for validation dataset if it exists
if data_config["validation"] is not None:
val_data = load_response_dataset(data_config["validation"], seed)

val_task_data_processors = {val_data.task_name: (val_data.task_spec, val_data.processor)}

val_dataset = AllTaskProcessedDataset(
val_data.dataset,
tokenizer,
default_task_spec,
task_data_processors,
val_task_data_processors,
max_seq_length=data_config["max_input_seq_length"],
)
if data.formatted_ds["validation"]
else None
)
```

Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples.
52 changes: 35 additions & 17 deletions docs/guides/sft.md
@@ -37,7 +37,7 @@ SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is e
SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](../design-docs/chat-datasets.md) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [response_datasets/squad.py](../../nemo_rl/data/datasets/response_datasets/squad.py) has an example:

```python
def format_squad(data):
def format_data(self, data: dict[str, Any]) -> dict[str, Any]:
return {
"messages": [
{
@@ -71,18 +71,34 @@ NeMo RL SFT uses HuggingFace chat templates to format the individual examples. T
custom_template: "{% for message in messages %}{%- if message['role'] == 'system' %}{{'Context: ' + message['content'].strip()}}{%- elif message['role'] == 'user' %}{{' Question: ' + message['content'].strip() + ' Answer: '}}{%- elif message['role'] == 'assistant' %}{{message['content'].strip()}}{%- endif %}{% endfor %}"
```

By default, NeMo RL has support for [OpenAssistant](../../nemo_rl/data/datasets/response_datasets/oasst.py), [Squad](../../nemo_rl/data/datasets/response_datasets/squad.py) and [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py) datasets. All of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
By default, NeMo RL ships with a number of built-in datasets (e.g., [OpenAssistant](../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](../../nemo_rl/data/datasets/response_datasets/squad.py), etc.); you can find the full list [here](../../nemo_rl/data/datasets/response_datasets/__init__.py).
All of these datasets are downloaded from HuggingFace and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.

We provide a [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class that loads JSONL-formatted response datasets from a local path or from HuggingFace. Use `input_key` and `output_key` to specify which fields in your data correspond to the question and the answer, respectively. Here's an example configuration:
```yaml
data:
dataset_name: ResponseDataset
train_data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
val_data_path: <PathToValidationDataset>
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
train_split: <TrainSplit>, default is None # used for HuggingFace datasets
val_split: <ValSplit>, default is None # used for HuggingFace datasets
train:
dataset_name: ResponseDataset
data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
split: <TrainSplit>, default is None # used for HuggingFace datasets
split_validation_size: 0.05 # use 5% of the training data as validation data
validation:
dataset_name: ResponseDataset
data_path: <PathToValidationDataset>
input_key: <QuestionKey>, default is "input"
output_key: <AnswerKey>, default is "output"
split: <ValidationSplit>, default is None # used for HuggingFace datasets
```

A single dataset can be used for both training and validation: set `split_validation_size` to the fraction of the training data to hold out for validation.
This feature is currently supported by [OpenAssistant](../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py), and [Tulu3SftMixtureDataset](../../nemo_rl/data/datasets/response_datasets/tulu3.py).
To enable it for your custom datasets or for other built-in datasets, add the following code to the dataset class, as done in [ResponseDataset](../../nemo_rl/data/datasets/response_datasets/response_dataset.py):
```python
# `self.val_dataset` is set (not None) only when the current dataset is used for both training and validation
self.val_dataset = None
self.split_train_validation(split_validation_size, seed)
```

### OpenAI Format Datasets (with Tool Calling Support)
@@ -95,14 +111,16 @@ To use an OpenAI format dataset, configure your YAML as follows:

```yaml
data:
dataset_name: openai_format
train_data_path: "/path/to/train.jsonl" # Path to training data
val_data_path: "/path/to/val.jsonl" # Path to validation data
chat_key: "messages" # Key for messages in the data (default: "messages")
system_key: null # Key for system message in the data (optional)
system_prompt: null # Default system prompt if not in data (optional)
tool_key: "tools" # Key for tools in the data (default: "tools")
use_preserving_dataset: false # Set to true for heterogeneous tool schemas (see below)
train:
dataset_name: openai_format
data_path: <PathToTrainingDataset> # Path to training data
chat_key: "messages" # Key for messages in the data (default: "messages")
system_key: null # Key for system message in the data (optional)
system_prompt: null # Default system prompt if not in data (optional)
tool_key: "tools" # Key for tools in the data (default: "tools")
use_preserving_dataset: false # Set to true for heterogeneous tool schemas (see below)
validation:
...
```
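
For illustration, one line of such a JSONL file might look roughly like the following. This is a hand-written example assuming the default `messages` and `tools` keys; refer to the Data Format section below for the authoritative schema:

```python
import json

# Hand-written illustrative example: a short chat plus one tool definition.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# Each training example is one JSON object per line.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```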

#### Data Format
21 changes: 15 additions & 6 deletions examples/configs/distillation_math.yaml
@@ -206,11 +206,20 @@ teacher:

data:
max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
dataset_name: "DeepScaler"
shuffle: true

# dataset
train:
dataset_name: DeepScaler
validation:
dataset_name: AIME2024
repeat: 16
# default settings for all datasets
default:
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
env_name: "math"

env:
math:
num_workers: 8
Expand All @@ -225,12 +234,12 @@ logger:
monitor_gpus: true
wandb:
project: "nemo-distillation"
name: "distillation-${data.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
name: "distillation-${data.train.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
swanlab:
project: "nemo-distillation"
name: "distillation-${data.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
name: "distillation-${data.train.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
tensorboard:
log_dir: "tb_logs-distillation-${data.dataset_name}"
log_dir: "tb_logs-distillation-${data.train.dataset_name}"
mlflow:
experiment_name: "distillation-dev"
run_name: "distillation-math-cl-logger"
6 changes: 3 additions & 3 deletions examples/configs/distillation_math_megatron.yaml
@@ -147,11 +147,11 @@ logger:
wandb_enabled: true
wandb:
project: "nemo-distillation"
name: "distillation-megatron-${data.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
name: "distillation-megatron-${data.train.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
tensorboard:
log_dir: "tb_logs-distillation-megatron-${data.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
log_dir: "tb_logs-distillation-megatron-${data.train.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
mlflow:
run_name: "distillation-math-megatron-${data.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"
run_name: "distillation-math-megatron-${data.train.dataset_name}-${teacher.model_name}-${policy.model_name}-${loss_fn.kl_type}-${distillation.topk_logits_k}"

cluster:
gpus_per_node: 8
31 changes: 21 additions & 10 deletions examples/configs/grpo_math_1B.yaml
@@ -246,22 +246,33 @@ policy:

data:
max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
shuffle: true
num_workers: 1
processor: "math_hf_data_processor"
env_name: "math"
dataset_name: "OpenMathInstruct-2"

# dataset
train:
dataset_name: OpenMathInstruct-2
split_validation_size: 0.05 # use 5% of the training data as validation data
validation: null
# default settings for all datasets
default:
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
processor: "math_hf_data_processor"
env_name: "math"
# You can use custom response datasets for training and validation. For example:
# data:
# train:
# dataset_name: ResponseDataset
# data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
# input_key: <QuestionKey>, default is "input"
# output_key: <AnswerKey>, default is "output"
# split: <TrainSplit>, default is None # used for HuggingFace datasets
# validation:
# dataset_name: ResponseDataset
# train_data_path: <PathToTrainingDataset> # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
# val_data_path: <PathToValidationDataset>
# data_path: <PathToValidationDataset>
# input_key: <QuestionKey>, default is "input"
# output_key: <AnswerKey>, default is "output"
# train_split: <TrainSplit>, default is None # used for HuggingFace datasets
# val_split: <ValSplit>, default is None # used for HuggingFace datasets
# split: <ValidationSplit>, default is None # used for HuggingFace datasets
# See https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/grpo.md#datasets for more details.

env:
7 changes: 0 additions & 7 deletions examples/configs/grpo_math_1B_megatron.yaml
Expand Up @@ -157,13 +157,6 @@ policy:
gpu_memory_utilization: 0.6
max_model_len: ${policy.max_total_sequence_length}

data:
max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
dataset_name: "OpenMathInstruct-2"
shuffle: true

env:
math:
num_workers: 8
26 changes: 26 additions & 0 deletions examples/configs/grpo_multiple_datasets.yaml
@@ -0,0 +1,26 @@
# GRPO Algorithm Configuration
defaults: "grpo_math_1B.yaml"

data:
_override_: true # override the data config instead of merging with it

max_input_seq_length: ${policy.max_total_sequence_length} # upper bound, real truncation occurs at vllm.max_model_len
shuffle: true
num_workers: 1

# dataset
train:
- dataset_name: OpenMathInstruct-2
split_validation_size: 0.05
- dataset_name: DeepScaler
validation:
- dataset_name: AIME2024
repeat: 16
- dataset_name: DAPOMathAIME2024

# default settings for all datasets
default:
prompt_file: "examples/prompts/cot.txt"
system_prompt_file: null
processor: "math_hf_data_processor"
env_name: "math"
3 changes: 2 additions & 1 deletion examples/configs/grpo_rm_1B.yaml
@@ -2,7 +2,8 @@
defaults: "grpo_math_1B.yaml"

data:
env_name: "reward_model"
default:
env_name: "reward_model"

env:
reward_model:
2 changes: 1 addition & 1 deletion examples/configs/grpo_sliding_puzzle.yaml
@@ -77,4 +77,4 @@ logger:
run_name: "grpo-dev-sliding_puzzle"
gpu_monitoring:
collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)
flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)
8 changes: 6 additions & 2 deletions examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml
@@ -82,8 +82,12 @@ policy:
enforce_eager: true
data:
max_input_seq_length: 2048
prompt_file: null
dataset_name: DAPOMath17K
train:
dataset_name: DAPOMath17K
validation:
dataset_name: DAPOMathAIME2024
default:
prompt_file: null
env:
math:
num_workers: 16
@@ -39,8 +39,12 @@ policy:
async_engine: true
tensor_parallel_size: 32
data:
prompt_file: null
dataset_name: DAPOMath17K
train:
dataset_name: DAPOMath17K
validation:
dataset_name: DAPOMathAIME2024
default:
prompt_file: null
logger:
monitor_gpus: false
wandb: