
Conversation


@ZiyiTsang ZiyiTsang commented Dec 11, 2025

What does this PR do?

The current preprocessing logic has two issues:

  • You MUST provide eval_dataset when predict_with_generate=true is set; there is no way to auto-split the dataset to obtain a corresponding processed eval set. (This may not matter to most users, but supporting it is still beneficial.)
  • In the split_dataset function (src/llamafactory/data/data_utils.py), the dataset_dict created at L103 appears to never be used.

What I do

  • [MAIN CHANGE] In src/llamafactory/data/loader.py, move split_dataset before _get_preprocessed_dataset. This ensures eval_dataset (which may be None) is split off first and handled differently:
    • train_dataset uses `SupervisedDatasetProcessor` and eval_dataset uses `UnsupervisedDatasetProcessor`
    • the current code forces both halves of the auto-split through the same processor, which is illogical.
  • [MINOR CHANGE] Rewrite split_dataset to be clearer.
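The reordering described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual llamafactory code: the processor class names come from the PR description, while `split_dataset` and `get_dataset` here are stand-ins for the real helpers in loader.py.

```python
import random

def split_dataset(examples, val_size, seed=42):
    # Split the RAW examples before any preprocessing happens.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * val_size)
    return shuffled[n_eval:], shuffled[:n_eval]

class SupervisedDatasetProcessor:
    # Training examples keep both the prompt and the response.
    def process(self, example):
        return {"prompt": example["prompt"], "response": example["response"]}

class UnsupervisedDatasetProcessor:
    # Generation-style evaluation only needs the prompt.
    def process(self, example):
        return {"prompt": example["prompt"]}

def get_dataset(raw_examples, val_size):
    # Key point of the PR: split FIRST, then apply a different
    # processor to each half.
    train_raw, eval_raw = split_dataset(raw_examples, val_size)
    train_dataset = [SupervisedDatasetProcessor().process(ex) for ex in train_raw]
    eval_dataset = [UnsupervisedDatasetProcessor().process(ex) for ex in eval_raw]
    return train_dataset, eval_dataset
```

With the old ordering, both halves would have gone through the same supervised processor; splitting first lets the eval half drop its responses for generation.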

Before submitting

All tests passed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZiyiTsang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the data handling pipeline, particularly for evaluation datasets. It addresses limitations where eval_dataset was strictly required with predict_with_generate and streamlines the dataset splitting and preprocessing workflow. By reordering operations and refining the split_dataset function, the changes enable more flexible and appropriate processing of training and evaluation data, ultimately enhancing the robustness and usability of the data loading mechanism.

Highlights

  • Enhanced Dataset Splitting: The split_dataset function has been refactored to return separate dictionaries for training and evaluation data, improving clarity and flexibility in data handling.
  • Flexible Evaluation Dataset Handling: The system now supports automatic splitting of an evaluation dataset from the main dataset, even when predict_with_generate is enabled, removing the previous requirement for an explicit eval_dataset.
  • Improved Data Processing Order: The split_dataset operation now occurs before preprocessing, allowing for distinct processing pipelines (e.g., supervised vs. unsupervised) for training and evaluation datasets.
  • Refined Validation Logic: The argument parsing now correctly validates the presence of an evaluation dataset or val_size when predict_with_generate is active, with a clearer error message.
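The first highlight, split_dataset returning separate train/eval data from a single split, could look roughly like this. The function body is an assumed sketch (the real version operates on Hugging Face `datasets` objects); only the contract of one split call returning both parts is taken from the PR.

```python
import random

def split_dataset(dataset, val_size, seed=0):
    # One shuffle-and-slice pass: both splits come from the same call,
    # so there is no second, redundant train_test_split invocation.
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n_eval = int(round(len(shuffled) * val_size))
    return {
        "train_dataset": shuffled[n_eval:],
        "eval_dataset": shuffled[:n_eval] or None,  # None when val_size == 0
    }
```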

@ZiyiTsang
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the dataset loading and splitting logic to better support evaluation with predict_with_generate=True, especially when an explicit evaluation set isn't provided. The changes move the dataset splitting to occur before preprocessing, which allows for different preprocessing steps for training and evaluation data. The split_dataset function has been cleaned up, fixing a redundant call and improving clarity. My review includes suggestions to improve an error message for better user experience and to make the dictionary merging logic more concise. Overall, the changes are a good improvement to the flexibility and correctness of the data handling pipeline.

@ZiyiTsang ZiyiTsang marked this pull request as ready for review December 11, 2025 09:20
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly refactors the dataset loading logic to support splitting an evaluation set from the training data when predict_with_generate is enabled. The changes are logical and improve the data pipeline's flexibility. The refactoring of split_dataset also fixes a latent bug. My review includes a critical fix for a potential KeyError if no training data is provided, along with suggestions to improve code conciseness and error message clarity.
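The potential KeyError the review flags can be avoided with defensive dictionary access. The split names and helper below are illustrative, not taken from the actual patch; the point is simply that `.get()` yields None for a missing split instead of raising.

```python
def unpack_splits(dataset_dict):
    # .get() returns None when a split is absent instead of raising
    # KeyError, covering prediction-only runs with no training data.
    train_dataset = dataset_dict.get("train")
    eval_dataset = dataset_dict.get("validation")
    return train_dataset, eval_dataset
```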

@ZiyiTsang
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable enhancement by enabling the automatic splitting of a validation set from the training data when predict_with_generate is active. The core of the change involves refactoring the data loading pipeline to split datasets before preprocessing. This correctly applies different processing logic for training and evaluation sets and fixes a bug where a validation set split from training data was incorrectly preprocessed. The accompanying refactor of the split_dataset function in data_utils.py improves code clarity and fixes an inefficient double call to train_test_split. The argument validation in parser.py is also updated to align with these new capabilities. Overall, this is a solid improvement to the data handling logic. I have one suggestion to improve an error message for better user experience.

The relevant hunk (old message removed, new message added):

      data_args.eval_dataset is None and data_args.val_size < 1e-6
  ):
  -   raise ValueError("Please specify dataset for evaluation.")
  +   raise ValueError("Please make sure eval_dataset be provided or val_size >1e-6")
Severity: medium

The error message is a bit ungrammatical and could be clearer for the user. A more explicit message would better guide them on how to configure their evaluation setup.

raise ValueError("An evaluation dataset is required. Please provide `eval_dataset` or set `val_size` to a value greater than 1e-6.")
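The validation rule the suggestion targets can be sketched as below. The `DataArguments` dataclass and `check_eval_args` wrapper are illustrative stand-ins for the parser.py logic; the field names and the 1e-6 threshold follow the hunk above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataArguments:
    eval_dataset: Optional[str] = None
    val_size: float = 0.0

def check_eval_args(data_args: DataArguments, predict_with_generate: bool) -> None:
    # With predict_with_generate enabled, evaluation data must come from
    # somewhere: either an explicit eval_dataset or a non-trivial val_size.
    if predict_with_generate and data_args.eval_dataset is None and data_args.val_size < 1e-6:
        raise ValueError(
            "An evaluation dataset is required. Please provide `eval_dataset` "
            "or set `val_size` to a value greater than 1e-6."
        )
```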
