
Conversation


@ZiyiTsang ZiyiTsang commented Dec 11, 2025

What does this PR do?

The current preprocessing logic has two issues:

  • You MUST provide eval_dataset when predict_with_generate=true is set; there is no way to auto-split the dataset to obtain a corresponding processed eval set. (This may not matter to most users, but supporting it is still beneficial.)
  • In the split_dataset function (src/llamafactory/data/data_utils.py), the dataset_dict created at L103 appears to never be used.

What I do

  • [MAIN CHANGE] In src/llamafactory/data/loader.py, move split_dataset before _get_preprocessed_dataset. This ensures eval_dataset (which may be None) is split off first and handled differently:
    • train_dataset uses `SupervisedDatasetProcessor` and eval_dataset uses `UnsupervisedDatasetProcessor`
    • the current code forces both halves of the auto-split through the same processor, which is illogical.
  • [MINOR CHANGE] Rewrite split_dataset to be clearer.
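The reordering described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual llamafactory code: the processor class names come from the PR description, while `split_dataset` and `get_dataset` here are stand-ins for the real helpers in loader.py.

```python
import random

def split_dataset(examples, val_size, seed=42):
    # Split the RAW examples before any preprocessing happens.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * val_size)
    return shuffled[n_eval:], shuffled[:n_eval]

class SupervisedDatasetProcessor:
    # Training examples keep both the prompt and the response.
    def process(self, example):
        return {"prompt": example["prompt"], "response": example["response"]}

class UnsupervisedDatasetProcessor:
    # Generation-style evaluation only needs the prompt.
    def process(self, example):
        return {"prompt": example["prompt"]}

def get_dataset(raw_examples, val_size):
    # Key point of the PR: split FIRST, then apply a different
    # processor to each half.
    train_raw, eval_raw = split_dataset(raw_examples, val_size)
    train_dataset = [SupervisedDatasetProcessor().process(ex) for ex in train_raw]
    eval_dataset = [UnsupervisedDatasetProcessor().process(ex) for ex in eval_raw]
    return train_dataset, eval_dataset
```

With the old ordering, both halves would have gone through the same supervised processor; splitting first lets the eval half drop its responses for generation.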

Before submitting

All tests passed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZiyiTsang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the data handling pipeline, particularly for evaluation datasets. It addresses limitations where eval_dataset was strictly required with predict_with_generate and streamlines the dataset splitting and preprocessing workflow. By reordering operations and refining the split_dataset function, the changes enable more flexible and appropriate processing of training and evaluation data, ultimately enhancing the robustness and usability of the data loading mechanism.

Highlights

  • Enhanced Dataset Splitting: The split_dataset function has been refactored to return separate dictionaries for training and evaluation data, improving clarity and flexibility in data handling.
  • Flexible Evaluation Dataset Handling: The system now supports automatic splitting of an evaluation dataset from the main dataset, even when predict_with_generate is enabled, removing the previous requirement for an explicit eval_dataset.
  • Improved Data Processing Order: The split_dataset operation now occurs before preprocessing, allowing for distinct processing pipelines (e.g., supervised vs. unsupervised) for training and evaluation datasets.
  • Refined Validation Logic: The argument parsing now correctly validates the presence of an evaluation dataset or val_size when predict_with_generate is active, with a clearer error message.
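The first highlight, split_dataset returning separate train/eval data from a single split, could look roughly like this. The function body is an assumed sketch (the real version operates on Hugging Face `datasets` objects); only the contract of one split call returning both parts is taken from the PR.

```python
import random

def split_dataset(dataset, val_size, seed=0):
    # One shuffle-and-slice pass: both splits come from the same call,
    # so there is no second, redundant train_test_split invocation.
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n_eval = int(round(len(shuffled) * val_size))
    return {
        "train_dataset": shuffled[n_eval:],
        "eval_dataset": shuffled[:n_eval] or None,  # None when val_size == 0
    }
```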

@ZiyiTsang
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the dataset loading and splitting logic to better support evaluation with predict_with_generate=True, especially when an explicit evaluation set isn't provided. The changes move the dataset splitting to occur before preprocessing, which allows for different preprocessing steps for training and evaluation data. The split_dataset function has been cleaned up, fixing a redundant call and improving clarity. My review includes suggestions to improve an error message for better user experience and to make the dictionary merging logic more concise. Overall, the changes are a good improvement to the flexibility and correctness of the data handling pipeline.

@ZiyiTsang ZiyiTsang marked this pull request as ready for review December 11, 2025 09:20
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly refactors the dataset loading logic to support splitting an evaluation set from the training data when predict_with_generate is enabled. The changes are logical and improve the data pipeline's flexibility. The refactoring of split_dataset also fixes a latent bug. My review includes a critical fix for a potential KeyError if no training data is provided, along with suggestions to improve code conciseness and error message clarity.
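The potential KeyError the review flags can be avoided with defensive dictionary access. The split names and helper below are illustrative, not taken from the actual patch; the point is simply that `.get()` yields None for a missing split instead of raising.

```python
def unpack_splits(dataset_dict):
    # .get() returns None when a split is absent instead of raising
    # KeyError, covering prediction-only runs with no training data.
    train_dataset = dataset_dict.get("train")
    eval_dataset = dataset_dict.get("validation")
    return train_dataset, eval_dataset
```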

@ZiyiTsang
Author

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable enhancement by enabling the automatic splitting of a validation set from the training data when predict_with_generate is active. The core of the change involves refactoring the data loading pipeline to split datasets before preprocessing. This correctly applies different processing logic for training and evaluation sets and fixes a bug where a validation set split from training data was incorrectly preprocessed. The accompanying refactor of the split_dataset function in data_utils.py improves code clarity and fixes an inefficient double call to train_test_split. The argument validation in parser.py is also updated to align with these new capabilities. Overall, this is a solid improvement to the data handling logic. I have one suggestion to improve an error message for better user experience.

The relevant hunk (old message removed, new message added):

      data_args.eval_dataset is None and data_args.val_size < 1e-6
  ):
  -   raise ValueError("Please specify dataset for evaluation.")
  +   raise ValueError("Please make sure eval_dataset be provided or val_size >1e-6")
Severity: medium

The error message is a bit ungrammatical and could be clearer for the user. A more explicit message would better guide them on how to configure their evaluation setup.

raise ValueError("An evaluation dataset is required. Please provide `eval_dataset` or set `val_size` to a value greater than 1e-6.")
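The validation rule the suggestion targets can be sketched as below. The `DataArguments` dataclass and `check_eval_args` wrapper are illustrative stand-ins for the parser.py logic; the field names and the 1e-6 threshold follow the hunk above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataArguments:
    eval_dataset: Optional[str] = None
    val_size: float = 0.0

def check_eval_args(data_args: DataArguments, predict_with_generate: bool) -> None:
    # With predict_with_generate enabled, evaluation data must come from
    # somewhere: either an explicit eval_dataset or a non-trivial val_size.
    if predict_with_generate and data_args.eval_dataset is None and data_args.val_size < 1e-6:
        raise ValueError(
            "An evaluation dataset is required. Please provide `eval_dataset` "
            "or set `val_size` to a value greater than 1e-6."
        )
```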
