Support splitting eval_dataset when "predict_with_generate" is explicitly set #9604
Conversation
Summary of Changes
Hello @ZiyiTsang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly improves the data handling pipeline, particularly for evaluation datasets. It addresses limitations where the evaluation dataset could not be automatically split from the training data when `predict_with_generate` is enabled.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/gemini review
Code Review
This pull request refactors the dataset loading and splitting logic to better support evaluation with predict_with_generate=True, especially when an explicit evaluation set isn't provided. The changes move the dataset splitting to occur before preprocessing, which allows for different preprocessing steps for training and evaluation data. The split_dataset function has been cleaned up, fixing a redundant call and improving clarity. My review includes suggestions to improve an error message for better user experience and to make the dictionary merging logic more concise. Overall, the changes are a good improvement to the flexibility and correctness of the data handling pipeline.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Code Review
This pull request correctly refactors the dataset loading logic to support splitting an evaluation set from the training data when predict_with_generate is enabled. The changes are logical and improve the data pipeline's flexibility. The refactoring of split_dataset also fixes a latent bug. My review includes a critical fix for a potential KeyError if no training data is provided, along with suggestions to improve code conciseness and error message clarity.
/gemini review
Code Review
This pull request introduces a valuable enhancement by enabling the automatic splitting of a validation set from the training data when predict_with_generate is active. The core of the change involves refactoring the data loading pipeline to split datasets before preprocessing. This correctly applies different processing logic for training and evaluation sets and fixes a bug where a validation set split from training data was incorrectly preprocessed. The accompanying refactor of the split_dataset function in data_utils.py improves code clarity and fixes an inefficient double call to train_test_split. The argument validation in parser.py is also updated to align with these new capabilities. Overall, this is a solid improvement to the data handling logic. I have one suggestion to improve an error message for better user experience.
```diff
     data_args.eval_dataset is None and data_args.val_size < 1e-6
 ):
-    raise ValueError("Please specify dataset for evaluation.")
+    raise ValueError("Please make sure eval_dataset be provided or val_size >1e-6")
```
The error message is a bit ungrammatical and could be clearer for the user. A more explicit message would better guide them on how to configure their evaluation setup.
raise ValueError("An evaluation dataset is required. Please provide `eval_dataset` or set `val_size` to a value greater than 1e-6.")
What does this PR do?
The current data processing logic has two issues:

- `eval_dataset` cannot be auto-split when `predict_with_generate=true` is set, i.e. there is no way to automatically split the training data to obtain a correspondingly processed evaluation dataset. (This may not seem important to most people, but supporting it is always beneficial.)
- In the `split_dataset` function (`src/llamafactory/data/data_utils.py`), the `dataset_dict` in L103 seems to never be used.

What I do
- In `src/llamafactory/data/loader.py`, move `split_dataset` before `_get_preprocessed_dataset`. This way, `eval_dataset` (which can still be None) is split off first and can then be handled differently during preprocessing; see the sketch after this list.
- Refactor `split_dataset` to be clearer.
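A minimal sketch of the reordered flow, assuming a Hugging Face `Dataset` and a `val_size` fraction. The names `split_dataset` and `_get_preprocessed_dataset` come from this PR, but the bodies below are simplified stand-ins, not the actual implementations.

```python
from datasets import Dataset


def split_dataset(dataset: Dataset, val_size: float, seed: int = 42) -> dict:
    # Single call to train_test_split (the old code effectively called it twice).
    splits = dataset.train_test_split(test_size=val_size, seed=seed)
    return {"train_dataset": splits["train"], "eval_dataset": splits["test"]}


def _get_preprocessed_dataset(dataset: Dataset, is_eval: bool) -> Dataset:
    # Stand-in: evaluation data may be processed differently from training
    # data, e.g. for generation-style eval when predict_with_generate is set.
    return dataset.map(lambda example: example)  # placeholder preprocessing


# Split first, then preprocess train/eval with their own logic.
raw = Dataset.from_dict({"text": [f"sample {i}" for i in range(10)]})
module = split_dataset(raw, val_size=0.2)
train_ds = _get_preprocessed_dataset(module["train_dataset"], is_eval=False)
eval_ds = _get_preprocessed_dataset(module["eval_dataset"], is_eval=True)
```

The key design point is the ordering: because the split happens on the raw data, the evaluation slice is no longer forced through the training-side preprocessing.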
Before submitting

All tests passed.