Bug fixes to trajectory-wise trainer #60
This fixes several bugs in non-step-wise training. Some of these bugs took me several hours to catch and are quite severe and low-level, which makes it hard to see why the fixes are needed, so I have tried to explain them here so that your time is not spent on the same job.
Loss masking does not work correctly. I analysed the training data at the token level using the attached analysis scripts (very painful to do, but important for catching this bug) and found the following:
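For reference, here is a minimal sketch of the kind of token-level inspection involved (this is not the attached script; `input_ids`, `loss_mask`, and the Qwen tokenizer name are assumptions). Decoding each token and marking whether it contributes to the loss makes any prompt or tool-output tokens that were left unmasked immediately visible:

```python
from transformers import AutoTokenizer

def dump_loss_mask(input_ids, loss_mask, tokenizer_name="Qwen/Qwen2.5-7B-Instruct"):
    """Print one sample with trained-on tokens wrapped in [[...]]."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    pieces = []
    for token_id, mask in zip(input_ids, loss_mask):
        text = tok.decode([token_id])
        # Tokens with mask == 1 contribute to the loss; everything else
        # (prompt, tool output, padding) should be masked out.
        pieces.append(f"[[{text}]]" if mask else text)
    print("".join(pieces))
```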
Reward implementation bug: do not give reward = 1 if the ground truth is empty when computing F1 scores.
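A minimal sketch of the intended behaviour (a hypothetical helper, not the repo's exact reward code): an empty ground truth should yield 0, not a degenerate "perfect" F1 of 1.

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # Bug being fixed: an empty ground truth used to short-circuit to reward 1.0,
    # handing out free reward; it should be 0.0 instead.
    if not gt_tokens or not pred_tokens:
        return 0.0
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```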
Fixes SkyRL bug NovaSky-AI/SkyRL#796. For now, please install SkyRL from my patched code (the reasoning parser will still not work, but instruct models do work; I am testing the fix for the reasoning parser right now).
Chat template issue: add_generation_prompt must be true, otherwise the agent has to generate the <|im_start|>assistant tokens itself, which is non-ideal.
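For illustration, assuming a HF tokenizer with a ChatML-style template (the Qwen model name is just an example): without add_generation_prompt=True the prompt ends after the last user turn and the model has to emit <|im_start|>assistant on its own.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "List the failing tests in this repo."}]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends "<|im_start|>assistant\n" to the prompt
)
print(prompt)
```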
Adds exception handling to the openhands agent to selectively get a message list. This can be helpful in the future if we want to penalize trajectories that resulted in errors but not those that were terminated by LLM context-window errors. Currently all errored trajectories are given reward 0.
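A hedged sketch of the idea (the exception type and run_agent are hypothetical, not the actual openhands API): keep whatever partial message list is available and tag context-window terminations separately from real errors, so a future reward scheme can distinguish them.

```python
class ContextWindowExceededError(Exception):
    """Placeholder for whatever error the LLM client raises on context overflow."""

def rollout_with_recovery(run_agent, task):
    messages, error_kind = [], None
    try:
        messages = run_agent(task)  # full message list on success
    except ContextWindowExceededError as err:
        messages = getattr(err, "partial_messages", [])
        error_kind = "context_window"
    except Exception as err:
        messages = getattr(err, "partial_messages", [])
        error_kind = "agent_error"
    # For now both error kinds map to reward 0, matching current behaviour;
    # the tag makes it easy to treat them differently later.
    reward = 0.0 if error_kind else None
    return messages, error_kind, reward
```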
Fixes dataset processing to use the HF datasets I prepared and makes the corresponding changes to the code.
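Roughly what the loading side looks like (the dataset id below is a placeholder, since the actual repo id is not named here):

```python
from datasets import load_dataset

# "your-org/swe-train-trajectories" is a placeholder; substitute the prepared dataset id.
ds = load_dataset("your-org/swe-train-trajectories", split="train")
print(ds.column_names, len(ds))
```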
Cherry-picks some fixes from main that were not merged here.
More analysis from trajectories: the earlier run, without these fixes, had tool calls dropping to 0 together with a sinusoidal reward curve. The reason is that the model suddenly generates tool calls with incorrect formats; these are interpreted as message events (not action events), which causes the trajectory to terminate without a single tool call and with 0 reward.
Some of the above fixes will help with these issues. There appear to be incorrectly formatted trajectories that were given positive rewards, and the number of such trajectories suddenly increases at a later stage of training (see the plots below):
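A small diagnostic along these lines (the trajectory schema is an assumption, not the repo's exact format) makes the pathology easy to count: positive-reward trajectories that never produced a single valid action event.

```python
def find_suspicious_trajectories(trajectories):
    """Return indices of positive-reward trajectories that contain no action events."""
    flagged = []
    for idx, traj in enumerate(trajectories):
        n_actions = sum(1 for ev in traj["events"] if ev.get("type") == "action")
        if traj.get("reward", 0) > 0 and n_actions == 0:
            flagged.append(idx)
    return flagged
```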