Bug fixes to trajectory-wise trainer #60
This fixes several bugs in non-step-wise training. Some of these bugs took me several hours to catch and are quite severe and low-level, which makes it hard to see why the fixes are needed, so I have tried to explain them here so that your time is not spent on the same job.
Loss masking does not work correctly. I analysed the training data at the token level using the attached analysis scripts (very painful to do, but important for catching this bug) and found the following:
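For reference, here is a minimal sketch of the kind of token-level inspection involved (this is not the attached script; `input_ids`, `loss_mask`, and the Qwen tokenizer name are assumptions). Decoding each token and marking whether it contributes to the loss makes any prompt or tool-output tokens that were left unmasked immediately visible:

```python
from transformers import AutoTokenizer

def dump_loss_mask(input_ids, loss_mask, tokenizer_name="Qwen/Qwen2.5-7B-Instruct"):
    """Print one sample with trained-on tokens wrapped in [[...]]."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    pieces = []
    for token_id, mask in zip(input_ids, loss_mask):
        text = tok.decode([token_id])
        # Tokens with mask == 1 contribute to the loss; everything else
        # (prompt, tool output, padding) should be masked out.
        pieces.append(f"[[{text}]]" if mask else text)
    print("".join(pieces))
```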
Reward implementation bug: do not give reward = 1 if the ground truth is empty when computing F1 scores.
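A minimal sketch of the intended behaviour (a hypothetical helper, not the repo's exact reward code): an empty ground truth should yield 0, not a degenerate "perfect" F1 of 1.

```python
from collections import Counter

def f1_reward(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.split()
    gt_tokens = ground_truth.split()
    # Bug being fixed: an empty ground truth used to short-circuit to reward 1.0,
    # handing out free reward; it should be 0.0 instead.
    if not gt_tokens or not pred_tokens:
        return 0.0
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```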
Fixes SkyRL bug NovaSky-AI/SkyRL#796. For now, please install SkyRL from my patched code (the reasoning parser will still not work, but instruct models do work; I am testing the fix for the reasoning parser right now).
Chat template issue: add_generation_prompt must be true, otherwise the agent has to generate the <|im_start|>assistant tokens itself, which is non-ideal.
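For illustration, assuming a HF tokenizer with a ChatML-style template (the Qwen model name is just an example): without add_generation_prompt=True the prompt ends after the last user turn and the model has to emit <|im_start|>assistant on its own.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "List the failing tests in this repo."}]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends "<|im_start|>assistant\n" to the prompt
)
print(prompt)
```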
Adds exception handling to the openhands agent to selectively get a message list. This can be helpful in the future if we want to penalize trajectories that resulted in errors but not those that were terminated by LLM context-window errors. Currently all errored trajectories are given reward 0.
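A hedged sketch of the idea (the exception type and run_agent are hypothetical, not the actual openhands API): keep whatever partial message list is available and tag context-window terminations separately from real errors, so a future reward scheme can distinguish them.

```python
class ContextWindowExceededError(Exception):
    """Placeholder for whatever error the LLM client raises on context overflow."""

def rollout_with_recovery(run_agent, task):
    messages, error_kind = [], None
    try:
        messages = run_agent(task)  # full message list on success
    except ContextWindowExceededError as err:
        messages = getattr(err, "partial_messages", [])
        error_kind = "context_window"
    except Exception as err:
        messages = getattr(err, "partial_messages", [])
        error_kind = "agent_error"
    # For now both error kinds map to reward 0, matching current behaviour;
    # the tag makes it easy to treat them differently later.
    reward = 0.0 if error_kind else None
    return messages, error_kind, reward
```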
Fixes dataset processing to use the HF datasets I prepared and makes the corresponding changes to the code.
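Roughly what the loading side looks like (the dataset id below is a placeholder, since the actual repo id is not named here):

```python
from datasets import load_dataset

# "your-org/swe-train-trajectories" is a placeholder; substitute the prepared dataset id.
ds = load_dataset("your-org/swe-train-trajectories", split="train")
print(ds.column_names, len(ds))
```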
Cherry-picks some fixes from main that were not merged here.
More analysis from trajectories: the earlier run, without these fixes, had tool calls dropping to 0 together with a sinusoidal reward curve. The reason is that the model suddenly generates tool calls with incorrect formats; these are interpreted as message events (not action events), which causes the trajectory to terminate without a single tool call and with 0 reward.
Some of the above fixes will help with these issues. There appear to be incorrectly formatted trajectories that were given positive rewards, and the number of such trajectories suddenly increases at a later stage of training (see the plots below):
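A small diagnostic along these lines (the trajectory schema is an assumption, not the repo's exact format) makes the pathology easy to count: positive-reward trajectories that never produced a single valid action event.

```python
def find_suspicious_trajectories(trajectories):
    """Return indices of positive-reward trajectories that contain no action events."""
    flagged = []
    for idx, traj in enumerate(trajectories):
        n_actions = sum(1 for ev in traj["events"] if ev.get("type") == "action")
        if traj.get("reward", 0) > 0 and n_actions == 0:
            flagged.append(idx)
    return flagged
```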