
Conversation

@adityasoni9998 (Collaborator) commented Jan 4, 2026

This fixes several bugs in non-step-wise training. Some of these bugs took me several hours to catch: they are severe but low-level, which made it very difficult to see why the fixes are needed, so I have tried to explain each one here to save you from repeating that work.

Loss masking does not work correctly. I analysed the training data at the token level using the attached analysis scripts (very painful to do, but important for catching the bug) and found the following (a minimal tokenizer check reproducing the problem is sketched after this list):

  1. The string that separates turns is \n<|im_start|>assistant\n, not <|im_start|>assistant, so it is tokenized into four tokens rather than two (the same holds for the system and user roles).
  2. Since add_generation_prompt is true, the string \n<|im_start|>assistant\n is already present in the prompt by default. The moment we hit the assistant token, we assume all subsequent tokens are generated tokens and do not mask their loss, so the fourth token of this separator (still part of the prompt) is left unmasked.
  3. After the model generates <|im_end|>, the chat template appends \n<|im_start|>user\n after it. The first token of this string is not masked, because we have not yet hit <|im_start|> in the loop, so we train the model to generate a newline after <|im_end|>, which is very wrong. When I generated with Qwen3 and analyzed the token ids, there were no token ids present after <|im_end|>.
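Here is a minimal sketch of the token-level check behind item 1 (the model name is illustrative; any Qwen-style ChatML tokenizer behaves the same way):

```python
# Show that the turn separator "\n<|im_start|>assistant\n" is four tokens,
# not two, so a mask that scans for a single <|im_start|> id mis-handles
# the surrounding "\n" tokens at turn boundaries.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

sep = "\n<|im_start|>assistant\n"
ids = tok.encode(sep, add_special_tokens=False)
print(len(ids))                          # 4, not 2
print([tok.decode([i]) for i in ids])    # ['\n', '<|im_start|>', 'assistant', '\n']
```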

Reward implementation bug (do not give reward = 1 when the ground truth is empty while computing F1 scores; a sketch of the fixed check follows the list):

  1. This bug likely gave positive rewards to instances with formatting errors and 0 reward to instances with correct format (I already fixed this in main last week, but it was probably not pulled into the major-update branch). See these lines of code.
  2. A data point with an empty ground truth at the module/entity level can have two kinds of rollouts: those with a format error (so their prediction was empty) and those without one (so their prediction was non-empty). Empty predictions got reward = 1 while non-empty predictions got reward = 0, since this if condition does not trigger. Data points with empty entity/module-level ground truth are quite frequent (roughly 5-10% of the data points in each batch of 256).
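A hedged sketch of the intended guard (the function name and set-based inputs are illustrative, not the repo's actual API):

```python
# Never reward an empty prediction against an empty ground truth:
# format-error rollouts also produce empty predictions, so returning 1
# here rewards broken formatting over correct formatting.
def f1_reward(pred: set[str], gold: set[str]) -> float:
    if not gold:
        # Old behaviour effectively returned 1 for empty predictions here;
        # return 0 instead so format errors are never rewarded.
        return 0.0
    if not pred:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```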

Fixes SkyRL bug: NovaSky-AI/SkyRL#796. For now, please install SkyRL from my patched code (the reasoning parser will still not work, but instruct models do; I am testing the fix for the reasoning parser right now).

Chat template issue: add_generation_prompt must be true, otherwise the agent has to generate the <|im_start|>assistant tokens itself, which is non-ideal (see the sketch below).
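This uses the standard HF apply_chat_template API (the model name is illustrative); the flag controls whether the assistant header is appended to the prompt:

```python
# With add_generation_prompt=True the template ends with the assistant
# header, so generation starts inside the assistant turn; with False the
# model would have to emit "<|im_start|>assistant\n" itself.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "Hello"}]

with_prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
print(repr(with_prompt[len(without_prompt):]))  # '<|im_start|>assistant\n'
```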

Adds exception handling to the openhands agent to selectively recover the message list. This can be helpful in the future if we want to penalize trajectories that ended in errors but not those terminated by LLM context-window errors. Currently all errored trajectories are given reward 0. A rough sketch of the idea follows.
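This is an illustrative sketch only; the exception class and agent interface are assumptions, not the actual openhands API:

```python
class ContextWindowError(Exception):
    """Hypothetical stand-in for the provider's context-window exception."""

def run_agent_with_recovery(agent, task):
    """Run the agent but always recover its message list, tagging the outcome."""
    try:
        messages = agent.run(task)  # hypothetical agent entry point
        return messages, "completed"
    except ContextWindowError:
        # Terminated by context length, not a genuine agent error:
        # a future reward scheme could avoid penalizing these.
        return agent.messages, "context_window_exceeded"
    except Exception as exc:
        return agent.messages, f"agent_error: {exc}"

# For now both failure tags map to reward 0, but the tag preserves the
# information needed to penalize them differently later.
```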

Fixes dataset processing to use the HF datasets I prepared and makes the corresponding code changes.

Cherry-picks some fixes from main which were not merged here.

More analysis from trajectories: the earlier run, without these fixes, had tool calls dropping to 0 along with a sinusoidal reward curve. The reason is that the model suddenly generates tool calls with incorrect formats; these are interpreted as message events (not action events), which causes the trajectory to terminate without a single tool call and with 0 reward.

[Figure: reward curve from the earlier run, showing the sinusoidal pattern as tool calls drop to 0]

Some of the above fixes should help with these issues. There appear to be incorrectly formatted trajectories that were given positive rewards, and the number of such trajectories suddenly increases at a later stage of training (see the plots below):

[Plots: stat1_bad_files, stat2_bad_files_reward_gt_zero, stat3_groups_any_bad, stat4_groups_all_bad]

@adityasoni9998 (Collaborator, Author) commented Jan 5, 2026

@lintangsutawika Note that this trick of fixing the loss mask in the chat template will not work for thinking models (with thinking either enabled or disabled); it has been tested only on instruct models so far!
