Skip to content

Conversation

@yilinjz
Copy link

@yilinjz yilinjz commented Jan 4, 2026

Implementation of format reward as described in Issue #52.

Current training logs revealed three specific structural failures correlated with the reward drop:

  1. The model fails to close tags.
  2. The model attempts to call tools (e.g., search_code) in the final turn instead of providing an answer.
  3. The model hallucinates the raw <|endoftext|> string.

Thus, we introduce a positive-only format reward to guide the model back to structural stability.

@yilinjz yilinjz mentioned this pull request Jan 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant