[#52] Implementation of format reward #59

yilinjz · 2026-01-04T00:10:21Z

Implementation of format reward as described in Issue #52.

Current training logs revealed three specific structural failures correlated with the reward drop:

The model fails to close tags.
The model attempts to call tools (e.g., search_code) in the final turn instead of providing an answer.
The model hallucinates the raw <|endoftext|> string.

Thus, we introduce a positive-only format reward to guide the model back to structural stability.

yilinjz added 4 commits December 30, 2025 18:34

format reward implementation

35878f0

format reward implementation

7262f5c

format reward implementation

15eb6ef

format reward implementation

f164ce0

yilinjz mentioned this pull request Jan 4, 2026

Implement format reward #52

Open

Provide feedback