@hiyuchang
Collaborator

Description

As the title says.

The logged evaluation metrics now look like this:

Step 0: {
  'eval/gsm8k-eval/finished_task_count': 1319,
  'eval/gsm8k-eval/accuracy/mean@4': 0.34154662623199394,
  'eval/gsm8k-eval/accuracy/std@4': 0.2833624429507763,
  'eval/gsm8k-eval/accuracy/best@2/mean': 0.46970053070507956,
  'eval/gsm8k-eval/accuracy/best@2/std': 0.2601951934227904,
  'eval/gsm8k-eval/accuracy/worst@2/mean': 0.21339878695981807,
  'eval/gsm8k-eval/accuracy/worst@2/std': 0.22732260946377386,
  'eval/gsm8k-eval/accuracy/best@4/mean': 0.582379833206975,
  'eval/gsm8k-eval/accuracy/best@4/std': 0.187753964514041,
  'eval/gsm8k-eval/accuracy/worst@4/mean': 0.12428809704321456,
  'eval/gsm8k-eval/accuracy/worst@4/std': 0.13311398785210515,
  'eval/gsm8k-eval/format_score/mean@4': 0.04791508718726308,
  'eval/gsm8k-eval/format_score/std@4': 0.060430579521288934,
  'eval/gsm8k-eval/format_score/best@2/mean': 0.07519408642911297,
  'eval/gsm8k-eval/format_score/best@2/std': 0.043823375514568795,
  'eval/gsm8k-eval/format_score/worst@2/mean': 0.020637604245640644,
  'eval/gsm8k-eval/format_score/worst@2/std': 0.06012139017838808,
  'eval/gsm8k-eval/format_score/best@4/mean': 0.09088036391205459,
  'eval/gsm8k-eval/format_score/best@4/std': 0.02081875469787524,
  'eval/gsm8k-eval/format_score/worst@4/mean': -0.006707808946171337,
  'eval/gsm8k-eval/format_score/worst@4/std': 0.04779745171526119,
  'eval/gsm8k-eval/time/run_execution/mean': 1.8486472846342814,
  'eval/gsm8k-eval/time/run_execution/std': 0.7957768482777444,
  'eval/gsm8k-eval/time/task_execution/mean': 2.0765281061587504,
  'eval/gsm8k-eval/time/task_execution/std': 0.8317681867952945,
  'time/eval': 77.17239356040955
}
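
For readers unfamiliar with the metric names, here is a minimal sketch of how values such as `mean@4`, `best@2/mean` and `worst@2/mean` could be derived. It assumes each eval task is run `n` times, that `mean@n` is the average of the per-task means, and that `best@k` / `worst@k` are the expected maximum / minimum score over all size-`k` subsets of a task's runs. These definitions and the helper name `summarize_repeats` are assumptions for illustration only; the PR's actual implementation may differ (for example, it may estimate these values by sampling).

```python
from itertools import combinations
from statistics import mean


def summarize_repeats(scores_per_task, k):
    """Aggregate repeated-run scores into mean@n, best@k and worst@k.

    scores_per_task: list of per-task score lists, each of the same length n.
    k: subset size for best@k / worst@k (1 <= k <= n).
    Hypothetical helper; not taken from the PR.
    """
    n = len(scores_per_task[0])
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")

    # Expected max / min over every size-k subset of a task's runs.
    best = [mean(max(c) for c in combinations(s, k)) for s in scores_per_task]
    worst = [mean(min(c) for c in combinations(s, k)) for s in scores_per_task]

    return {
        f"mean@{n}": mean(mean(s) for s in scores_per_task),
        f"best@{k}/mean": mean(best),
        f"worst@{k}/mean": mean(worst),
    }


# Two tasks, four runs each (made-up scores, not from the log above).
example = summarize_repeats([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]], k=2)
```

Under these assumed definitions, `best@2/mean` and `worst@2/mean` average to `mean@n`, because every pair contributes its maximum to one and its minimum to the other.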

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@pan-x-c
Collaborator

pan-x-c commented Jan 20, 2026

/unittest-module-trainer

@github-actions

Summary

| Tests 📝 | Passed ✅ | Failed ❌ | Skipped ⏭️ | Other ❓ | Flaky 🍂 | Duration ⏱️ |
| --- | --- | --- | --- | --- | --- | --- |
| 26 | 18 | 3 | 5 | 0 | 0 | 37m 27s |

Failed Tests

| Failed Tests ❌ | Fail Message |
| --- | --- |
| ❌ tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer | The test failed in the call phase |
| ❌ tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer | The test failed in the call phase |
| ❌ tests/trainer/trainer_test.py::TestTrainerLoRA::test_trainer | The test failed in the call phase |

Skipped

| Tests | Status |
| --- | --- |
| tests/trainer/trainer_test.py::TestMultiModalGRPO::test_trainer | skipped ⏭️ |
| tests/trainer/trainer_test.py::TestMultiModalSFT::test_trainer | skipped ⏭️ |
| tests/trainer/trainer_test.py::TestTinkerTrainer::test_trainer | skipped ⏭️ |
| tests/trainer/trainer_test.py::TestTinkerTrainer::test_trainer_class | skipped ⏭️ |
| tests/trainer/trainer_test.py::AgentScopeTunerTest::test_agentscope_tuner | skipped ⏭️ |

Tests

| Test Name | Status | Flaky | Duration |
| --- | --- | --- | --- |
| tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer | ❌ | | 3m 27s |
| tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer | ❌ | | 3m 43s |
| tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer | ✅ | | 1m 40s |
| tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer | ✅ | | 1m 19s |
| tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer | ✅ | | 47.8s |
| tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer | ✅ | | 54.6s |
| tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer | ✅ | | 59.0s |
| tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer | ✅ | | 1m 58s |
| tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer | ✅ | | 34.8s |
| tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer | ✅ | | 31.7s |
| tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools | ✅ | | 30.0s |
| tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode | ✅ | | 1m 31s |
| tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode | ✅ | | 1m 29s |
| tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode | ✅ | | 2m 20s |
| tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer | ✅ | | 2m 4s |
| tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer | ✅ | | 5m 13s |
| tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer | ✅ | | 1m 33s |
| tests/trainer/trainer_test.py::TestServeWithTrainer::test_serve_with_trainer | ✅ | | 1m 47s |
| tests/trainer/trainer_test.py::TestMultiModalGRPO::test_trainer | ⏭️ | | 554ms |
| tests/trainer/trainer_test.py::TestMultiModalSFT::test_trainer | ⏭️ | | 548ms |
| tests/trainer/trainer_test.py::TestTrainerLoRA::test_trainer | ❌ | | 2m 50s |
| tests/trainer/trainer_test.py::TestOverRollout::test_trainer | ✅ | | 48.9s |
| tests/trainer/trainer_test.py::TestTrainerPromptTruncation::test_trainer | ✅ | | 1m 13s |
| tests/trainer/trainer_test.py::TestTinkerTrainer::test_trainer | ⏭️ | | 1ms |
| tests/trainer/trainer_test.py::TestTinkerTrainer::test_trainer_class | ⏭️ | | 1ms |
| tests/trainer/trainer_test.py::AgentScopeTunerTest::test_agentscope_tuner | ⏭️ | | 1ms |

Github Test Reporter by CTRF 💚
