Description
The paper mentions that the experiment was trained for a total of 3000 steps, with a 1:5 update ratio between the student model and the fake model, and a 1:1 ratio between the Rolling Forcing and Self Forcing training modes. Under this schedule, Rolling Forcing is only trained for about 300 steps. Can 300 iterations really be enough for the distillation to converge? Is the reason for not continuing the distillation training that the student model would quickly collapse if trained further?
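For concreteness, here is a minimal back-of-the-envelope sketch of how I am reading the ratios; the way the 1:5 and 1:1 ratios compose, and all variable names, are my assumptions rather than details taken from the paper:

```python
# Hypothetical step-count arithmetic behind the question above.
# How the two ratios compose is an assumption, not from the paper.

total_steps = 3000

# 1:5 student-to-fake-model ratio, read as one student update
# per 5 training steps (assumption).
student_updates = total_steps // 5            # 600

# 1:1 split of student updates between Rolling Forcing and
# Self Forcing (assumption).
rolling_forcing_updates = student_updates // 2  # 300

print(rolling_forcing_updates)  # ~300 Rolling Forcing updates, as stated above
```

If the ratios are meant to compose differently (e.g. 1 student update per 5 fake-model updates out of the 3000 total), the Rolling Forcing step count would be even lower, which makes the convergence question above more pressing.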