Description
The paper mentions that the experiment was trained for a total of 3000 steps, with a 1:5 update ratio between the student model and the fake model, and a 1:1 ratio between the Rolling Forcing and Self Forcing training modes. Under this schedule, Rolling Forcing is only trained for about 300 steps. Can 300 iterations really be enough for the distillation to converge? Is the reason for not continuing the distillation training that the student model would quickly collapse if trained further?
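For concreteness, here is a minimal back-of-the-envelope sketch of how I am reading the ratios; the way the 1:5 and 1:1 ratios compose, and all variable names, are my assumptions rather than details taken from the paper:

```python
# Hypothetical step-count arithmetic behind the question above.
# How the two ratios compose is an assumption, not from the paper.

total_steps = 3000

# 1:5 student-to-fake-model ratio, read as one student update
# per 5 training steps (assumption).
student_updates = total_steps // 5            # 600

# 1:1 split of student updates between Rolling Forcing and
# Self Forcing (assumption).
rolling_forcing_updates = student_updates // 2  # 300

print(rolling_forcing_updates)  # ~300 Rolling Forcing updates, as stated above
```

If the ratios are meant to compose differently (e.g. 1 student update per 5 fake-model updates out of the 3000 total), the Rolling Forcing step count would be even lower, which makes the convergence question above more pressing.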