
Question about training #10


Description

@SolitaryManF

The paper mentions that the experiment trains for a total of 3000 steps, with a 1:5 update ratio between the student model and the fake model, and a 1:1 ratio between the Rolling Forcing and Self Forcing methods. Under this schedule, the student is trained with Rolling Forcing for only about 300 steps. Can 300 iterations really be enough for the distillation to converge? Is the reason for not training the distillation longer that the student model quickly collapses later in training?
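For context, here is how the 300-step figure can be derived from the schedule described above. This is a minimal counting sketch, not code from the Rolling Forcing repository: the variable names and the scheduling convention (the student is updated on every fifth step and the two methods alternate 1:1 across student updates) are assumptions made for illustration.

```python
# Hypothetical sketch of the alternating update schedule described above.
# All names and the exact scheduling convention are assumptions, not taken
# from the Rolling Forcing codebase.

TOTAL_STEPS = 3000          # total training steps reported in the paper
STUDENT_FAKE_RATIO = 5      # 1 student update per 5 fake-model (critic) updates
METHODS = ["rolling_forcing", "self_forcing"]  # alternated 1:1

student_steps = {m: 0 for m in METHODS}
fake_steps = 0

for step in range(TOTAL_STEPS):
    if step % STUDENT_FAKE_RATIO == 0:
        # Student update: alternate between the two methods 1:1.
        method = METHODS[(step // STUDENT_FAKE_RATIO) % len(METHODS)]
        student_steps[method] += 1
    else:
        # Remaining steps update the fake (critic) model.
        fake_steps += 1

print(student_steps)  # {'rolling_forcing': 300, 'self_forcing': 300}
print(fake_steps)     # 2400
```

Under these assumptions, 3000 total steps yield 600 student updates, of which 300 use Rolling Forcing, which matches the number in the question.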
