How to filter out noisy data? #6

Description

@YangZyyyy

I hope this message finds you well. I am currently studying your paper and have come across a couple of points on which I would greatly appreciate clarification; your insights would be very helpful to my understanding of the work. Here are my questions:

  1. The paper mentions that four distinct patterns are observed in the loss trend of each token during training: H->H, L->L, H->L, and L->H, and notes that all categories except L->L exhibit a higher average loss. To address this, the paper introduces SLM, which uses a reference model to score tokens and select a subset of them for training. However, I am curious how this approach specifically targets and removes the noisy tokens depicted in Figure 2: intuitively, one might expect noisy tokens to have a higher loss and thus be more likely to be selected by the reference-model scoring, not filtered out. Could you please elaborate on the mechanism by which SLM effectively eliminates these noisy tokens? (To check my reading of the selection step, I have included a small sketch after question 2.)

  2. For continual pre-training, is it feasible to use the pretrained model itself as the reference model? What would be the potential implications or limitations of such an approach? (A second sketch below shows the setup I have in mind.)
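
To make my reading of the selection step concrete, here is a minimal PyTorch sketch of how I currently understand SLM; the function name, the `keep_ratio` value, and the shape handling are my own assumptions rather than your actual code:

```python
import torch
import torch.nn.functional as F

def slm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Selective LM loss as I understand it: score each token by its
    excess loss (L_theta - L_ref) and train only on the top tokens.

    Assumes logits/labels are already shifted for causal LM and that
    padding has been stripped; keep_ratio is a placeholder value.
    """
    vocab = logits.size(-1)
    # Per-token cross-entropy under the training model (keeps gradients).
    ce = F.cross_entropy(logits.view(-1, vocab), labels.view(-1),
                         reduction="none")
    with torch.no_grad():
        # Per-token cross-entropy under the frozen reference model.
        ref_ce = F.cross_entropy(ref_logits.view(-1, vocab), labels.view(-1),
                                 reduction="none")
        # Excess loss: large when the training model struggles on a token
        # that the reference model already predicts well.
        excess = ce.detach() - ref_ce
        k = max(1, int(keep_ratio * excess.numel()))
        threshold = excess.topk(k).values[-1]  # k-th largest excess loss
        mask = (excess >= threshold).float()
    # Average the training loss over the selected tokens only.
    return (ce * mask).sum() / mask.sum()
```

Question 1 is essentially about the `excess` line: for a noisy token I would expect both `ce` and `ref_ce` to be high, so is the subtraction alone what prevents such tokens from being selected?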
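
For question 2, this is the setup I have in mind (the checkpoint name is a placeholder):

```python
from transformers import AutoModelForCausalLM

base = "org/pretrained-base"  # placeholder checkpoint name

model = AutoModelForCausalLM.from_pretrained(base)      # trainable copy
ref_model = AutoModelForCausalLM.from_pretrained(base)  # frozen reference

ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

# At step 0 the two models are identical, so the excess loss
# (L_model - L_ref) is exactly zero for every token; selection would
# only become informative once the trainable copy drifts away.
```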
