Description
I hope this message finds you well. I am currently studying your paper and have come across a couple of points on which I would greatly appreciate clarification. Your insights would be invaluable to my understanding of the work. Here are my questions:
- In the paper, you observe four distinct patterns in the loss trajectory of each token during training: H->H, L->L, H->L, and L->H, and you note that all categories except L->L exhibit a higher average loss. To address this, the paper introduces SLM (Selective Language Modeling), which uses a reference model to filter tokens based on their loss. However, I am not clear on how this approach specifically targets and removes the noisy tokens depicted in Figure 2. Intuitively, one might expect noisy tokens to have a higher loss and therefore to be more likely to be selected by the reference model. Could you elaborate on the mechanism by which SLM effectively eliminates these noisy tokens? (My current reading of the selection step is sketched below these questions.)
- For continued pre-training, is it feasible to use the pretrained model itself as the reference model? What would be the potential implications or limitations of such an approach?
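
To make sure I am reading the method correctly, here is a minimal sketch of how I currently understand the token selection step: score each token by the gap between the training model's loss and the reference model's loss, then keep the highest-scoring fraction. The names `per_token_loss`, `slm_token_mask`, `keep_ratio`, and `ref_model` are my own placeholders (not from your code), and I assume a Hugging-Face-style causal LM that returns `.logits`. Please correct me if this is not the intended mechanism.

```python
import torch
import torch.nn.functional as F

def per_token_loss(model, input_ids, attention_mask):
    """Per-token cross-entropy loss (reduction='none'): one value per predicted token."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Token at position t is predicted from positions < t.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    return loss.view(shift_labels.shape)  # shape: [batch, seq_len - 1]

def slm_token_mask(train_loss, ref_loss, keep_ratio=0.6):
    """Keep the top `keep_ratio` fraction of tokens ranked by excess loss
    (training-model loss minus reference-model loss); keep_ratio is a placeholder value."""
    excess = train_loss - ref_loss
    k = max(1, int(excess.numel() * keep_ratio))
    threshold = excess.flatten().topk(k).values.min()
    return excess >= threshold  # boolean mask of tokens kept in the LM objective
```

In this reading, the mask would be applied to the training loss roughly as follows, and my second question would then amount to passing the same pretrained checkpoint as both `model` and `ref_model` at the start of continued pre-training:

```python
# ref_loss comes from the frozen reference model; no gradients needed there.
with torch.no_grad():
    ref_loss = per_token_loss(ref_model, input_ids, attention_mask)
train_loss = per_token_loss(model, input_ids, attention_mask)
mask = slm_token_mask(train_loss, ref_loss, keep_ratio=0.6)
selected_loss = (train_loss * mask).sum() / mask.sum()
```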