Hello authors, thanks for your open-source code, I have learned a lot from it.
I have a question about the first stage of the training pipeline:
GLEN/src/tevatron/modeling/glen_phase1.py
Lines 88 to 108 in d165ef8
```python
for i in range(max_output_len_in_batch):
    decoder_attention_mask_cur = decoder_attention_mask[:, : i + 1]
    cur_out = self.hf_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_inputs_embeds=decoder_inputs_embeds,
        decoder_attention_mask=decoder_attention_mask_cur,
        encoder_outputs=encoder_outputs,
        past_key_values=past_key_values,
        output_hidden_states=True,
        use_cache=True,
        return_dict=True,
    )
    if encoder_outputs is None:
        encoder_outputs = cur_out.encoder_outputs
    past_key_values = cur_out.past_key_values
    decoder_inputs_embeds = cur_out.decoder_hidden_states[-1][
        :, -1:, :
    ]  # [B, 1, H]
    lm_logits.append(cur_out.logits[:, -1, :])
    sequence_output.append(cur_out.decoder_hidden_states[-1][:, -1:, :])
```
According to Equation 4 in the paper, the sequence-to-sequence cross-entropy loss should be trained with teacher forcing, but in the code shown above the model generates the docid token by token, feeding the last decoder hidden state back in as the next decoder input embedding, and then computes the loss on those free-running predictions (i.e., without teacher forcing).
Are there any particular considerations behind this style of training?
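
For reference, what I expected from Equation 4 is a single forward pass with the gold docid fed as decoder input, roughly like the sketch below (a minimal illustration of teacher forcing with a HuggingFace seq2seq model, not the repository's code; `model` and `docid_labels` are placeholder names):

```python
import torch.nn.functional as F

# Hypothetical names: `model` stands for the same seq2seq model as
# `self.hf_model` above, and `docid_labels` holds the gold docid token ids,
# shape [B, L], with padding positions set to -100.
out = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    labels=docid_labels,  # HF shifts labels right to build the decoder inputs
    output_hidden_states=True,
    return_dict=True,
)
# `out.loss` is already the teacher-forcing cross-entropy; computed explicitly:
ce_loss = F.cross_entropy(
    out.logits.view(-1, out.logits.size(-1)),
    docid_labels.view(-1),
    ignore_index=-100,
)
```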