question about the first training stage #2

@colinzhaoxp

Hello authors, thanks for your open-source code; I have learned a lot from it.

I have a question about the first stage of the training pipeline:

for i in range(max_output_len_in_batch):
    decoder_attention_mask_cur = decoder_attention_mask[:, : i + 1]
    cur_out = self.hf_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_inputs_embeds=decoder_inputs_embeds,
        decoder_attention_mask=decoder_attention_mask_cur,
        encoder_outputs=encoder_outputs,
        past_key_values=past_key_values,
        output_hidden_states=True,
        use_cache=True,
        return_dict=True,
    )
    if encoder_outputs is None:
        encoder_outputs = cur_out.encoder_outputs
    past_key_values = cur_out.past_key_values
    decoder_inputs_embeds = cur_out.decoder_hidden_states[-1][:, -1:, :]  # [B, 1, H]
    lm_logits.append(cur_out.logits[:, -1, :])
    sequence_output.append(cur_out.decoder_hidden_states[-1][:, -1:, :])

According to Equation 4 in the paper, the sequence-to-sequence cross-entropy loss should be computed with teacher forcing, i.e., the decoder is conditioned on the ground-truth docid prefix at every step. However, the code shown above generates the docid token by token, feeding the decoder's own last hidden state back as the next decoder input embedding, and only then computes the loss on the resulting logits, i.e., without teacher forcing.
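
For comparison, below is a minimal, standalone sketch of how I would have expected the teacher-forcing loss of Equation 4 to be computed in a single forward pass with a Hugging Face T5-style model. This is only my assumption for illustration; the toy query/docid strings and the use of the labels argument are mine, not taken from your code.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch only (my assumption), not code from this repository.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy query/docid pair, purely for illustration.
enc = tokenizer(["what is generative retrieval"], return_tensors="pt", padding=True)
labels = tokenizer(["3 1 4 1 5"], return_tensors="pt", padding=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

# Single forward pass: the decoder is fed the ground-truth docid tokens
# (shifted right internally by the model), so the cross-entropy over all
# positions is computed with teacher forcing.
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)

With labels passed like this, the loss for every docid position is computed in one forward pass against the ground-truth prefix, whereas the loop above conditions each step on the decoder's own previous hidden state.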

Are there any particular considerations behind training in this way?
