question about the first training stage #2

@colinzhaoxp

Hello authors, thanks for your open-source code; I have learned a lot from it.

I have a question about the first stage of the training pipeline:

for i in range(max_output_len_in_batch):
    decoder_attention_mask_cur = decoder_attention_mask[:, : i + 1]
    cur_out = self.hf_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        decoder_inputs_embeds=decoder_inputs_embeds,
        decoder_attention_mask=decoder_attention_mask_cur,
        encoder_outputs=encoder_outputs,
        past_key_values=past_key_values,
        output_hidden_states=True,
        use_cache=True,
        return_dict=True,
    )
    if encoder_outputs is None:
        encoder_outputs = cur_out.encoder_outputs
    past_key_values = cur_out.past_key_values
    decoder_inputs_embeds = cur_out.decoder_hidden_states[-1][:, -1:, :]  # [B, 1, H]
    lm_logits.append(cur_out.logits[:, -1, :])
    sequence_output.append(cur_out.decoder_hidden_states[-1][:, -1:, :])

According to Equation 4 in the paper, the sequence-to-sequence cross-entropy loss should be computed with teacher forcing, i.e., the decoder is conditioned on the ground-truth docid prefix at every step. However, the code shown above generates the docid token by token, feeding the decoder's own last hidden state back as the next decoder input embedding, and only then computes the loss on the resulting logits, i.e., without teacher forcing.
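
For comparison, below is a minimal, standalone sketch of how I would have expected the teacher-forcing loss of Equation 4 to be computed in a single forward pass with a Hugging Face T5-style model. This is only my assumption for illustration; the toy query/docid strings and the use of the labels argument are mine, not taken from your code.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch only (my assumption), not code from this repository.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy query/docid pair, purely for illustration.
enc = tokenizer(["what is generative retrieval"], return_tensors="pt", padding=True)
labels = tokenizer(["3 1 4 1 5"], return_tensors="pt", padding=True).input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss

# Single forward pass: the decoder is fed the ground-truth docid tokens
# (shifted right internally by the model), so the cross-entropy over all
# positions is computed with teacher forcing.
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)

With labels passed like this, the loss for every docid position is computed in one forward pass against the ground-truth prefix, whereas the loop above conditions each step on the decoder's own previous hidden state.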

Are there any particular considerations behind training in this way?
