Hello.
I have a question about the VLBertEmbeddings class.
In its forward function, a global image feature is added to the linguistic tokens.
The last token in the vision sequence is used as the global image feature, as below:
Line 271 in 9e52021
text_visual_embeddings = final_feats[:, -1].repeat(1, seq_length).view(batch_size, seq_length, -1)
Using the last token seems reasonable for the original VLBert (vl-bert_base.json), because its add_global_imgfeat is "last",
but I think it should be the first token for the controlled VLBert (ctrl_vl-bert_base.json), whose add_global_imgfeat is "first".
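To illustrate what I mean, here is a rough sketch of how I would expect the global feature to be selected (the function and variable names are just mine for illustration, not the actual volta code):

```python
import torch

def global_image_feature(final_feats: torch.Tensor, add_global_imgfeat: str) -> torch.Tensor:
    # final_feats: (batch_size, num_regions, v_hidden_size)
    # Pick the global image feature from wherever add_global_imgfeat placed it.
    if add_global_imgfeat == "first":   # ctrl_vl-bert_base.json
        return final_feats[:, 0]
    return final_feats[:, -1]           # vl-bert_base.json ("last")

batch_size, num_regions, hidden, seq_length = 2, 37, 768, 20
final_feats = torch.randn(batch_size, num_regions, hidden)
global_feat = global_image_feature(final_feats, "first")

# Broadcast over the text sequence, as in the current line 271:
text_visual_embeddings = global_feat.repeat(1, seq_length).view(batch_size, seq_length, -1)
print(text_visual_embeddings.shape)  # torch.Size([2, 20, 768])
```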
Is there any reason that the last token is always used in this class?
I'm sorry if I've misunderstood the way the embedding classes work.
Thanks.