Skip to content

On the visual token added to linguistic tokens in VLBertEmbeddings class #10

@iki-taichi

Description

@iki-taichi

Hello.

I have a question about the VLBertEmbeddings class.

In its forward function, a global image feature is added into linguistic tokens
The last token in vision sequence is used as the global image feature like bellow:

text_visual_embeddings = final_feats[:, -1].repeat(1, seq_length).view(batch_size, seq_length, -1)

Using the last token seems reasonable for the original VLBert (vl-bert_base.json) because add_global_imgfeat is last,
but I think this should be the first token for the controlled VLBert (ctrl_vl-bert_base.json), whose add_global_imgfeat is first.

Are there any reason that the last token is always used in the class?

I'm sorry if I misunderstand the way the embeddings classes work.

Thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions