Question about the attention's scale factor in mup #10

@breaddaerb

Hi, thanks for the great work. In muP, the attention scale factor should be 1/head_dim, but the code below uses 1/sqrt(head_dim). Is there a part of the code I might have missed that explains this discrepancy?

scale=1 / self.head_dim**0.5,
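
To make the discrepancy concrete, here is a minimal sketch comparing the two conventions. The helper `attention_scale` and its `mup` flag are hypothetical names used only for illustration; they are not part of this repository. It assumes the usual reading of muP, where attention logits are multiplied by 1/head_dim instead of 1/sqrt(head_dim).

```python
import math


def attention_scale(head_dim: int, mup: bool = False) -> float:
    """Multiplier applied to the q.k attention logits.

    Hypothetical helper for illustration only (not from the repository).
    """
    if mup:
        # Assumed muP convention: scale logits by 1/d.
        return 1.0 / head_dim
    # Standard convention, matching the line quoted above: 1/sqrt(d).
    return 1.0 / math.sqrt(head_dim)


for d in (64, 128, 256):
    standard = attention_scale(d)
    mup_scale = attention_scale(d, mup=True)
    # The two scales differ by a factor of sqrt(d), which grows with width.
    print(d, standard, mup_scale, standard / mup_scale)
```

The gap between the two is a factor of sqrt(head_dim), so it matters only if head_dim changes between the base model and the scaled model. If head_dim is held fixed and width is scaled by adding heads, the two differ by a constant that could be absorbed into tuning, which may be one explanation for the current code.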