@jlamypoirier (Collaborator) commented Jan 10, 2024

The Mistral model adapted to run StarCoder 2:

  • Use layer norm (RMSNorm still available as an option)
  • Use a standard MLP (gated MLP still available as an option)
  • Add back biases (optional)
  • Change the (default?) tokenizer class
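The normalization swap in the first bullet can be sketched numerically. A minimal pure-Python illustration of the difference (not the actual implementation, which operates on tensors):

```python
import math

def layer_norm(x, w, b, eps=1e-5):
    # Standard LayerNorm: center on the mean, scale by the standard
    # deviation, then apply a learned scale w and bias b.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * wi + bi for v, wi, bi in zip(x, w, b)]

def rms_norm(x, w, eps=1e-5):
    # RMSNorm (Mistral's default): no centering and no bias term;
    # divide by the root mean square and apply a learned scale only.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * wi for v, wi in zip(x, w)]
```

The bias bullet follows the same pattern: LayerNorm carries a bias term that RMSNorm drops, so switching the norm and re-adding biases go together.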

Missing for StarCoder 1 (do we want to support it?):

  • Absolute position embeddings
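For context, absolute position embeddings are a learned per-position table added to the token embeddings before the first layer, which Mistral's rotary-embedding code path has no slot for. A hypothetical sketch (the function and argument names are illustrative, not taken from the code):

```python
def add_absolute_positions(token_embeds, position_table):
    # StarCoder 1 style: each token embedding gets the learned
    # embedding for its position added to it. Mistral instead rotates
    # query/key vectors inside attention (RoPE), so there is no such
    # table to port over.
    return [
        [t + p for t, p in zip(tok, position_table[i])]
        for i, tok in enumerate(token_embeds)
    ]
```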

Other notes:

  • Has fewer entries in the modeling auto mappings than GPT-BigCode (3 instead of 6); probably doesn't matter.
  • Uses repeat for the KV cache in flash attention; this might not be necessary.
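The repeat in question expands grouped key/value heads up to the number of query heads so attention can be computed as plain multi-head attention. A hypothetical pure-Python sketch (`repeat_kv` here is an illustrative name, not necessarily the function used in this PR):

```python
def repeat_kv(kv_heads, n_rep):
    # Grouped-query attention shares each KV head across n_rep query
    # heads. Repeating each KV head n_rep times restores a one-to-one
    # head layout for a standard attention implementation.
    out = []
    for head in kv_heads:
        out.extend([head] * n_rep)
    return out
```

If the flash-attention kernel can consume fewer KV heads than query heads directly, the repeat (and its extra memory traffic) could be skipped, which is what the note is questioning.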

Still have a bunch of minor things to do (see the TODOs).
