Add ModernBERT family #435
Conversation
Pull request overview
This PR adds comprehensive support for ModernBERT, a recent encoder model that modernizes BERT with architectural improvements including RoPE position embeddings, alternating local/global attention, gated linear units (GeGLU), and pre-normalization. The implementation follows established patterns in the codebase and includes proper model-to-HuggingFace parameter mappings.
- Full implementation of the ModernBERT model with four architectures: :base, :for_masked_language_modeling, :for_sequence_classification, and :for_token_classification
- Special attention architecture with alternating local (window-based) and global attention layers, each with distinct RoPE theta values
- Test coverage for base and MLM architectures with validation against reference outputs
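For reference, here is a minimal usage sketch with the standard Bumblebee loading API; the repository name "answerdotai/ModernBERT-base" is an assumption about which Hugging Face checkpoint this maps to, not something stated in this PR:

```elixir
# Hedged sketch: load the :base architecture and run a single forward pass.
# The repository name below is assumed, not taken from this PR.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "answerdotai/ModernBERT-base"}, architecture: :base)

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "answerdotai/ModernBERT-base"})

inputs = Bumblebee.apply_tokenizer(tokenizer, "Paris is the capital of France.")
outputs = Axon.predict(model, params, inputs)

# outputs.hidden_state has shape {batch_size, sequence_length, hidden_size}
```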
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lib/bumblebee/text/modernbert.ex | Core implementation including the encoder with alternating attention patterns, gated FFN, RMS normalization, mean pooling for sequence classification (sketched after this table), and tied embeddings for the MLM head |
| lib/bumblebee/text/pre_trained_tokenizer.ex | Adds ModernBERT special token configuration (UNK, SEP, PAD, CLS, MASK) |
| lib/bumblebee.ex | Registers the ModernBERT model architectures and tokenizer type mapping |
| test/bumblebee/text/modernbert_test.exs | Integration tests for :base and :for_masked_language_modeling architectures with output validation |
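The sequence classification head pools token states with an attention-masked mean. Here is a small Nx sketch of that idea (an illustrative helper, not the PR's exact layer code):

```elixir
defmodule PoolingSketch do
  import Nx.Defn

  # hidden_state: {batch, seq_len, hidden}, attention_mask: {batch, seq_len}.
  # Averages hidden states over non-padding positions only.
  defn mean_pool(hidden_state, attention_mask) do
    mask = Nx.new_axis(attention_mask, -1)
    summed = Nx.sum(hidden_state * mask, axes: [1])
    counts = Nx.sum(mask, axes: [1])
    summed / Nx.max(counts, 1)
  end
end
```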
force-pushed from 684e256 to af54694
force-pushed from af54694 to 6904c32
Matched the output with transformers, looks good now. Added ModernBertDecoder as well, which is the decoder variant for causal language modeling. It's a separate model in transformers (different file, different config), so I created a separate module for it. Also tested loading
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comment on the decoder test file, at these lines:

@moduletag model_test_tags()

test ":for_causal_language_modeling" do
Copilot AI (Jan 1, 2026)
The test file is missing a test for the :base architecture. ModernBertDecoder declares support for the :base architecture in its architectures/0 function, but there is no corresponding test case. Other decoder models in the codebase (e.g., GPT2, Llama) include tests for all their supported architectures, including :base.
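One way to address this, following the shape of other Bumblebee model tests; the checkpoint name and the expected hidden size below are placeholders, so treat this as a hypothetical sketch rather than a drop-in test:

```elixir
# Hypothetical sketch; "tiny-random/modernbert-decoder" and the expected shape
# are placeholders, not values verified against transformers.
test ":base" do
  assert {:ok, %{model: model, params: params, spec: spec}} =
           Bumblebee.load_model({:hf, "tiny-random/modernbert-decoder"},
             architecture: :base
           )

  assert spec.architecture == :base

  inputs = %{
    "input_ids" => Nx.tensor([[10, 20, 30, 40, 50, 60, 70, 80, 0, 0]]),
    "attention_mask" => Nx.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])
  }

  outputs = Axon.predict(model, params, inputs)

  # Placeholder hidden size; the real test would assert against reference values.
  assert Nx.shape(outputs.hidden_state) == {1, 10, 32}
end
```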
Add support for ModernBERT, a modern encoder model with architectural improvements over BERT:
- Rotary position embeddings (RoPE) instead of absolute position embeddings
- Alternating local and global attention layers for efficiency
- Gated linear units (GeGLU) in feed-forward blocks
- Pre-normalization with LayerNorm (no bias)
- First layer reuses the embedding norm for attention

Supported architectures:
- :base
- :for_masked_language_modeling
- :for_sequence_classification
- :for_token_classification

Reference: https://arxiv.org/abs/2412.13663
force-pushed from 6904c32 to d47539c
Replace manual layer norm implementation with Axon.Layers.layer_norm, passing 0 as beta to implement layer norm without bias.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
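A quick illustration of the commit above; the input shape and epsilon are assumed here, not copied from the PR:

```elixir
# Bias-free layer norm: passing 0 as beta means no learned offset is added.
x = Nx.iota({1, 4, 8}, type: :f32)
gamma = Nx.broadcast(1.0, {8})

out = Axon.Layers.layer_norm(x, gamma, 0, epsilon: 1.0e-6)
```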
Rename local_rope_theta/global_rope_theta to rotary_embedding_base_local/rotary_embedding_base to match the naming convention used in other models (Gemma 3).
Replace global_attention_every_n_layers with layer_types list to match transformers v5 format. Backward compatibility: generates layer_types from global_attn_every_n_layers when not present in checkpoint config.
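The backward-compatibility path could look roughly like this; the atom names and the modulo rule are assumptions based on how the alternating pattern is described, not code copied from the PR:

```elixir
# Hedged sketch: derive layer_types from the older global_attn_every_n_layers
# option when the checkpoint config does not provide layer_types directly.
num_layers = 22
global_attn_every_n_layers = 3

layer_types =
  for idx <- 0..(num_layers - 1) do
    if rem(idx, global_attn_every_n_layers) == 0,
      do: :global_attention,
      else: :sliding_attention
  end

# => [:global_attention, :sliding_attention, :sliding_attention, :global_attention, ...]
```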
This adds support for ModernBERT, a recent encoder model that improves on BERT with a few architectural changes:
- Rotary position embeddings (RoPE) instead of absolute position embeddings
- Alternating local (window-based) and global attention layers for efficiency
- Gated linear units (GeGLU) in the feed-forward blocks (sketched below)
- Pre-normalization with LayerNorm (no bias)
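To make the gated feed-forward concrete, here is a minimal GeGLU sketch in Nx (illustrative only; the PR's actual layer wiring may differ):

```elixir
defmodule GegluSketch do
  # The feed-forward input projection produces twice the intermediate size;
  # one half gates the other after a GELU activation.
  def geglu(x) do
    half = div(Nx.axis_size(x, -1), 2)
    gate = Nx.slice_along_axis(x, 0, half, axis: -1)
    value = Nx.slice_along_axis(x, half, half, axis: -1)
    Nx.multiply(Axon.Activations.gelu(gate), value)
  end
end

# With an input of shape {batch, seq, 2 * intermediate_size},
# the output has shape {batch, seq, intermediate_size}.
```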
ModernBert (Encoder)
Supported architectures:
- :base
- :for_masked_language_modeling
- :for_sequence_classification
- :for_token_classification

The MLM head uses tied embeddings (shares weights with the input token embeddings).
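A hedged usage sketch for the MLM head through Bumblebee's fill-mask serving; the repository name is an assumption, not something this PR pins down:

```elixir
# Assumed checkpoint name; any ModernBERT MLM checkpoint should work the same way.
{:ok, model_info} = Bumblebee.load_model({:hf, "answerdotai/ModernBERT-base"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "answerdotai/ModernBERT-base"})

serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "The capital of France is [MASK].")
```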
ModernBertDecoder (Decoder)
A decoder-only variant trained for causal language modeling (text generation).
Supported architectures:
- :base
- :for_causal_language_modeling

Reference: https://arxiv.org/abs/2412.13663
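Finally, a hedged sketch for the decoder variant via the text generation serving; the repository name is a placeholder, since the excerpt above does not name the checkpoint that was tested:

```elixir
# Placeholder repository name; substitute the actual ModernBERT decoder checkpoint.
repo = {:hf, "some-org/modernbert-decoder-checkpoint"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
Nx.Serving.run(serving, "The meaning of life is")
```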