Conversation


@georgeguimaraes georgeguimaraes commented Dec 28, 2025

This adds support for ModernBERT, a recent encoder model that improves on BERT with a few architectural changes:

  • Rotary position embeddings (RoPE) instead of absolute position embeddings
  • Alternating local and global attention layers for efficiency on longer sequences
  • Gated linear units (GeGLU) in the feed-forward blocks (sketched after this list)
  • Pre-normalization (norm before attention/FFN rather than after)
  • No bias in layer normalization
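
As a rough illustration of the gated feed-forward blocks, here is a minimal Nx sketch of GeGLU. This is not the PR's code: the separate w_gate/w_up/w_down weights are illustrative, and the actual layer may fuse the first two projections into one matrix.

```elixir
defmodule GegluSketch do
  import Nx.Defn

  # GeGLU: a GELU-activated "gate" multiplied elementwise with a linear "up"
  # projection, followed by a down projection back to the hidden size.
  defn geglu_ffn(x, w_gate, w_up, w_down) do
    gate = x |> Nx.dot(w_gate) |> Axon.Activations.gelu()
    up = Nx.dot(x, w_up)

    gate
    |> Nx.multiply(up)
    |> Nx.dot(w_down)
  end
end
```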

ModernBert (Encoder)

Supported architectures:

  • :base
  • :for_masked_language_modeling
  • :for_sequence_classification
  • :for_token_classification

The MLM head uses tied embeddings (shares weights with the input token embeddings).
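
Assuming the existing Bumblebee fill-mask serving works unchanged with the new architecture, usage would look roughly like this (the checkpoint is the public answerdotai one mentioned later in this thread):

```elixir
# Load the encoder with its MLM head and run fill-mask on a sentence
# containing the [MASK] special token.
{:ok, model_info} = Bumblebee.load_model({:hf, "answerdotai/ModernBERT-base"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "answerdotai/ModernBERT-base"})

serving = Bumblebee.Text.fill_mask(model_info, tokenizer)
Nx.Serving.run(serving, "The capital of France is [MASK].")
```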

ModernBertDecoder (Decoder)

A decoder-only variant trained for causal language modeling (text generation).

Supported architectures:

  • :base
  • :for_causal_language_modeling
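
A hypothetical usage sketch for the decoder variant; the checkpoint name below is a placeholder, not a real repository:

```elixir
# Text generation with the decoder variant. Substitute an actual
# ModernBertDecoder checkpoint for the placeholder repository.
repo = {:hf, "some-org/modernbert-decoder"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)
Nx.Serving.run(serving, "The quick brown fox")
```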

Reference: https://arxiv.org/abs/2412.13663

Copilot AI review requested due to automatic review settings December 28, 2025 11:12

Copilot AI left a comment


Pull request overview

This PR adds comprehensive support for ModernBERT, a recent encoder model that modernizes BERT with architectural improvements including RoPE position embeddings, alternating local/global attention, gated linear units (GeGLU), and pre-normalization. The implementation follows established patterns in the codebase and includes proper model-to-HuggingFace parameter mappings.

  • Full implementation of ModernBERT model with four architectures: :base, :for_masked_language_modeling, :for_sequence_classification, and :for_token_classification
  • Special attention architecture with alternating local (window-based) and global attention layers, each with distinct RoPE theta values
  • Test coverage for base and MLM architectures with validation against reference outputs

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

  • lib/bumblebee/text/modernbert.ex: Core implementation including encoder with alternating attention patterns, gated FFN, RMS normalization, mean pooling for sequence classification, and tied embeddings for MLM head
  • lib/bumblebee/text/pre_trained_tokenizer.ex: Adds ModernBERT special token configuration (UNK, SEP, PAD, CLS, MASK)
  • lib/bumblebee.ex: Registers ModernBERT model architectures and tokenizer type mapping
  • test/bumblebee/text/modernbert_test.exs: Integration tests for :base and :for_masked_language_modeling architectures with output validation


@georgeguimaraes georgeguimaraes force-pushed the feat/modernbert-implementation branch 2 times, most recently from 684e256 to af54694 on January 1, 2026 18:20
@georgeguimaraes georgeguimaraes changed the title from "Add ModernBERT model" to "Add ModernBERT family" on Jan 1, 2026
@georgeguimaraes georgeguimaraes force-pushed the feat/modernbert-implementation branch from af54694 to 6904c32 on January 1, 2026 19:03
@georgeguimaraes georgeguimaraes (Contributor, Author) commented

Matched the output with transformers; looks good now.

Added ModernBertDecoder as well, which is the decoder variant for causal language modeling. It's a separate model in transformers (different file, different config), so I created a separate module for it.

Also tested loading answerdotai/ModernBERT-base to make sure params mapping works with real checkpoints. No issues there.
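
For reference, that check amounts to something like the following; the :for_sequence_classification override here is only an example of forcing one of the new heads onto the base checkpoint:

```elixir
# Load the real checkpoint to exercise the params mapping. Overriding the
# architecture attaches a (freshly initialized) classification head.
{:ok, %{model: model, params: params, spec: spec}} =
  Bumblebee.load_model({:hf, "answerdotai/ModernBERT-base"},
    architecture: :for_sequence_classification
  )
```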

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.




@moduletag model_test_tags()

test ":for_causal_language_modeling" do

Copilot AI commented Jan 1, 2026


The test file is missing a test for the :base architecture. ModernBertDecoder declares support for the :base architecture in its architectures/0 function, but there is no corresponding test case. Other decoder models in the codebase (e.g., GPT2, Llama) include tests for all their supported architectures, including :base.
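
A minimal sketch of what such a :base test could look like, following the pattern of other Bumblebee decoder tests (the module name, checkpoint, and expected shape are assumptions, not values from this PR):

```elixir
test ":base" do
  # Hypothetical checkpoint; substitute a real ModernBertDecoder repository.
  assert {:ok, %{model: model, params: params, spec: spec}} =
           Bumblebee.load_model({:hf, "some-org/modernbert-decoder"})

  assert %Bumblebee.Text.ModernBertDecoder{architecture: :base} = spec

  inputs = %{
    "input_ids" => Nx.tensor([[10, 20, 30, 40, 50]]),
    "attention_mask" => Nx.tensor([[1, 1, 1, 1, 1]])
  }

  outputs = Axon.predict(model, params, inputs)

  # {batch, sequence_length, hidden_size}; the hidden size is assumed here.
  assert Nx.shape(outputs.hidden_state) == {1, 5, 768}
end
```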

Add support for ModernBERT, a modern encoder model with architectural improvements over BERT:

- Rotary position embeddings (RoPE) instead of absolute position embeddings
- Alternating local and global attention layers for efficiency
- Gated linear units (GeGLU) in feed-forward blocks
- Pre-normalization with LayerNorm (no bias)
- First layer reuses embedding norm for attention

Supported architectures:
- :base
- :for_masked_language_modeling
- :for_sequence_classification
- :for_token_classification

Reference: https://arxiv.org/abs/2412.13663
@georgeguimaraes georgeguimaraes force-pushed the feat/modernbert-implementation branch from 6904c32 to d47539c on January 3, 2026 21:20
georgeguimaraes and others added 4 commits January 6, 2026 10:22
Replace manual layer norm implementation with Axon.Layers.layer_norm,
passing 0 as beta to implement layer norm without bias.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
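
For reference, the bias-free normalization described in that commit could be expressed roughly as follows (a sketch, not the exact call site from the diff):

```elixir
defmodule BiaslessNormSketch do
  import Nx.Defn

  # Layer norm without a learned offset: gamma is the learned scale, and
  # passing 0 for beta drops the bias term, mirroring the commit above.
  defn norm_without_bias(x, gamma) do
    Axon.Layers.layer_norm(x, gamma, 0)
  end
end
```
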
Rename local_rope_theta/global_rope_theta to rotary_embedding_base_local/rotary_embedding_base to match the naming convention in other models (Gemma 3).
Replace global_attention_every_n_layers with layer_types list to match
transformers v5 format. Backward compatibility: generates layer_types
from global_attn_every_n_layers when not present in checkpoint config.
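
A rough sketch of that backward-compatibility mapping (field names follow the HF config; the "full_attention"/"sliding_attention" strings and the layer-0-is-global assumption mirror the upstream convention, and the actual helper in the diff may differ):

```elixir
defmodule LayerTypesSketch do
  # Prefer an explicit layer_types list when the checkpoint config has one.
  def infer_layer_types(%{"layer_types" => layer_types}) when is_list(layer_types) do
    layer_types
  end

  # Otherwise derive it from the older global_attn_every_n_layers field,
  # treating every n-th layer (starting at layer 0) as global attention.
  def infer_layer_types(%{
        "num_hidden_layers" => num_layers,
        "global_attn_every_n_layers" => every
      }) do
    for index <- 0..(num_layers - 1) do
      if rem(index, every) == 0, do: "full_attention", else: "sliding_attention"
    end
  end
end
```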