feat: Add SigLIP model support #438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add SigLIP (Sigmoid Loss for Language Image Pre-Training) model support, including a vision encoder, a text encoder, and a multimodal model.

Key features:

- Vision encoder with attention pooling head
- Text encoder with final layer norm and projection head
- Multimodal model with sigmoid-based similarity scoring
- Support for both SigLIP v1 and SigLIP2 models (same architecture)

The implementation includes a fix for batch broadcasting in the attention pooling head to support batch sizes > 1.

Verified against HuggingFace transformers with:

- google/siglip-base-patch16-224 (max diff < 8e-6)
- google/siglip2-base-patch16-224 (max diff < 2e-5)
- katuni4ka/tiny-random-SiglipModel (max diff < 3e-7)
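For context, here is a minimal end-to-end usage sketch of the multimodal model. It is not taken from this PR: it assumes the checkpoint's tokenizer and featurizer load through Bumblebee's generic `load_tokenizer/1` and `load_featurizer/1`, and the `image` variable is a placeholder for an already-loaded image, so treat it as illustrative rather than as the PR's API.

```elixir
# Hypothetical usage sketch, not part of this PR. Assumes the SigLIP checkpoint
# ships a tokenizer and featurizer that Bumblebee's generic loaders can handle.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "google/siglip-base-patch16-224"})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google/siglip-base-patch16-224"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "google/siglip-base-patch16-224"})

# `image` is a placeholder for an already-loaded image (e.g. an StbImage or Nx tensor)
text_inputs = Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])
image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

outputs = Axon.predict(model, params, Map.merge(text_inputs, image_inputs))

# Unlike CLIP, SigLIP logits are meant to go through a sigmoid (not a softmax),
# giving an independent match probability per text-image pair.
probabilities = Nx.sigmoid(outputs.logits_per_image)
```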
- Rename Siglip to SigLip to follow MpNet naming pattern
- Fix `docs:` to `doc:` typo in siglip_vision.ex
Pull request overview
This PR adds comprehensive support for the SigLIP (Sigmoid Loss for Language Image Pre-Training) model, a CLIP-like architecture that uses sigmoid loss instead of contrastive loss for improved scaling and training stability. The implementation follows established patterns from the CLIP model while incorporating SigLIP-specific architectural differences.
Key changes:
- Implements three new model modules: multimodal SigLip, text encoder SigLipText, and vision encoder SigLipVision
- SigLipVision features a distinctive multi-head attention pooling mechanism with a learned probe vector, differentiating it from CLIP's simpler pooling approach
- Adds a learnable bias parameter to similarity logits, a key architectural difference from CLIP
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/bumblebee/multimodal/siglip.ex | Implements the combined SigLIP model with text and vision encoders, including custom scale and bias layers for logit transformation |
| lib/bumblebee/text/siglip_text.ex | Implements the SigLIP text encoder with last-token pooling strategy |
| lib/bumblebee/vision/siglip_vision.ex | Implements the SigLIP vision encoder with attention pooling head and support for image classification architecture |
| lib/bumblebee.ex | Registers SigLIP model variants and maps HuggingFace model types to internal modules |
| mix.exs | Adds SigLIP modules to the project's main modules list |
| test/bumblebee/multimodal/siglip_test.exs | Adds integration test for the combined multimodal SigLIP model using a tiny test model |
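The "custom scale and bias layers" in the table refer to SigLIP applying a learned multiplicative scale and additive bias to the cosine-similarity matrix before the sigmoid. Below is a hedged, pure-Nx sketch of that computation; the parameter names `logit_scale` and `logit_bias` and the `exp`-parameterized scale follow the reference SigLIP formulation and are assumptions, not necessarily how this PR structures its Axon layers.

```elixir
# Sketch of SigLIP-style logit computation with Nx. Names and the exp()
# parameterization of the scale are assumptions based on the SigLIP paper,
# not necessarily this PR's internals.
defmodule SigLipLogitsSketch do
  def logits(text_embeddings, image_embeddings, logit_scale, logit_bias) do
    # L2-normalize both embedding sets so the dot product is a cosine similarity
    text = normalize(text_embeddings)
    image = normalize(image_embeddings)

    # {num_texts, num_images} similarity matrix, scaled and shifted
    text
    |> Nx.dot([1], image, [1])
    |> Nx.multiply(Nx.exp(logit_scale))
    |> Nx.add(logit_bias)
  end

  defp normalize(embeddings) do
    norm = Nx.sqrt(Nx.sum(Nx.multiply(embeddings, embeddings), axes: [-1], keep_axes: true))
    Nx.divide(embeddings, norm)
  end
end

# Each entry is then squashed independently with Nx.sigmoid/1, in contrast to
# CLIP's softmax over the batch.
```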
test/bumblebee/multimodal/siglip_test.exs:

```elixir
defmodule Bumblebee.Multimodal.SigLipTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"})

    assert %Bumblebee.Multimodal.SigLip{architecture: :base} = spec

    # Image size is 30x30 for this tiny model
    inputs = %{
      "input_ids" =>
        Nx.tensor([
          [10, 20, 30, 40, 50, 60, 70, 80, 1, 1],
          [15, 25, 35, 45, 55, 65, 75, 85, 1, 1]
        ]),
      "pixel_values" =>
        Nx.concatenate([
          Nx.broadcast(0.25, {1, 30, 30, 3}),
          Nx.broadcast(0.75, {1, 30, 30, 3})
        ])
    }

    outputs = Axon.predict(model, params, inputs)

    assert Nx.shape(outputs.logits_per_text) == {2, 2}
    assert Nx.shape(outputs.logits_per_image) == {2, 2}

    assert_all_close(
      outputs.logits_per_text,
      Nx.tensor([[-0.0626, -0.0771], [-0.0961, -0.1548]]),
      atol: 1.0e-3
    )

    assert_all_close(
      outputs.logits_per_image,
      Nx.tensor([[-0.0626, -0.0961], [-0.0771, -0.1548]]),
      atol: 1.0e-3
    )
  end
end
```
Copilot AI · Jan 3, 2026
The PR introduces standalone text and vision encoder modules (SigLipText and SigLipVision) similar to CLIP, but only includes tests for the combined multimodal model. Following the pattern used by CLIP (which has separate test files for ClipText and ClipVision), there should be dedicated test files for Bumblebee.Text.SigLipText and Bumblebee.Vision.SigLipVision. This would ensure comprehensive test coverage for these standalone modules, particularly the :for_image_classification architecture supported by SigLipVision.
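For reference, here is a sketch of what such a standalone test could look like, modeled on the multimodal test above. The reuse of the same tiny checkpoint, the `module:`/`architecture:` load options, and the `hidden_state`/`pooled_state` output names are assumptions based on how other Bumblebee vision models are tested; the expected tensor values would have to be computed against HuggingFace transformers before being asserted.

```elixir
# Hypothetical test sketch, not part of this PR. It reuses the tiny checkpoint
# from the multimodal test and assumes the vision encoder exposes
# `hidden_state` and `pooled_state` outputs like other Bumblebee vision models.
defmodule Bumblebee.Vision.SigLipVisionTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"},
               module: Bumblebee.Vision.SigLipVision,
               architecture: :base
             )

    assert %Bumblebee.Vision.SigLipVision{architecture: :base} = spec

    inputs = %{"pixel_values" => Nx.broadcast(0.5, {1, 30, 30, 3})}

    outputs = Axon.predict(model, params, inputs)

    # Shape-only assertions; reference values should be computed against
    # HuggingFace transformers before being hard-coded here.
    assert Nx.rank(outputs.hidden_state) == 3
    assert Nx.rank(outputs.pooled_state) == 2
  end
end
```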
Forgot to mention why I'm working on SigLIP. There's a cool NSFW image detection model that uses SigLIP2: https://huggingface.co/blog/prithivMLmods/image-guard-models
Adds support for SigLIP (Sigmoid Loss for Language Image Pre-Training), a CLIP-like model that uses sigmoid loss instead of contrastive loss.
The implementation includes:

- `Bumblebee.Multimodal.SigLip` for the combined text-image model
- `Bumblebee.Text.SigLipText` for the text encoder
- `Bumblebee.Vision.SigLipVision` for the vision encoder (with `:base` and `:for_image_classification` architectures)

SigLIP differs from CLIP mainly in the attention pooling head for the vision model and the addition of a learnable bias to the logits. The vision model uses a multi-head attention pooling mechanism with a learned probe vector.
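To make the pooling concrete, here is a hedged single-head Nx sketch of probe-based attention pooling. The actual head is multi-head with learned projections inside `SigLipVision`; the name `probe` (a learned `{1, 1, hidden}` parameter) and the single-head simplification are assumptions for illustration. It also shows the batch broadcast of the probe that the batch-size > 1 fix in the description refers to.

```elixir
# Single-head sketch of probe-based attention pooling over patch tokens.
# The real SigLIP head is multi-head with learned query/key/value projections;
# this is a simplified illustration, not the PR's implementation.
defmodule AttentionPoolSketch do
  # hidden_state: {batch, seq_len, hidden}; probe: learned {1, 1, hidden} parameter
  def pool(hidden_state, probe) do
    {batch, _seq_len, hidden} = Nx.shape(hidden_state)

    # Broadcast the learned probe across the batch (the kind of batch
    # broadcasting the PR description mentions fixing).
    query = Nx.broadcast(probe, {batch, 1, hidden})

    # Attention weights of the probe over all patch tokens: {batch, 1, seq_len}
    scores =
      query
      |> Nx.dot([2], [0], hidden_state, [2], [0])
      |> Nx.divide(Nx.sqrt(hidden))

    weights = Axon.Activations.softmax(scores, axis: -1)

    # Weighted sum over the sequence -> {batch, 1, hidden}, squeezed to {batch, hidden}
    weights
    |> Nx.dot([2], [0], hidden_state, [1], [0])
    |> Nx.squeeze(axes: [1])
  end
end
```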
Regarding naming: I went with `SigLip` (capitalizing each word part) to follow the pattern used by `MpNet`.

Note on the test: the tiny model used for testing comes from `katuni4ka/tiny-random-SiglipModel` rather than the usual `hf-internal-testing` namespace. I couldn't find a tiny SigLIP model there. The model works correctly against real HuggingFace models like `google/siglip-base-patch16-224`.

I also started working on SigLIP2 support but left it out of this PR to keep things focused. The architecture is similar but has some differences in the vision encoder (uses SwiGLU activation, RMSNorm, and 2D RoPE).