Conversation

@georgeguimaraes
Contributor

Adds support for SigLIP (Sigmoid Loss for Language Image Pre-Training), a CLIP-like model that uses sigmoid loss instead of contrastive loss.

The implementation includes:

  • Bumblebee.Multimodal.SigLip for the combined text-image model
  • Bumblebee.Text.SigLipText for the text encoder
  • Bumblebee.Vision.SigLipVision for the vision encoder (with :base and :for_image_classification architectures)

SigLIP differs from CLIP mainly in the attention pooling head for the vision model and the addition of a learnable bias to the logits. The vision model uses a multi-head attention pooling mechanism with a learned probe vector.
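
For context (not code from this PR), the pairwise sigmoid loss from the paper can be sketched in plain Nx. It assumes `text_emb` and `image_emb` are {batch, dim} tensors of already L2-normalized embeddings, and that `logit_scale` is stored in log space, hence the exp:

defmodule SigLipLossSketch do
  # Pairwise sigmoid loss from the SigLIP paper (a sketch, not the PR's
  # actual code). Matching text-image pairs sit on the diagonal.
  def sigmoid_loss(text_emb, image_emb, logit_scale, logit_bias) do
    {batch, _dim} = Nx.shape(text_emb)

    # Scaled, biased cosine similarities: {batch, batch}
    logits =
      text_emb
      |> Nx.dot([1], image_emb, [1])
      |> Nx.multiply(Nx.exp(logit_scale))
      |> Nx.add(logit_bias)

    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = Nx.subtract(Nx.multiply(Nx.eye(batch), 2), 1)

    # -1/batch * sum of log sigmoid(labels * logits); a production
    # implementation would use a numerically stable log-sigmoid
    Nx.multiply(labels, logits)
    |> Nx.sigmoid()
    |> Nx.log()
    |> Nx.sum()
    |> Nx.divide(-batch)
  end
end

Because every pair gets an independent sigmoid, the loss needs no batch-wide softmax normalization, which is what makes SigLIP cheaper to scale than CLIP's contrastive objective.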

Regarding naming: I went with SigLip (capitalizing each word part) to follow the pattern used by MpNet.

Note on the test: the tiny model used for testing comes from katuni4ka/tiny-random-SiglipModel rather than the usual hf-internal-testing namespace, since I couldn't find a tiny SigLIP model there. The implementation also works correctly against full-size HuggingFace models such as google/siglip-base-patch16-224.
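
For anyone who wants to try it, usage should mirror Bumblebee's CLIP flow. A sketch, assuming the tokenizer and featurizer mappings are registered for these checkpoints and that `image` is any {height, width, 3} image tensor:

repo = {:hf, "google/siglip-base-patch16-224"}

{:ok, %{model: model, params: params}} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)

text_inputs = Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])
image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

outputs = Axon.predict(model, params, Map.merge(text_inputs, image_inputs))

# Unlike CLIP's softmax over the batch, SigLIP scores each text-image
# pair independently, so per-pair probabilities come from a sigmoid
probabilities = Nx.sigmoid(outputs.logits_per_text)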

I also started working on SigLIP2 support but left it out of this PR to keep things focused. The architecture is similar but has some differences in the vision encoder (uses SwiGLU activation, RMSNorm, and 2D RoPE).
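
For reference, SwiGLU gates one half of the up-projected features with the SiLU of the other half. A plain-Nx sketch (the module name and split layout are illustrative, not SigLIP2's exact parameterization):

defmodule SwiGluSketch do
  # SwiGLU: silu(gate) * value, where gate and value are the two halves
  # of the up-projection's output along the feature axis
  def swiglu(x) do
    half = div(Nx.axis_size(x, -1), 2)
    gate = Nx.slice_along_axis(x, 0, half, axis: -1)
    value = Nx.slice_along_axis(x, half, half, axis: -1)

    # silu(gate) == gate * sigmoid(gate)
    gate
    |> Nx.multiply(Nx.sigmoid(gate))
    |> Nx.multiply(value)
  end
end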

Add SigLIP (Sigmoid Loss for Language Image Pre-Training) model support
including vision encoder, text encoder, and multimodal model.

Key features:
- Vision encoder with attention pooling head
- Text encoder with final layer norm and projection head
- Multimodal model with sigmoid-based similarity scoring
- Support for both SigLIP v1 and SigLIP2 models (the SigLIP2 checkpoints verified below use the same architecture)

The implementation includes a fix for batch broadcasting in the attention
pooling head to support batch sizes > 1.
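
Concretely, the fix amounts to tiling the learned probe across the batch axis before attention runs. Roughly (shapes and names here are illustrative, not the PR's actual code):

# Illustrative shapes, not the PR's actual code
batch_size = 2
hidden_size = 768

# The learned probe is a single query vector stored as {1, 1, hidden_size}
probe = Nx.iota({1, 1, hidden_size}, type: :f32)

# Tile it across the batch so every example attends with its own copy;
# without this, batch sizes > 1 fail to broadcast through the attention
probe = Nx.broadcast(probe, {batch_size, 1, hidden_size})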

Verified against HuggingFace transformers with:
- google/siglip-base-patch16-224 (max diff < 8e-6)
- google/siglip2-base-patch16-224 (max diff < 2e-5)
- katuni4ka/tiny-random-SiglipModel (max diff < 3e-7)
- Rename Siglip to SigLip to follow MpNet naming pattern
- Fix docs: -> doc: typo in siglip_vision.ex
Copilot AI review requested due to automatic review settings · January 3, 2026 16:49

Copilot AI left a comment

Pull request overview

This PR adds comprehensive support for the SigLIP (Sigmoid Loss for Language Image Pre-Training) model, a CLIP-like architecture that uses sigmoid loss instead of contrastive loss for improved scaling and training stability. The implementation follows established patterns from the CLIP model while incorporating SigLIP-specific architectural differences.

Key changes:

  • Implements three new model modules: multimodal SigLip, text encoder SigLipText, and vision encoder SigLipVision
  • SigLipVision features a distinctive multi-head attention pooling mechanism with a learned probe vector, differentiating it from CLIP's simpler pooling approach
  • Adds a learnable bias parameter to similarity logits, a key architectural difference from CLIP

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • lib/bumblebee/multimodal/siglip.ex: Implements the combined SigLIP model with text and vision encoders, including custom scale and bias layers for logit transformation
  • lib/bumblebee/text/siglip_text.ex: Implements the SigLIP text encoder with last-token pooling strategy
  • lib/bumblebee/vision/siglip_vision.ex: Implements the SigLIP vision encoder with attention pooling head and support for image classification architecture
  • lib/bumblebee.ex: Registers SigLIP model variants and maps HuggingFace model types to internal modules
  • mix.exs: Adds SigLIP modules to the project's main modules list
  • test/bumblebee/multimodal/siglip_test.exs: Adds integration test for the combined multimodal SigLIP model using a tiny test model


Comment on lines +1 to +45
defmodule Bumblebee.Multimodal.SigLipTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"})

    assert %Bumblebee.Multimodal.SigLip{architecture: :base} = spec

    # Image size is 30x30 for this tiny model
    inputs = %{
      "input_ids" =>
        Nx.tensor([
          [10, 20, 30, 40, 50, 60, 70, 80, 1, 1],
          [15, 25, 35, 45, 55, 65, 75, 85, 1, 1]
        ]),
      "pixel_values" =>
        Nx.concatenate([
          Nx.broadcast(0.25, {1, 30, 30, 3}),
          Nx.broadcast(0.75, {1, 30, 30, 3})
        ])
    }

    outputs = Axon.predict(model, params, inputs)

    assert Nx.shape(outputs.logits_per_text) == {2, 2}
    assert Nx.shape(outputs.logits_per_image) == {2, 2}

    assert_all_close(
      outputs.logits_per_text,
      Nx.tensor([[-0.0626, -0.0771], [-0.0961, -0.1548]]),
      atol: 1.0e-3
    )

    assert_all_close(
      outputs.logits_per_image,
      Nx.tensor([[-0.0626, -0.0961], [-0.0771, -0.1548]]),
      atol: 1.0e-3
    )
  end
end

Copilot AI Jan 3, 2026


The PR introduces standalone text and vision encoder modules (SigLipText and SigLipVision) similar to CLIP, but only includes tests for the combined multimodal model. Following the pattern used by CLIP (which has separate test files for ClipText and ClipVision), there should be dedicated test files for Bumblebee.Text.SigLipText and Bumblebee.Vision.SigLipVision. This would ensure comprehensive test coverage for these standalone modules, particularly the :for_image_classification architecture supported by SigLipVision.
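
A skeleton for one of those tests might look like the following, mirroring the multimodal test above. The checkpoint reuses the tiny model above for illustration, the assertion is shape-only, and real reference values would need to be generated by running the checkpoint through HuggingFace transformers:

defmodule Bumblebee.Vision.SigLipVisionTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    # Loads just the vision tower from the combined checkpoint, as the
    # CLIP tests do with the module: option
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"},
               module: Bumblebee.Vision.SigLipVision,
               architecture: :base
             )

    assert %Bumblebee.Vision.SigLipVision{architecture: :base} = spec

    # Same 30x30 image size as the tiny multimodal test
    inputs = %{"pixel_values" => Nx.broadcast(0.5, {1, 30, 30, 3})}
    outputs = Axon.predict(model, params, inputs)

    # Shape-only check; reference values should come from the Python
    # implementation
    assert {1, _hidden_size} = Nx.shape(outputs.pooled_state)
  end
end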

@georgeguimaraes
Contributor Author

I forgot to mention why I'm adding SigLIP. There's a cool NSFW image detection model that uses SigLIP2: https://huggingface.co/blog/prithivMLmods/image-guard-models
