feat: Add SigLIP model support #438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add SigLIP (Sigmoid Loss for Language Image Pre-Training) model support, including a vision encoder, a text encoder, and a multimodal model.

Key features:

- Vision encoder with attention pooling head
- Text encoder with final layer norm and projection head
- Multimodal model with sigmoid-based similarity scoring
- Support for both SigLIP v1 and SigLIP2 models (same architecture)

The implementation includes a fix for batch broadcasting in the attention pooling head to support batch sizes > 1.

Verified against HuggingFace transformers with:

- google/siglip-base-patch16-224 (max diff < 8e-6)
- google/siglip2-base-patch16-224 (max diff < 2e-5)
- katuni4ka/tiny-random-SiglipModel (max diff < 3e-7)
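For context, here is a minimal end-to-end usage sketch of the multimodal model. It is not taken from this PR: it assumes the checkpoint's tokenizer and featurizer load through Bumblebee's generic `load_tokenizer/1` and `load_featurizer/1`, and the `image` variable is a placeholder for an already-loaded image, so treat it as illustrative rather than as the PR's API.

```elixir
# Hypothetical usage sketch, not part of this PR. Assumes the SigLIP checkpoint
# ships a tokenizer and featurizer that Bumblebee's generic loaders can handle.
{:ok, %{model: model, params: params}} =
  Bumblebee.load_model({:hf, "google/siglip-base-patch16-224"})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google/siglip-base-patch16-224"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "google/siglip-base-patch16-224"})

# `image` is a placeholder for an already-loaded image (e.g. an StbImage or Nx tensor)
text_inputs = Bumblebee.apply_tokenizer(tokenizer, ["a photo of a cat", "a photo of a dog"])
image_inputs = Bumblebee.apply_featurizer(featurizer, [image])

outputs = Axon.predict(model, params, Map.merge(text_inputs, image_inputs))

# Unlike CLIP, SigLIP logits are meant to go through a sigmoid (not a softmax),
# giving an independent match probability per text-image pair.
probabilities = Nx.sigmoid(outputs.logits_per_image)
```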
- Rename Siglip to SigLip to follow MpNet naming pattern
- Fix `docs:` to `doc:` typo in siglip_vision.ex
Pull request overview
This PR adds comprehensive support for the SigLIP (Sigmoid Loss for Language Image Pre-Training) model, a CLIP-like architecture that uses sigmoid loss instead of contrastive loss for improved scaling and training stability. The implementation follows established patterns from the CLIP model while incorporating SigLIP-specific architectural differences.
Key changes:
- Implements three new model modules: multimodal SigLip, text encoder SigLipText, and vision encoder SigLipVision
- SigLipVision features a distinctive multi-head attention pooling mechanism with a learned probe vector, differentiating it from CLIP's simpler pooling approach
- Adds a learnable bias parameter to similarity logits, a key architectural difference from CLIP
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/bumblebee/multimodal/siglip.ex | Implements the combined SigLIP model with text and vision encoders, including custom scale and bias layers for logit transformation |
| lib/bumblebee/text/siglip_text.ex | Implements the SigLIP text encoder with last-token pooling strategy |
| lib/bumblebee/vision/siglip_vision.ex | Implements the SigLIP vision encoder with attention pooling head and support for image classification architecture |
| lib/bumblebee.ex | Registers SigLIP model variants and maps HuggingFace model types to internal modules |
| mix.exs | Adds SigLIP modules to the project's main modules list |
| test/bumblebee/multimodal/siglip_test.exs | Adds integration test for the combined multimodal SigLIP model using a tiny test model |
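The "custom scale and bias layers" in the table refer to SigLIP applying a learned multiplicative scale and additive bias to the cosine-similarity matrix before the sigmoid. Below is a hedged, pure-Nx sketch of that computation; the parameter names `logit_scale` and `logit_bias` and the `exp`-parameterized scale follow the reference SigLIP formulation and are assumptions, not necessarily how this PR structures its Axon layers.

```elixir
# Sketch of SigLIP-style logit computation with Nx. Names and the exp()
# parameterization of the scale are assumptions based on the SigLIP paper,
# not necessarily this PR's internals.
defmodule SigLipLogitsSketch do
  def logits(text_embeddings, image_embeddings, logit_scale, logit_bias) do
    # L2-normalize both embedding sets so the dot product is a cosine similarity
    text = normalize(text_embeddings)
    image = normalize(image_embeddings)

    # {num_texts, num_images} similarity matrix, scaled and shifted
    text
    |> Nx.dot([1], image, [1])
    |> Nx.multiply(Nx.exp(logit_scale))
    |> Nx.add(logit_bias)
  end

  defp normalize(embeddings) do
    norm = Nx.sqrt(Nx.sum(Nx.multiply(embeddings, embeddings), axes: [-1], keep_axes: true))
    Nx.divide(embeddings, norm)
  end
end

# Each entry is then squashed independently with Nx.sigmoid/1, in contrast to
# CLIP's softmax over the batch.
```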
test/bumblebee/multimodal/siglip_test.exs:

```elixir
defmodule Bumblebee.Multimodal.SigLipTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"})

    assert %Bumblebee.Multimodal.SigLip{architecture: :base} = spec

    # Image size is 30x30 for this tiny model
    inputs = %{
      "input_ids" =>
        Nx.tensor([
          [10, 20, 30, 40, 50, 60, 70, 80, 1, 1],
          [15, 25, 35, 45, 55, 65, 75, 85, 1, 1]
        ]),
      "pixel_values" =>
        Nx.concatenate([
          Nx.broadcast(0.25, {1, 30, 30, 3}),
          Nx.broadcast(0.75, {1, 30, 30, 3})
        ])
    }

    outputs = Axon.predict(model, params, inputs)

    assert Nx.shape(outputs.logits_per_text) == {2, 2}
    assert Nx.shape(outputs.logits_per_image) == {2, 2}

    assert_all_close(
      outputs.logits_per_text,
      Nx.tensor([[-0.0626, -0.0771], [-0.0961, -0.1548]]),
      atol: 1.0e-3
    )

    assert_all_close(
      outputs.logits_per_image,
      Nx.tensor([[-0.0626, -0.0961], [-0.0771, -0.1548]]),
      atol: 1.0e-3
    )
  end
end
```
Copilot AI · Jan 3, 2026
The PR introduces standalone text and vision encoder modules (SigLipText and SigLipVision) similar to CLIP, but only includes tests for the combined multimodal model. Following the pattern used by CLIP (which has separate test files for ClipText and ClipVision), there should be dedicated test files for Bumblebee.Text.SigLipText and Bumblebee.Vision.SigLipVision. This would ensure comprehensive test coverage for these standalone modules, particularly the :for_image_classification architecture supported by SigLipVision.
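For reference, here is a sketch of what such a standalone test could look like, modeled on the multimodal test above. The reuse of the same tiny checkpoint, the `module:`/`architecture:` load options, and the `hidden_state`/`pooled_state` output names are assumptions based on how other Bumblebee vision models are tested; the expected tensor values would have to be computed against HuggingFace transformers before being asserted.

```elixir
# Hypothetical test sketch, not part of this PR. It reuses the tiny checkpoint
# from the multimodal test and assumes the vision encoder exposes
# `hidden_state` and `pooled_state` outputs like other Bumblebee vision models.
defmodule Bumblebee.Vision.SigLipVisionTest do
  use ExUnit.Case, async: true

  import Bumblebee.TestHelpers

  @moduletag model_test_tags()

  test ":base" do
    assert {:ok, %{model: model, params: params, spec: spec}} =
             Bumblebee.load_model({:hf, "katuni4ka/tiny-random-SiglipModel"},
               module: Bumblebee.Vision.SigLipVision,
               architecture: :base
             )

    assert %Bumblebee.Vision.SigLipVision{architecture: :base} = spec

    inputs = %{"pixel_values" => Nx.broadcast(0.5, {1, 30, 30, 3})}

    outputs = Axon.predict(model, params, inputs)

    # Shape-only assertions; reference values should be computed against
    # HuggingFace transformers before being hard-coded here.
    assert Nx.rank(outputs.hidden_state) == 3
    assert Nx.rank(outputs.pooled_state) == 2
  end
end
```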
Forgot to mention why I'm working on SigLIP. There's a cool NSFW image detection model that uses SigLIP2: https://huggingface.co/blog/prithivMLmods/image-guard-models
Adds support for SigLIP (Sigmoid Loss for Language Image Pre-Training), a CLIP-like model that uses sigmoid loss instead of contrastive loss.
The implementation includes:

- `Bumblebee.Multimodal.SigLip` for the combined text-image model
- `Bumblebee.Text.SigLipText` for the text encoder
- `Bumblebee.Vision.SigLipVision` for the vision encoder (with `:base` and `:for_image_classification` architectures)

SigLIP differs from CLIP mainly in the attention pooling head for the vision model and the addition of a learnable bias to the logits. The vision model uses a multi-head attention pooling mechanism with a learned probe vector.
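To make the pooling concrete, here is a hedged single-head Nx sketch of probe-based attention pooling. The actual head is multi-head with learned projections inside `SigLipVision`; the name `probe` (a learned `{1, 1, hidden}` parameter) and the single-head simplification are assumptions for illustration. It also shows the batch broadcast of the probe that the batch-size > 1 fix in the description refers to.

```elixir
# Single-head sketch of probe-based attention pooling over patch tokens.
# The real SigLIP head is multi-head with learned query/key/value projections;
# this is a simplified illustration, not the PR's implementation.
defmodule AttentionPoolSketch do
  # hidden_state: {batch, seq_len, hidden}; probe: learned {1, 1, hidden} parameter
  def pool(hidden_state, probe) do
    {batch, _seq_len, hidden} = Nx.shape(hidden_state)

    # Broadcast the learned probe across the batch (the kind of batch
    # broadcasting the PR description mentions fixing).
    query = Nx.broadcast(probe, {batch, 1, hidden})

    # Attention weights of the probe over all patch tokens: {batch, 1, seq_len}
    scores =
      query
      |> Nx.dot([2], [0], hidden_state, [2], [0])
      |> Nx.divide(Nx.sqrt(hidden))

    weights = Axon.Activations.softmax(scores, axis: -1)

    # Weighted sum over the sequence -> {batch, 1, hidden}, squeezed to {batch, hidden}
    weights
    |> Nx.dot([2], [0], hidden_state, [1], [0])
    |> Nx.squeeze(axes: [1])
  end
end
```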
Regarding naming: I went with `SigLip` (capitalizing each word part) to follow the pattern used by `MpNet`.

Note on the test: the tiny model used for testing comes from `katuni4ka/tiny-random-SiglipModel` rather than the usual `hf-internal-testing` namespace. I couldn't find a tiny SigLIP model there. The model works correctly against real HuggingFace models like `google/siglip-base-patch16-224`.

I also started working on SigLIP2 support but left it out of this PR to keep things focused. The architecture is similar but has some differences in the vision encoder (uses SwiGLU activation, RMSNorm, and 2D RoPE).