Conversation


@ngxson (Owner) commented Dec 22, 2025

Mirror from upstream PR: ggml-org#18256

Summary by CodeRabbit

  • New Features
    • Added support for Gemma3n multimodal (vision) model variants
    • Integrated MobileNetV5 vision encoder architecture for enhanced visual processing
    • Extended model registrations and tensor mappings for Gemma3n vision projector support


…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp

2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size

coderabbitai bot commented Dec 22, 2025

Walkthrough

Adds comprehensive support for Gemma3n multimodal (Vision) variants through a MobileNetV5-based vision encoder. Introduces new tensor types, constants, Python conversion utilities, and C++ graph builders to handle vision input processing and integration with the Gemma3n language model.

Changes

  • Python Conversion Script (convert_hf_to_gguf.py)
    Introduced Gemma3nVisionModel class with vision-specific parameter handling (find_hparam, find_vparam), tensor quantization selection, and tensor name remapping for vision components. Extended Gemma3NModel registration to support both Gemma3nForCausalLM and Gemma3nForConditionalGeneration aliases. Added vocab override logic and per-layer tensor modification for multimodal variants.
  • Tensor Constants & Mappings (gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py)
    Added GEMMA3N vision projector type enum and new MODEL_TENSOR members (V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, V_MM_POST_PROJ_NORM) with public name mappings (mm.embedding, mm.hard_emb_norm, mm.post_proj_norm). Extended tensor mapping configuration for MMPROJ and Gemma3n architectures.
  • Vision Encoder Implementation (tools/mtmd/models/mobilenetv5.cpp, tools/mtmd/models/models.h)
    Introduced complete clip_graph_mobilenetv5 graph builder class with helper functions for RMSNorm, SAME padding, edge residual blocks, inverted residual blocks, and multi-head attention blocks. Implements full MobileNetV5 architecture with stem convolution, staged blocks, multi-scale fusion adapter (MSFA), and Gemma3n-style embedding projection pipeline.
  • Vision Model Integration (tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp)
    Added PROJECTOR_TYPE_GEMMA3N enum and expanded tensor name macros for MobileNetV5 components (stem, blocks, attention, MSFA). Extended clip_model struct with mobilenetv5_block, mobilenet_blocks, mobilenet_stage_ends, stem/normalization weights, and MSFA components. Integrated GEMMA3N pathways in graph construction, model loading, preprocessing, and token/embedding sizing.
  • Build Configuration & Gemma3n Inference (tools/mtmd/CMakeLists.txt, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp)
    Added models/mobilenetv5.cpp to mtmd library sources. Implemented embedding input path in get_per_layer_inputs to handle vision token projection and padding token broadcast. Aligned GEMMA3N token setup with GEMMA3 conventions for image boundaries and causal decoding.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description check: ⚠️ Warning. The PR description only references an upstream mirror URL without providing substantive information about changes, scope, or rationale for the feature being added. Resolution: provide a meaningful description explaining the purpose of Gemma3n multimodal support, the role of MobileNetV5, the testing approach, and any breaking changes or migration notes.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The title accurately and concisely describes the main change: adding support for Gemma3n multimodal models with a MobileNetV5 vision encoder, which is the primary objective across the multiple files modified.

@ngxson changed the title from "Add Gemma3n multimodal support with MobileNetV5 vision encoder" to "[Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder" on Dec 22, 2025

coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)

69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns.

The clip_graph_mobilenetv5 struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in mobilenetv5.cpp (per the code snippets).

The mobilenetv5_block type reference should resolve correctly via the include chain (clip-graph.h → clip-model.h).

Optional: Consider making helper methods private

The helper methods (rms_norm_2d, pad_same_2d, build_edge_residual, build_inverted_residual, build_mobilenet_attn) are implementation details that could be declared as private. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.

tools/mtmd/clip-model.h (1)

330-345: Appropriate additions for Gemma3n MobileNetV5 encoder.

The additions to clip_model are well-structured:

  • MobileNetV5 components use std::vector for flexible block management
  • MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
  • Naming conventions are consistent with existing fields

Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.

tools/mtmd/clip.cpp (2)

1619-1622: Use tensor name macros instead of hard-coded strings.

For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in clip-impl.h:

-                    model.mm_0_w = get_tensor("mm.embedding.weight", false);  // Input embedding
-                    model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false);  // Hard embedding norm
+                    model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);  // Input embedding
+                    model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false);  // Hard embedding norm

This would require adding corresponding macros to clip-impl.h.


1528-1623: Consider adding bounds checking for stage index access.

The dynamic block discovery loop iterates over 4 stages (0-3), but the mobilenet_stage_ends vector is populated based on actual blocks found. When accessing model.mobilenet_stage_ends[2] and [3] later in mobilenetv5.cpp::is_fusion_point(), ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.

The current check at line 284-288 (if (model.mobilenet_stage_ends.size() >= 4)) in mobilenetv5.cpp handles this, but consider adding a validation after loading:

if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", 
            __func__, model.mobilenet_stage_ends.size());
}
tools/mtmd/models/mobilenetv5.cpp (2)

329-336: Remove or use the commented-out variable scale_h.

The variable scale_h is declared but commented out. Either use it for validation or remove it to avoid confusion:

                 int scale_w = high_res_w / feat_w;
-                // int scale_h = high_res_h / feat_h;

If height scaling should be validated separately:

int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", 
            __func__, scale_w, scale_h);
}

381-381: Consider making target output resolution configurable.

The target output resolution is hard-coded as 16:

const int target_out_res = 16;

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f48807 and 86618c7.

📒 Files selected for processing (11)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
  • build (252-463)
  • build (252-252)
  • rms_norm_2d (5-20)
  • rms_norm_2d (5-5)
  • pad_same_2d (23-53)
  • pad_same_2d (23-23)
  • build_edge_residual (57-88)
  • build_edge_residual (57-57)
  • build_inverted_residual (90-151)
  • build_inverted_residual (90-90)
  • build_mobilenet_attn (154-250)
  • build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
  • model (1159-1161)
  • model (1159-1159)
  • string_format (399-412)
  • string_format (399-399)
src/llama-model.cpp (2)
  • get_tensor (7044-7054)
  • get_tensor (7044-7044)
tools/server/server-context.cpp (2)
  • params (607-853)
  • params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
  • ggml_permute (3700-3752)
  • ggml_rms_norm (3066-3071)
  • ggml_pad_ext (4983-5016)
  • ggml_conv_2d_direct (4702-4736)
  • ggml_gelu (2677-2681)
  • ggml_conv_2d_dw (4637-4658)
  • ggml_reshape_4d (3583-3601)
  • ggml_reshape_3d (3564-3581)
  • ggml_scale (3290-3295)
  • ggml_soft_max (3966-3970)
  • ggml_upscale (4928-4935)
  • ggml_concat (2517-2544)
  • ggml_pool_2d (4852-4878)
tools/mtmd/clip.cpp (9)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • model (217-219)
  • model (935-1261)
  • model (935-935)
  • model (2038-2051)
  • model (2038-2038)
  • s (2446-2448)
  • s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
  • ctx (2490-2593)
  • ctx (2490-2490)
  • clip_get_projector_type (3737-3739)
  • clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1076-1077)
  • add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (2)
  • VisionProjectorType (3408-3429)
  • GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py

[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8

🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: android-build
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)

262-306: Vision input path implementation looks correct, but verify commented-out scaling.

The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via ggml_scale(..., 0.0f) is valid.

A few observations:

  1. Line 305: The sqrtf((float) n_embd_altup) scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.

  2. The inp->embd tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from per_layer_model_proj projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.
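
A rough NumPy sketch of the padding-token broadcast described above (per-layer projection of the padding token repeated across every vision position; all shapes here are illustrative assumptions, not the actual model dimensions):

import numpy as np

n_layer, n_embd_altup, n_vision_tokens = 30, 256, 256

# per-layer inputs derived from the padding token (ID 0), one row per layer
pad_per_layer = np.random.rand(n_layer, n_embd_altup).astype(np.float32)

# broadcast the same padding-token rows to every vision token position
per_layer_inputs = np.broadcast_to(pad_per_layer[None, :, :],
                                   (n_vision_tokens, n_layer, n_embd_altup))
print(per_layer_inputs.shape)  # (256, 30, 256)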

tools/mtmd/mtmd.cpp (2)

269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling.

The addition of PROJECTOR_TYPE_GEMMA3N alongside PROJECTOR_TYPE_GEMMA3 correctly sets up the same <start_of_image> and <end_of_image> tokens for the Gemma3n vision path.


861-866: LGTM - Non-causal decode handling extended to GEMMA3N.

The logic correctly includes PROJECTOR_TYPE_GEMMA3N in the non-causal decoding path, maintaining parity with GEMMA3.

Minor observation: Consider extracting the repeated clip_get_projector_type(ctx->ctx_v) call to a local variable for readability, though this is optional given the function is lightweight.

tools/mtmd/CMakeLists.txt (1)

30-30: LGTM - MobileNetV5 source file added to build.

The new models/mobilenetv5.cpp is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.

gguf-py/gguf/tensor_mapping.py (1)

122-142: LGTM - New Gemma3n vision tensor mappings added.

The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the MODEL_TENSOR enums defined in constants.py.

Note: The comments label these as "gemma3n", which is accurate for V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, and V_MM_POST_PROJ_NORM. For V_MM_INP_PROJ and V_MM_SOFT_EMB_NORM, the constants.py comments indicate "gemma3" but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.

tools/mtmd/clip-model.h (1)

175-212: Well-structured MobileNetV5 block definition.

The mobilenetv5_block struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:

  • Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
  • Multi-Query Attention (MQA) components with optional downsampling
  • Layer scale and block normalization

The struct follows the existing naming conventions and patterns in the file.

gguf-py/gguf/constants.py (3)

670-672: LGTM: New gemma3n tensor types properly defined.

The three new tensor types (V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, V_MM_POST_PROJ_NORM) are:

  • Properly prefixed with V_MM_ following the existing naming convention
  • Clearly documented as gemma3n-specific
  • Correctly placed within the MODEL_TENSOR enum

1065-1067: Correct tensor mappings and MMPROJ integration.

The tensor name mappings and MMPROJ architecture additions are properly implemented:

  • String names follow the mm.* convention used for multimodal tensors
  • Tensors are correctly added to MODEL_TENSORS[MODEL_ARCH.MMPROJ] list
  • Consistent with existing patterns in the file

Also applies to: 1166-1168


1947-1981: Complete GEMMA3N architecture tensor list.

The MODEL_ARCH.GEMMA3N tensor list is comprehensive and well-organized:

  • Includes all standard Gemma3 tensors (token embedding, attention, FFN)
  • Properly extends with gemma3n-specific components:
    • Per-layer tensors (PER_LAYER_TOKEN_EMBD, PER_LAYER_MODEL_PROJ, etc.)
    • Altup tensors for alternative upsampling/routing
    • Laurel tensors for layer-wise processing
  • Comments clearly indicate the purpose of specialized tensor groups

This ensures proper serialization and deserialization of Gemma3n models.

tools/mtmd/clip.cpp (4)

3128-3133: Potential issue with n_patches calculation for GEMMA3N.

The calculation n_patches = image_size / patch_size computes a single dimension (e.g., 16 for 256/16), but n_patches typically represents the total number of patches (i.e., (image_size / patch_size)^2). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16.

If the output is indeed 16x16, the calculation should be:

-            n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            n_patches = n_per_side * n_per_side;

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.


791-794: LGTM: GEMMA3N graph builder integration.

The new case for PROJECTOR_TYPE_GEMMA3N correctly uses the dedicated clip_graph_mobilenetv5 builder, consistent with how other projector types are handled.


1148-1155: LGTM: GEMMA3N hparams configuration.

The default n_merge = 1 with an optional override via KEY_PROJ_SCALE_FACTOR follows the established pattern for other projector types.


2859-2867: LGTM: GEMMA3N preprocessing.

The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (false parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format.
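
For reference, a small Pillow/NumPy sketch of that style of preprocessing (square bilinear resize followed by mean/std normalization; the size and normalization constants are placeholders, not the model's configured values):

import numpy as np
from PIL import Image

def preprocess(path: str, size: int, mean, std) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((size, size), Image.BILINEAR)  # square resize, no padding
    x = np.asarray(img, dtype=np.float32) / 255.0                               # (H, W, 3) in [0, 1]
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    return x.transpose(2, 0, 1)                                                 # (3, H, W) channel-first layout

# x = preprocess("image.png", 768, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))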

tools/mtmd/models/mobilenetv5.cpp (4)

5-20: LGTM: RMS Norm 2D helper implementation.

The rms_norm_2d helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of ggml_cont after permute ensures the tensor is contiguous for subsequent operations.
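
For intuition, a minimal NumPy sketch of channel-wise RMSNorm over a (C, H, W) feature map (layout, epsilon, and names are assumptions, not the ggml implementation):

import numpy as np

def rms_norm_2d(x, weight=None, eps=1e-6):
    # x: (C, H, W); normalize over channels independently at each spatial position
    rms = np.sqrt(np.mean(x ** 2, axis=0, keepdims=True) + eps)
    y = x / rms
    if weight is not None:
        y = y * weight[:, None, None]  # optional learned per-channel scale
    return y

x = np.random.rand(8, 4, 4).astype(np.float32)
print(rms_norm_2d(x, np.ones(8, dtype=np.float32)).shape)  # (8, 4, 4)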


22-53: LGTM: SAME padding implementation.

The pad_same_2d helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior.
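
As a rough illustration of that asymmetric SAME-padding arithmetic (plain Python with a hypothetical helper name, not the ggml code):

import math

def same_pad_1d(in_size, kernel, stride, dilation=1):
    # Returns (pad_before, pad_after) for one spatial dimension, TF-style SAME padding.
    out_size = math.ceil(in_size / stride)
    eff_kernel = (kernel - 1) * dilation + 1
    total = max(0, (out_size - 1) * stride + eff_kernel - in_size)
    before = total // 2
    after = total - before  # the odd extra pixel lands on the bottom/right side
    return before, after

print(same_pad_1d(8, 3, 2))  # (0, 1): extra padding goes to the right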


153-250: LGTM: Attention block implementation.

The build_mobilenet_attn function correctly implements multi-query attention (see the sketch after this list) with:

  • Optional input normalization
  • Downsampled K/V paths using depthwise convolutions
  • Proper Q/K/V reshaping and permutation for attention
  • Scaled dot-product attention with softmax
  • Output projection with optional layer scaling and residual connection
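
A toy NumPy sketch of the multi-query attention pattern above, i.e. several query heads attending to a single shared K/V head (shapes are illustrative and the depthwise K/V downsampling is omitted; this is not the ggml graph):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, wq, wk, wv, wo, n_heads):
    # x: (seq, dim); wq: (dim, n_heads*dh); wk, wv: (dim, dh); wo: (n_heads*dh, dim)
    seq, _ = x.shape
    dh = wk.shape[1]
    q = (x @ wq).reshape(seq, n_heads, dh).transpose(1, 0, 2)  # (heads, seq, dh)
    k = x @ wk                                                 # (seq, dh), one shared key head
    v = x @ wv                                                 # (seq, dh), one shared value head
    att = softmax(q @ k.T / np.sqrt(dh))                       # (heads, seq, seq)
    out = (att @ v).transpose(1, 0, 2).reshape(seq, n_heads * dh)
    return out @ wo                                            # (seq, dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
out = multi_query_attention(x, rng.standard_normal((64, 128)), rng.standard_normal((64, 16)),
                            rng.standard_normal((64, 16)), rng.standard_normal((128, 64)), n_heads=8)
print(out.shape)  # (16, 64)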

403-463: LGTM: Gemma3n multimodal projection.

The embedding/projection logic correctly:

  1. Permutes and flattens spatial dimensions to sequence format
  2. Applies feature scaling by sqrt(hidden_size)
  3. Applies soft embedding normalization with optional learned weight
  4. Projects to text hidden size via linear layer
  5. Applies post-projection RMSNorm

This matches the expected Gemma3n vision embedder architecture.
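
A minimal NumPy sketch of that projection pipeline (dimensions, epsilon, and weight shapes are illustrative assumptions, not the actual Gemma3n configuration):

import numpy as np

def rms_norm(x, weight=None, eps=1e-6):
    y = x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return y * weight if weight is not None else y

def gemma3n_project(feat, w_soft_norm, w_proj, w_post_norm):
    # feat: (C, H, W) vision features -> (H*W, n_embd_text) token embeddings
    C, H, W = feat.shape
    x = feat.transpose(1, 2, 0).reshape(H * W, C)  # 1. flatten spatial dims to a sequence
    x = x * np.sqrt(C)                             # 2. feature scaling by sqrt(hidden_size)
    x = rms_norm(x, w_soft_norm)                   # 3. soft embedding norm
    x = x @ w_proj                                 # 4. linear projection to the text hidden size
    return rms_norm(x, w_post_norm)                # 5. post-projection RMSNorm

feat = np.random.rand(2048, 16, 16).astype(np.float32)
out = gemma3n_project(feat, np.ones(2048, np.float32),
                      np.random.rand(2048, 256).astype(np.float32) * 0.01,
                      np.ones(256, np.float32))
print(out.shape)  # (256, 256): 16x16 = 256 tokens in an assumed 256-dim text space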

tools/mtmd/clip-impl.h (2)

156-195: LGTM: MobileNetV5 tensor name macros.

The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The %d.%d format for stage/block indexing aligns with the dynamic loading logic in clip.cpp.


214-214: LGTM: GEMMA3N projector type registration.

The new PROJECTOR_TYPE_GEMMA3N enum value and its string mapping "gemma3n" are correctly placed and follow the existing pattern.

Also applies to: 245-245

convert_hf_to_gguf.py (1)

522-527: Robust handling of empty tensor_map.mapping for block_count=0 looks good

Using a guarded branch for max_name_len avoids ValueError when gguf.get_tensor_name_map(..., block_count=0) produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.
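
For illustration, a tiny sketch of one way to guard that computation (the PR uses an explicit branch; the names below are hypothetical):

# width for log alignment; an empty mapping (e.g. block_count == 0) must not raise ValueError
mapping: dict[str, str] = {}
max_name_len = max((len(name) for name in mapping), default=0)
print(max_name_len)  # 0 instead of ValueError on an empty sequence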

Comment on lines +6008 to +6012
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Read the convert_hf_to_gguf.py file around lines 6008-6012
find . -name "convert_hf_to_gguf.py" -type f | head -1 | xargs -I {} sh -c 'wc -l {} && sed -n "6000,6020p" {} | cat -n'

Repository: ngxson/llama.cpp

Length of output: 1253


Patch size computation for Gemma3n MobileNetV5 is off by a factor of √image_seq_length

The code currently computes patch_size = 768 // 256 = 3, which results in n_per_side = 256 and tokens = 65,536—wildly inconsistent with the 16×16 grid (256 tokens) described in the comment.

The issue is treating image_seq_length (total token count) as if it were patch size. Gemma 3n uses MobileNet v5 with a default image resolution of 768×768 pixels, and image_seq_length defaults to 256.

The correct approach derives patch size from the per-side token count:

n_per_side = int(image_seq_length ** 0.5)  # 16 for 256 tokens
patch_size = image_size // n_per_side      # 768 // 16 = 48

This yields n_per_side = 16 and tokens = 256, matching the expected grid layout and HF processor config.

🧰 Tools
🪛 GitHub Actions: flake8 Lint

[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8

🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6008 to 6012, the code treats
image_seq_length as a per-side patch count which yields an incorrect tiny
patch_size; instead compute n_per_side = int(image_seq_length ** 0.5) to get the
number of patches per side (e.g. 16 for 256 tokens) and then set patch_size =
image_size // n_per_side (e.g. 768 // 16 = 48), and assign that patch_size back
into self.hparams_vision so the tokens and grid (n_per_side²) match the HF
processor config.

RESAMPLER = auto()
GLM_EDGE = auto()
MERGER = auto()
GEMMA3N = auto()

⚠️ Potential issue | 🟡 Minor

Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.

The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.

🔎 Proposed fix

Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:

 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
+    VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
+    VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
 }

Note: QWEN3VL and COGVLM are also missing from this dictionary.

📝 Committable suggestion


Suggested change

    GEMMA3N = auto()

VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
    VISION_PROJECTOR_TYPE.MLP:       "mlp",
    VISION_PROJECTOR_TYPE.LDP:       "ldp",
    VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
    VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
    VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
    VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
    VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
    VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
    VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
}
🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 459 (GEMMA3N added to VISION_PROJECTOR_TYPE) and the VISION_PROJECTOR_TYPE_NAMES dictionary at lines ~850-858, the GEMMA3N enum value is not mapped to its string representation; add an entry mapping VISION_PROJECTOR_TYPE.GEMMA3N to "gemma3n" in that dictionary. Also add the missing mappings for QWEN3VL and COGVLM (e.g., VISION_PROJECTOR_TYPE.QWEN3VL -> "qwen3vl_merger" and VISION_PROJECTOR_TYPE.COGVLM -> "cogvlm") so all enum members have corresponding string names. Ensure formatting/commas match the surrounding dictionary entries.


ggml_cgraph * clip_graph_mobilenetv5::build() {

fprintf(stderr, "\n--- START build_mobilenetv5 ---\n");

🛠️ Refactor suggestion | 🟠 Major

Replace debug fprintf with LOG_INF or remove.

Debug output to stderr should use the project's logging infrastructure for consistency. Either use LOG_INF or remove this debug statement before merging:

-    fprintf(stderr, "\n--- START build_mobilenetv5 ---\n");
+    LOG_INF("%s: building MobileNetV5 graph\n", __func__);

Alternatively, if this is temporary debug code, consider removing it entirely.

📝 Committable suggestion


Suggested change
-    fprintf(stderr, "\n--- START build_mobilenetv5 ---\n");
+    LOG_INF("%s: building MobileNetV5 graph\n", __func__);
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around line 254, there is a debug fprintf
to stderr that should not be used; replace the fprintf(stderr, "\n--- START
build_mobilenetv5 ---\n"); with the project's logging macro (e.g.,
LOG_INF("START build_mobilenetv5");) or remove the statement entirely if it was
temporary debug code, ensuring you include any necessary header or namespace for
LOG_INF if used.

Comment on lines +330 to +336
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;

// Safety check for non-integer scaling if strictly replicating
if (high_res_w % feat_w != 0) {
fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n");
}

🛠️ Refactor suggestion | 🟠 Major

Use LOG_WRN for warning messages.

Replace fprintf with the project's logging macros:

-                if (high_res_w % feat_w != 0) { 
-                    fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n"); 
-                }
+                if (high_res_w % feat_w != 0) {
+                    LOG_WRN("%s: non-integer scaling detected in MSFA\n", __func__);
+                }
📝 Committable suggestion


Suggested change
-                int scale_w = high_res_w / feat_w;
-                // int scale_h = high_res_h / feat_h;
-
-                // Safety check for non-integer scaling if strictly replicating
-                if (high_res_w % feat_w != 0) {
-                    fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n");
-                }
+                int scale_w = high_res_w / feat_w;
+                // int scale_h = high_res_h / feat_h;
+
+                // Safety check for non-integer scaling if strictly replicating
+                if (high_res_w % feat_w != 0) {
+                    LOG_WRN("%s: non-integer scaling detected in MSFA\n", __func__);
+                }
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around lines 330 to 336, the code currently
uses fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n"); which
violates the project's logging conventions; replace that fprintf call with the
project's warning macro, e.g. LOG_WRN("Non-integer scaling detected in MSFA");
(use the same message text, remove the trailing newline, and ensure the file
includes the header that defines LOG_WRN if not already included).

Comment on lines +389 to +394
cur = ggml_pool_2d(ctx0, cur, GGML_OP_POOL_AVG, s, s, s, s, 0, 0);
} else {
fprintf(stderr, "Error: Irregular downsampling stride required.\n");
}

}

🛠️ Refactor suggestion | 🟠 Major

Use LOG_ERR for error messages and consider aborting.

Replace fprintf with proper error logging. Also, irregular downsampling may indicate a misconfiguration that should be handled more robustly:

             } else {
-                fprintf(stderr, "Error: Irregular downsampling stride required.\n");
+                LOG_ERR("%s: irregular downsampling stride required (current_w=%d, target=%d)\n", 
+                        __func__, current_w, target_out_res);
+                // Consider: GGML_ABORT or return nullptr to signal failure
             }

Committable suggestion skipped: line range outside the PR's diff.

@ngxson

This comment was marked as resolved.


coderabbitai bot commented Dec 22, 2025


🧩 Analysis chain

@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.


🏁 Script executed:

#!/bin/bash
# Fetch the upstream PR details
gh pr view 18256 --repo ggml-org/llama.cpp --json title,body,author

Length of output: 3458


🏁 Script executed:

#!/bin/bash
# Search for suspicious AI-generated code comment patterns
echo "=== Searching for suspicious AI-generated comments ==="
rg -i "CRITICAL:" --type py --type cpp --type c --type h || echo "No CRITICAL comments found"
rg -i "you are correct" --type py --type cpp --type c --type h || echo "No 'you are correct' comments found"
rg -i "TODO.*AI" --type py --type cpp --type c --type h || echo "No AI TODOs found"
rg -i "FIXME.*AI" --type py --type cpp --type c --type h || echo "No AI FIXMEs found"
rg -i "placeholder|stub|temporary" --type py --type cpp --type c --type h -C1 | head -50 || echo "No placeholder/stub comments found"

echo ""
echo "=== Checking for incomplete/commented code ==="
rg "//.*TODO|//.*FIXME|//.*XXX|//.*HACK" --type cpp --type c --type h -C1 | head -30 || echo "No flagged comments found"

Length of output: 10158


🏁 Script executed:

#!/bin/bash
# Review key implementation files for obvious issues

echo "=== Checking convert_hf_to_gguf.py Gemma3nVisionModel ==="
# Look at the Gemma3nVisionModel implementation
ast-grep --pattern 'class Gemma3nVisionModel' | head -100

echo ""
echo "=== Checking for potential issues in mobilenetv5.cpp ==="
# Check for common mistakes: division by zero, uninitialized variables, missing bounds checks
rg -n "/ (0|patch_size|stride)" tools/mtmd/models/mobilenetv5.cpp || echo "No obvious division issues"
rg -n "GGML_ASSERT|assert" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check tensor dimension handling ==="
# Look for potential dimension mismatches
rg -n "ne\[0\]|ne\[1\]|ne\[2\]|ne\[3\]" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Check for magic numbers ==="
rg -n "256|512|1024|2048" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 8976


🏁 Script executed:

#!/bin/bash
# Check for more specific issues

echo "=== Check patch_size calculation in convert_hf_to_gguf.py ==="
rg -n "patch_size = image_size //" convert_hf_to_gguf.py -A2 -B2

echo ""
echo "=== Check sqrt usage in patch_size calculation ==="
rg -n "sqrt|math.sqrt" convert_hf_to_gguf.py | grep -i patch || echo "No sqrt in patch calculation"

echo ""
echo "=== Check division operations in mobilenetv5.cpp ==="
rg -n "/ \w+;" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check for potential null pointer dereferences ==="
rg -n "block\.\w+->ne\[" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check gemma3n-iswa.cpp changes ==="
cat src/models/gemma3n-iswa.cpp | grep -A30 "inp->embd" | head -40

Length of output: 3571


🏁 Script executed:

#!/bin/bash
# Check the specific patch_size calculation logic

echo "=== Check image_seq_length and patch_size calculation logic ==="
rg -n "image_seq_length|patch_size" convert_hf_to_gguf.py -A1 -B1 | grep -A3 -B3 "6008\|6009\|6010\|6011"

echo ""
echo "=== Check if sqrt is needed for patch calculation ==="
echo "256 tokens = 16x16 grid, so sqrt(256) = 16"
echo "If image_size = 768 and we want 16x16 patches, patch_size should be 768/16 = 48"
echo "But code does: patch_size = 768 // 256 = 3"
echo ""

echo "=== Verify the calculation in context ==="
rg -n "image_seq_length.*256" convert_hf_to_gguf.py -B5 -A10

echo ""
echo "=== Check clip.cpp for n_patches calculation ==="
rg -n "n_patches.*GEMMA3N" tools/mtmd/clip.cpp -B5 -A5

Length of output: 2673


🏁 Script executed:

#!/bin/bash
# Check n_patches calculation in clip.cpp

echo "=== Find n_patches calculation for GEMMA3N ==="
rg -n "GEMMA3N" tools/mtmd/clip.cpp -B2 -A2 | grep -A5 -B5 "n_patches"

echo ""
echo "=== Check clip_n_output_tokens function ==="
rg -n "clip_n_output_tokens|n_output_tokens" tools/mtmd/clip.cpp -A10 | head -40

echo ""
echo "=== Compare with Gemma3 (non-N) implementation ==="
rg -n "case PROJECTOR_TYPE_GEMMA3:" tools/mtmd/clip.cpp -A5 | head -20

Length of output: 3033


🏁 Script executed:

#!/bin/bash
# Check the full n_patches calculation for GEMMA3N

echo "=== Check complete GEMMA3N n_patches logic ==="
rg -n "case PROJECTOR_TYPE_GEMMA3N:" tools/mtmd/clip.cpp -A10

echo ""
echo "=== Check if patch_size is used correctly ==="
rg -n "patch_size" tools/mtmd/clip.cpp | grep -i gemma3n -A2 -B2

echo ""
echo "=== Look for the formula: image_size / patch_size ==="
rg -n "image_size.*patch_size|patch_size.*image_size" tools/mtmd/clip.cpp

Length of output: 5510


🏁 Script executed:

#!/bin/bash
# Deep dive into potential issues

echo "=== Check if mobilenet block pointers can be null ==="
rg -n "block\.(dw_start_w|dw_mid_w|layer_scale_w)" tools/mtmd/models/mobilenetv5.cpp -B2 -A2

echo ""
echo "=== Check build_inverted_residual usage ==="
rg -n "build_inverted_residual|build_edge_residual|build_mobilenet_attn" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Verify model loading sets these pointers ==="
rg -n "dw_start_w|dw_mid_w|layer_scale" tools/mtmd/clip.cpp -A2 | head -50

Length of output: 4258


🏁 Script executed:

#!/bin/bash
# Final checks for issues

echo "=== Check for hardcoded values that should be configurable ==="
rg -n "2048|8.*head|4\s*\*.*hidden" convert_hf_to_gguf.py | grep -A2 -B2 "5990\|5994"

echo ""
echo "=== Check the Gemma3NModel vocab handling for potential issues ==="
rg -n "class Gemma3NModel" convert_hf_to_gguf.py -A100 | grep -A10 -B5 "vocab_size"

echo ""
echo "=== Check if there are any unused imports or variables ==="
rg -n "import.*mobilenet|from.*mobilenet" convert_hf_to_gguf.py

echo ""
echo "=== Check for potential integer overflow in calculations ==="
rg -n "int64_t|int32_t" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 4263


@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:

✅ Good News

No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.

🐛 Issues Found

1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)

The calculation patch_size = image_size // image_seq_length is semantically wrong:

# Current code:
image_seq_length = 256  # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3

Problem: A patch_size of 3 pixels would mean dividing a 768px image into 256×256 tiny patches, not 16×16 patches as the comment states.

Should be:

n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side  # = 768 // 16 = 48

This matches how Gemma3 (non-N) calculates it at line 5927. The C++ code at clip.cpp:3132 expects n_patches = image_size / patch_size, so the current buggy calculation accidentally produces the correct output (256 tokens) but with wrong semantics.

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
    if "intermediate_size" in keys:
        hidden_size = self.hparams_vision.get("hidden_size", 2048)  # Hardcoded default
        return hidden_size * 4  # Hardcoded expansion factor
    
    if "num_attention_heads" in keys or "num_heads" in keys:
        return 8  # Hardcoded MQA heads

These should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof.

3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)

The code accesses block.layer_scale_w->ne[0]; in the flagged blocks the dereference sits inside a null check on layer_scale_w:

if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
        1, 1, block.layer_scale_w->ne[0], 1);  // Safe: inside if-block

This is actually safe due to the if-check, but similar patterns exist that should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

int C = cur->ne[2]; // Should be 2048

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;  // Unused

The contributor left commented-out code that should probably be removed.

📝 Recommendations

The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.


✏️ Learnings added
Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

