[Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder #64
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp
2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size
Walkthrough
Adds comprehensive support for Gemma3n multimodal (vision) variants through a MobileNetV5-based vision encoder. Introduces new tensor types, constants, Python conversion utilities, and C++ graph builders to handle vision input processing and integration with the Gemma3n language model.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)
69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns.
The `clip_graph_mobilenetv5` struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in `mobilenetv5.cpp` (per the code snippets). The `mobilenetv5_block` type reference should resolve correctly via the include chain (`clip-graph.h` → `clip-model.h`).
Optional: Consider making helper methods private
The helper methods (`rms_norm_2d`, `pad_same_2d`, `build_edge_residual`, `build_inverted_residual`, `build_mobilenet_attn`) are implementation details that could be declared as `private`. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.
tools/mtmd/clip-model.h (1)
330-345: Appropriate additions for Gemma3n MobileNetV5 encoder.
The additions to `clip_model` are well-structured:
- MobileNetV5 components use `std::vector` for flexible block management
- MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
- Naming conventions are consistent with existing fields
Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.
tools/mtmd/clip.cpp (2)
1619-1622: Use tensor name macros instead of hard-coded strings.
For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in `clip-impl.h`:

```diff
-    model.mm_0_w = get_tensor("mm.embedding.weight", false);     // Input embedding
-    model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false); // Hard embedding norm
+    model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);     // Input embedding
+    model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false); // Hard embedding norm
```

This would require adding corresponding macros to `clip-impl.h`.
1528-1623: Consider adding bounds checking for stage index access.
The dynamic block discovery loop iterates over 4 stages (0-3), but the `mobilenet_stage_ends` vector is populated based on actual blocks found. When accessing `model.mobilenet_stage_ends[2]` and `[3]` later in `mobilenetv5.cpp::is_fusion_point()`, ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.
The current check at lines 284-288 (`if (model.mobilenet_stage_ends.size() >= 4)`) in `mobilenetv5.cpp` handles this, but consider adding a validation after loading:

```cpp
if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", __func__, model.mobilenet_stage_ends.size());
}
```

tools/mtmd/models/mobilenetv5.cpp (2)
329-336: Remove or use the commented-out variable `scale_h`.
The variable `scale_h` is declared but commented out. Either use it for validation or remove it to avoid confusion:

```diff
 int scale_w = high_res_w / feat_w;
-// int scale_h = high_res_h / feat_h;
```

If height scaling should be validated separately:

```cpp
int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", __func__, scale_w, scale_h);
}
```
381-381: Consider making target output resolution configurable.
The target output resolution is hard-coded as 16:

```cpp
const int target_out_res = 16;
```

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip-model.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/mobilenetv5.cpp
- tools/mtmd/models/models.h
- tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files: tools/mtmd/models/models.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files: tools/mtmd/models/models.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp, gguf-py/gguf/tensor_mapping.py, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h, convert_hf_to_gguf.py, gguf-py/gguf/constants.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files: gguf-py/gguf/tensor_mapping.py, convert_hf_to_gguf.py, gguf-py/gguf/constants.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files: src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h, gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
build (252-463), build (252-252), rms_norm_2d (5-20), rms_norm_2d (5-5), pad_same_2d (23-53), pad_same_2d (23-23), build_edge_residual (57-88), build_edge_residual (57-57), build_inverted_residual (90-151), build_inverted_residual (90-90), build_mobilenet_attn (154-250), build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
model (1159-1161), model (1159-1159), string_format (399-412), string_format (399-399)
src/llama-model.cpp (2)
get_tensor (7044-7054), get_tensor (7044-7044)
tools/server/server-context.cpp (2)
params (607-853), params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
ggml_permute (3700-3752), ggml_rms_norm (3066-3071), ggml_pad_ext (4983-5016), ggml_conv_2d_direct (4702-4736), ggml_gelu (2677-2681), ggml_conv_2d_dw (4637-4658), ggml_reshape_4d (3583-3601), ggml_reshape_3d (3564-3581), ggml_scale (3290-3295), ggml_soft_max (3966-3970), ggml_upscale (4928-4935), ggml_concat (2517-2544), ggml_pool_2d (4852-4878)
tools/mtmd/clip.cpp (9)
build_inp_raw (469-474), build_inp_raw (469-469), model (217-219), model (935-1261), model (935-935), model (2038-2051), model (2038-2038), s (2446-2448), s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
ctx (2490-2593), ctx (2490-2490), clip_get_projector_type (3737-3739), clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
add_clip_projector_type (1076-1077), add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (2)
VisionProjectorType (3408-3429), GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py
[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: android-build
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)
262-306: Vision input path implementation looks correct, but verify commented-out scaling.
The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via `ggml_scale(..., 0.0f)` is valid.
A few observations:
- Line 305: The `sqrtf((float) n_embd_altup)` scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.
- The `inp->embd` tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from the `per_layer_model_proj` projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.
tools/mtmd/mtmd.cpp (2)
269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling.
The addition of `PROJECTOR_TYPE_GEMMA3N` alongside `PROJECTOR_TYPE_GEMMA3` correctly sets up the same `<start_of_image>` and `<end_of_image>` tokens for the Gemma3n vision path.
861-866: LGTM - Non-causal decode handling extended to GEMMA3N.
The logic correctly includes `PROJECTOR_TYPE_GEMMA3N` in the non-causal decoding path, maintaining parity with GEMMA3.
Minor observation: Consider extracting the repeated `clip_get_projector_type(ctx->ctx_v)` call to a local variable for readability, though this is optional given the function is lightweight.
30-30: LGTM - MobileNetV5 source file added to build.
The new `models/mobilenetv5.cpp` is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.
gguf-py/gguf/tensor_mapping.py (1)
122-142: LGTM - New Gemma3n vision tensor mappings added.
The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the `MODEL_TENSOR` enums defined in `constants.py`.
Note: The comments label these as "gemma3n", which is accurate for `V_MM_EMBEDDING`, `V_MM_HARD_EMB_NORM`, and `V_MM_POST_PROJ_NORM`. For `V_MM_INP_PROJ` and `V_MM_SOFT_EMB_NORM`, the constants.py comments indicate "gemma3", but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.
tools/mtmd/clip-model.h (1)
175-212: Well-structured MobileNetV5 block definition.
The `mobilenetv5_block` struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:
- Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
- Multi-Query Attention (MQA) components with optional downsampling
- Layer scale and block normalization
The struct follows the existing naming conventions and patterns in the file.
gguf-py/gguf/constants.py (3)
670-672: LGTM: New gemma3n tensor types properly defined.
The three new tensor types (`V_MM_EMBEDDING`, `V_MM_HARD_EMB_NORM`, `V_MM_POST_PROJ_NORM`) are:
- Properly prefixed with `V_MM_` following the existing naming convention
- Clearly documented as gemma3n-specific
- Correctly placed within the MODEL_TENSOR enum
1065-1067: Correct tensor mappings and MMPROJ integration.
The tensor name mappings and MMPROJ architecture additions are properly implemented:
- String names follow the `mm.*` convention used for multimodal tensors
- Tensors are correctly added to the `MODEL_TENSORS[MODEL_ARCH.MMPROJ]` list
- Consistent with existing patterns in the file
Also applies to: 1166-1168
1947-1981: Complete GEMMA3N architecture tensor list.
The `MODEL_ARCH.GEMMA3N` tensor list is comprehensive and well-organized:
- Includes all standard Gemma3 tensors (token embedding, attention, FFN)
- Properly extends with gemma3n-specific components:
  - Per-layer tensors (`PER_LAYER_TOKEN_EMBD`, `PER_LAYER_MODEL_PROJ`, etc.)
  - Altup tensors for alternative upsampling/routing
  - Laurel tensors for layer-wise processing
- Comments clearly indicate the purpose of specialized tensor groups
This ensures proper serialization and deserialization of Gemma3n models.
tools/mtmd/clip.cpp (4)
3128-3133: Potential issue with n_patches calculation for GEMMA3N.
The calculation `n_patches = image_size / patch_size` computes a single dimension (e.g., 16 for 256/16), but `n_patches` typically represents the total number of patches (i.e., `(image_size / patch_size)^2`). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16.
If the output is indeed 16x16, the calculation should be:

```diff
-    n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+    int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+    n_patches = n_per_side * n_per_side;
```

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.
791-794: LGTM: GEMMA3N graph builder integration.
The new case for `PROJECTOR_TYPE_GEMMA3N` correctly uses the dedicated `clip_graph_mobilenetv5` builder, consistent with how other projector types are handled.
1148-1155: LGTM: GEMMA3N hparams configuration.
The default `n_merge = 1` with an optional override via `KEY_PROJ_SCALE_FACTOR` follows the established pattern for other projector types.
2859-2867: LGTM: GEMMA3N preprocessing.
The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (`false` parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format (see the sketch below).
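For intuition, here is a minimal Python sketch of equivalent preprocessing; the 768px size and the 0.5/0.5 mean and std are illustrative assumptions, since the real values come from the model's GGUF metadata:

```python
import numpy as np
from PIL import Image

def preprocess(path: str, size: int = 768, mean: float = 0.5, std: float = 0.5) -> np.ndarray:
    # resize to a square with bilinear interpolation, no padding
    img = Image.open(path).convert("RGB").resize((size, size), Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - mean) / std  # normalize with the configured mean/std
```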
tools/mtmd/models/mobilenetv5.cpp (4)
5-20: LGTM: RMS Norm 2D helper implementation.
The `rms_norm_2d` helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of `ggml_cont` after permute ensures the tensor is contiguous for subsequent operations; a reference computation is sketched below.
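As a reference for the math, a small NumPy sketch of RMS-normalizing a CHW feature map over its channel axis; the shapes and eps value are illustrative assumptions, not values from the PR:

```python
import numpy as np

def rms_norm_2d(x, weight=None, eps=1e-6):
    # x: (C, H, W); normalize each spatial position over the C axis
    rms = np.sqrt(np.mean(x * x, axis=0, keepdims=True) + eps)
    y = x / rms
    if weight is not None:
        y = y * weight[:, None, None]  # learned per-channel scale
    return y

x = np.random.rand(8, 4, 4).astype(np.float32)
print(rms_norm_2d(x, np.ones(8, dtype=np.float32)).shape)  # (8, 4, 4)
```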
22-53: LGTM: SAME padding implementation.
The `pad_same_2d` helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior; see the arithmetic sketch below.
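The SAME-padding arithmetic described above can be sketched in a few lines of Python (the kernel, stride, and input size in the example are illustrative):

```python
import math

def same_pad_1d(in_size, kernel, stride, dilation=1):
    # ceiling division gives the output size under SAME padding
    out_size = math.ceil(in_size / stride)
    eff_kernel = (kernel - 1) * dilation + 1
    total = max(0, (out_size - 1) * stride + eff_kernel - in_size)
    # asymmetric split: the end side (bottom/right) gets the extra pixel
    return total // 2, total - total // 2

print(same_pad_1d(8, kernel=3, stride=2))  # (0, 1): no left pad, one pixel on the right
```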
153-250: LGTM: Attention block implementation.
The `build_mobilenet_attn` function correctly implements multi-query attention with:
- Optional input normalization
- Downsampled K/V paths using depthwise convolutions
- Proper Q/K/V reshaping and permutation for attention
- Scaled dot-product attention with softmax
- Output projection with optional layer scaling and residual connection
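As an aside, the multi-query layout (several query heads attending over one shared K/V head) can be sketched in NumPy as follows; the head count and dimensions are illustrative assumptions, not the model's real hyperparameters:

```python
import numpy as np

def multi_query_attention(x, wq, wk, wv, n_heads):
    # x: (T, D); queries get n_heads heads, K and V are shared across heads
    T, D = x.shape
    d = D // n_heads
    q = (x @ wq).reshape(T, n_heads, d)        # (T, H, d)
    k, v = x @ wk, x @ wv                      # (T, d) each, shared by all heads
    scores = np.einsum("thd,sd->hts", q, k) / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over source positions
    return np.einsum("hts,sd->thd", attn, v).reshape(T, D)

rng = np.random.default_rng(0)
D, H = 32, 8
x = rng.standard_normal((10, D))
out = multi_query_attention(x, rng.standard_normal((D, D)),
                            rng.standard_normal((D, D // H)),
                            rng.standard_normal((D, D // H)), H)
print(out.shape)  # (10, 32)
```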
403-463: LGTM: Gemma3n multimodal projection.
The embedding/projection logic correctly:
- Permutes and flattens spatial dimensions to sequence format
- Applies feature scaling by sqrt(hidden_size)
- Applies soft embedding normalization with optional learned weight
- Projects to text hidden size via linear layer
- Applies post-projection RMSNorm
This matches the expected Gemma3n vision embedder architecture.
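A compact NumPy sketch of that embedder pipeline, with made-up dimensions standing in for the real hidden sizes:

```python
import numpy as np

def rms_norm(x, w=None, eps=1e-6):
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return y * w if w is not None else y

def project_vision_embeddings(feat, w_proj, soft_w, post_w):
    # feat: (C, H, W) feature map -> (H*W, C) token sequence
    C, H, W = feat.shape
    seq = feat.reshape(C, H * W).T
    seq = seq * np.sqrt(C)          # feature scaling by sqrt(hidden_size)
    seq = rms_norm(seq, soft_w)     # soft embedding norm
    seq = seq @ w_proj              # linear projection to text hidden size
    return rms_norm(seq, post_w)    # post-projection RMSNorm

C, D = 64, 128
out = project_vision_embeddings(np.random.rand(C, 4, 4).astype(np.float32),
                                np.random.rand(C, D).astype(np.float32),
                                np.ones(C, np.float32), np.ones(D, np.float32))
print(out.shape)  # (16, 128)
```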
tools/mtmd/clip-impl.h (2)
156-195: LGTM: MobileNetV5 tensor name macros.
The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The `%d.%d` format for stage/block indexing aligns with the dynamic loading logic in `clip.cpp`.
214-214: LGTM: GEMMA3N projector type registration.
The new `PROJECTOR_TYPE_GEMMA3N` enum value and its string mapping `"gemma3n"` are correctly placed and follow the existing pattern.
Also applies to: 245-245
convert_hf_to_gguf.py (1)
522-527: Robust handling of empty `tensor_map.mapping` for block_count=0 looks good
Using a guarded branch for `max_name_len` avoids `ValueError` when `gguf.get_tensor_name_map(..., block_count=0)` produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.
```python
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
```
🧩 Analysis chain
🏁 Script executed:
```shell
# Read the convert_hf_to_gguf.py file around lines 6008-6012
find . -name "convert_hf_to_gguf.py" -type f | head -1 | xargs -I {} sh -c 'wc -l {} && sed -n "6000,6020p" {} | cat -n'
```

Repository: ngxson/llama.cpp
Length of output: 1253
Patch size computation for Gemma3n MobileNetV5 is off by a factor of √image_seq_length
The code currently computes patch_size = 768 // 256 = 3, which results in n_per_side = 256 and tokens = 65,536—wildly inconsistent with the 16×16 grid (256 tokens) described in the comment.
The issue is treating image_seq_length (total token count) as if it were patch size. Gemma 3n uses MobileNet v5 with a default image resolution of 768×768 pixels, and image_seq_length defaults to 256.
The correct approach derives patch size from the per-side token count:
```python
n_per_side = int(image_seq_length ** 0.5)  # 16 for 256 tokens
patch_size = image_size // n_per_side      # 768 // 16 = 48
```

This yields n_per_side = 16 and tokens = 256, matching the expected grid layout and HF processor config.
🧰 Tools
🪛 GitHub Actions: flake8 Lint
[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6008 to 6012, the code treats
image_seq_length as a per-side patch count which yields an incorrect tiny
patch_size; instead compute n_per_side = int(image_seq_length ** 0.5) to get the
number of patches per side (e.g. 16 for 256 tokens) and then set patch_size =
image_size // n_per_side (e.g. 768 // 16 = 48), and assign that patch_size back
into self.hparams_vision so the tokens and grid (n_per_side²) match the HF
processor config.
```python
    RESAMPLER = auto()
    GLM_EDGE = auto()
    MERGER = auto()
    GEMMA3N = auto()
```
Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.
The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.
🔎 Proposed fix
Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:
```diff
 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP: "mlp",
     VISION_PROJECTOR_TYPE.LDP: "ldp",
     VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
     VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
     VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
+    VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
+    VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
+    VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
 }
```

Note: QWEN3VL and COGVLM are also missing from this dictionary.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    GEMMA3N = auto()

VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
    VISION_PROJECTOR_TYPE.MLP: "mlp",
    VISION_PROJECTOR_TYPE.LDP: "ldp",
    VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
    VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
    VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
    VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
    VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
    VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
    VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
    VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
}
```
🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 459 (GEMMA3N added to
VISION_PROJECTOR_TYPE) and the VISION_PROJECTOR_TYPE_NAMES dictionary at lines
~850-858, the GEMMA3N enum value is not mapped to its string representation; add
an entry mapping VISION_PROJECTOR_TYPE.GEMMA3N to "gemma3n" in that dictionary.
Also add missing mappings for QWEN3VL and COGVLM (e.g.,
VISION_PROJECTOR_TYPE.QWEN3VL -> "qwen3vl_merger" and VISION_PROJECTOR_TYPE.COGVLM ->
"cogvlm") so all enum members have corresponding string names. Ensure
formatting/commas match the surrounding dictionary entries.
```cpp
ggml_cgraph * clip_graph_mobilenetv5::build() {
    fprintf(stderr, "\n--- START build_mobilenetv5 ---\n");
```
🛠️ Refactor suggestion | 🟠 Major
Replace debug fprintf with LOG_INF or remove.
Debug output to stderr should use the project's logging infrastructure for consistency. Either use LOG_INF or remove this debug statement before merging:
```diff
-    fprintf(stderr, "\n--- START build_mobilenetv5 ---\n");
+    LOG_INF("%s: building MobileNetV5 graph\n", __func__);
```

Alternatively, if this is temporary debug code, consider removing it entirely.
📝 Committable suggestion
```cpp
    LOG_INF("%s: building MobileNetV5 graph\n", __func__);
```
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around line 254, there is a debug fprintf
to stderr that should not be used; replace the fprintf(stderr, "\n--- START
build_mobilenetv5 ---\n"); with the project's logging macro (e.g.,
LOG_INF("START build_mobilenetv5");) or remove the statement entirely if it was
temporary debug code, ensuring you include any necessary header or namespace for
LOG_INF if used.
```cpp
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;

// Safety check for non-integer scaling if strictly replicating
if (high_res_w % feat_w != 0) {
    fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n");
}
```
🛠️ Refactor suggestion | 🟠 Major
Use LOG_WRN for warning messages.
Replace fprintf with the project's logging macros:

```diff
-    if (high_res_w % feat_w != 0) {
-        fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n");
-    }
+    if (high_res_w % feat_w != 0) {
+        LOG_WRN("%s: non-integer scaling detected in MSFA\n", __func__);
+    }
```

📝 Committable suggestion
```cpp
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;

// Safety check for non-integer scaling if strictly replicating
if (high_res_w % feat_w != 0) {
    LOG_WRN("%s: non-integer scaling detected in MSFA\n", __func__);
}
```
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around lines 330 to 336, the code currently
uses fprintf(stderr, "Warning: Non-integer scaling detected in MSFA\n"); which
violates the project's logging conventions; replace that fprintf call with the
project's warning macro, e.g. LOG_WRN("Non-integer scaling detected in MSFA");
(use the same message text, remove the trailing newline, and ensure the file
includes the header that defines LOG_WRN if not already included).
```cpp
        cur = ggml_pool_2d(ctx0, cur, GGML_OP_POOL_AVG, s, s, s, s, 0, 0);
    } else {
        fprintf(stderr, "Error: Irregular downsampling stride required.\n");
    }
}
```
🛠️ Refactor suggestion | 🟠 Major
Use LOG_ERR for error messages and consider aborting.
Replace fprintf with proper error logging. Also, irregular downsampling may indicate a misconfiguration that should be handled more robustly:

```diff
     } else {
-        fprintf(stderr, "Error: Irregular downsampling stride required.\n");
+        LOG_ERR("%s: irregular downsampling stride required (current_w=%d, target=%d)\n",
+                __func__, current_w, target_out_res);
+        // Consider: GGML_ABORT or return nullptr to signal failure
     }
```

Committable suggestion skipped: line range outside the PR's diff.
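For reference, the integral-stride average-pool this branch performs (reducing a square feature map to the 16×16 MSFA target) can be sketched in NumPy; the array sizes here are illustrative:

```python
import numpy as np

def avg_pool_to_target(x, target=16):
    # x: (C, W, W) square feature map; only valid when W is a multiple of target
    C, W, _ = x.shape
    if W % target != 0:
        return None  # irregular stride: the C++ path logs an error here
    s = W // target
    # non-overlapping s x s average pooling
    return x.reshape(C, target, s, target, s).mean(axis=(2, 4))

x = np.arange(2 * 32 * 32, dtype=np.float32).reshape(2, 32, 32)
print(avg_pool_to_target(x).shape)  # (2, 16, 16)
```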
Tip: For best results, initiate chat on the files or code changes.
🧩 Analysis chain
@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.
🏁 Scripts executed: 9 (output lengths: 3458, 10158, 8976, 3571, 2673, 3033, 5510, 4258, 4263)
@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:
✅ Good News
No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.
🐛 Issues Found
1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)
The calculation:

```python
# Current code:
image_seq_length = 256 # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3
```

Problem: A patch_size of 3 yields a 256-per-side grid (65,536 tokens), not the 16×16 grid (256 tokens) the comment describes. Should be:

```python
n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side      # = 768 // 16 = 48
```

This matches how Gemma3 (non-N) calculates it at line 5927, and keeps the C++ n_patches calculation at clip.cpp:3132 consistent.

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

```python
def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
if "intermediate_size" in keys:
hidden_size = self.hparams_vision.get("hidden_size", 2048) # Hardcoded default
return hidden_size * 4 # Hardcoded expansion factor
if "num_attention_heads" in keys or "num_heads" in keys:
return 8 # Hardcoded MQA headsThese should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof. 3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)The code accesses if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
                                                     1, 1, block.layer_scale_w->ne[0], 1); // Safe: inside if-block
```

This is actually safe due to the if-check, but similar patterns exist that should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

```cpp
int C = cur->ne[2]; // Should be 2048
```

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

```cpp
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h; // Unused
```

The contributor left commented-out code that should probably be removed.

📝 Recommendations
The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.

✏️ Learnings added
🧠 Learnings used
Mirror from upstream PR: ggml-org#18256
Summary by CodeRabbit