
Conversation

@ngxson ngxson (Owner) commented Dec 23, 2025

Mirror from upstream PR: ggml-org#18315

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

  • New Features
    • Added support for UTUVL vision model architecture (conversion, preprocessing, and inference paths).
    • Improved image preprocessing with dynamic resizing, spatial merging and windowed-attention support for vision transformers.
    • Extended token-embedding handling to better support tied word-embeddings and output duplication when needed.
    • Added UTU‑VL pre-tokenization and broader Unicode category handling for more robust tokenization.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai coderabbitai bot commented Dec 23, 2025

Walkthrough

Adds end-to-end UTUVL vision model support (conversion, GGUF params, projector, graph, preprocessing, tokenization, tensor mappings) and tie-word-embeddings handling for DeepseekV2; also updates Unicode \p{...} parsing and build units to include UTUVL.

Changes

Cohort / File(s) Summary
Conversion & Model Registry
convert_hf_to_gguf.py
Adds UtuVLVisionModel(MmprojModel) class, registers "UtuVLForConditionalGeneration" and "UTUVLForCausalLM", configures UTUVL-specific GGUF parameters and tensor mapping paths.
DeepseekV2 embedding tying
convert_hf_to_gguf.py
Adds a `_token_embd` attribute that captures the token-embedding tensor during conversion and, when word embeddings are tied, writes it out as the OUTPUT tensor.
GGUF / Projector constants
gguf-py/gguf/constants.py, tools/mtmd/clip-impl.h
Adds UTUVL = "utuvl" constant and PROJECTOR_TYPE_UTUVL enum entry with name mapping.
Tensor name mappings
gguf-py/gguf/tensor_mapping.py
Extends TensorNameMap with additional Siglip2 source keys for vision tensors (patch embed, encoder Q/K/V/O, norms, FFN, layer entries).
Vision projector implementation
tools/mtmd/clip.cpp, tools/mtmd/models/models.h, tools/mtmd/models/utuvl.cpp
Adds clip_graph_utuvl declaration and build() implementation; integrates UTUVL into clip.cpp flows (graph selection, hyperparams, tensor loading, preprocessing, token counting, batch encoding, embedding sizing).
Tokenizer / Vocab updates
src/llama-vocab.h, src/llama-vocab.cpp
Adds LLAMA_VOCAB_PRE_TYPE_UTU_VL enum and registers "utu-vl" pre-tokenizer with BPE-specific regex and flags (clean_spaces=false, ignore_merges=true).
Unicode parsing
src/unicode.cpp
Replaces fixed-length \p{...} parsing with dynamic brace-detection and category validation; registers additional Unicode letter categories (Lu, Ll, Lt, Lm, Lo).
Build & cmake
tools/mtmd/CMakeLists.txt
Adds models/utuvl.cpp to mtmd library sources.
mtmd initialization
tools/mtmd/mtmd.cpp
Includes PROJECTOR_TYPE_UTUVL in vision-projector branch for image delimiter initialization.

Sequence Diagram

sequenceDiagram
    actor User
    participant Loader as ModelLoader
    participant Registry as ModelRegistry
    participant Projector as ProjectorBuilder
    participant Preproc as ImageProcessor
    participant Graph as VisionGraph
    participant Tensors as TensorOps

    User->>Registry: Request load UTUVL model
    Registry->>Loader: Instantiate UtuVLVisionModel
    Loader->>Projector: Recognize PROJECTOR_TYPE_UTUVL
    Projector->>Graph: Build clip_graph_utuvl

    User->>Preproc: Submit image
    Preproc->>Preproc: Compute merge-aware patch grid & normalize
    Preproc->>Graph: Provide image tensor

    Note over Graph: Windowed attention / transformer pass
    Graph->>Tensors: Apply RoPE & proj (Q/K/V)
    Graph->>Tensors: Apply window mask & attention
    Graph->>Tensors: Residuals & FFN per layer
    Graph->>Tensors: VLPatchMerger (RMS, reshape, FFN)

    Graph->>User: Return vision embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 3.85%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title '[Mirror] Support Youtu-VL Model' accurately describes the primary change—adding support for the Youtu-VL model—and is clear, concise, and specific.
  • Description check: ✅ Passed. The description is minimal but provides sufficient context by identifying this as a mirror PR of upstream ggml-org#18315 and noting that the purpose is to add Youtu-VL model support.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
convert_hf_to_gguf.py (1)

7247-7344: OUTPUT tensor bypass of quantization is intentional but undocumented; embedding naming assumption is reasonable for DeepseekV2.

The tied-embeddings implementation in DeepseekV2Model is functionally correct but has two points worth clarifying:

  1. Quantization bypass for OUTPUT is real: The code calls self.gguf_writer.add_tensor(output_name, self._token_embd.numpy()) directly, which bypasses the quantization pipeline in ModelBase.prepare_tensors(). This means OUTPUT will be stored at the original tensor precision (likely F16) rather than following the model's ftype setting (e.g., MOSTLY_Q8_0). This appears intentional for tied-embedding models to preserve logit precision, but should be documented if that's the design goal.

  2. Embedding naming assumption is reasonable: The check for "model.embed_tokens.weight" is the correct naming for DeepseekV2 architecture variants. If future variants use different naming (e.g., without the model. prefix), the OUTPUT tensor would not be written, but this is unlikely for the DeepseekV2/UTUVL family.

If high-precision OUTPUT tensors are intended for quality, consider adding a comment explaining this trade-off. If OUTPUT should follow quantization settings instead, capture the quantized tensor from the parent prepare_tensors() call rather than the raw _token_embd.

🧹 Nitpick comments (8)
src/llama-vocab.cpp (1)

1869-1873: Minor: Trailing whitespace on line 1873.

The tokenizer configuration looks correct with clean_spaces = false and ignore_merges = true, consistent with similar pre-tokenizers like llama3 and tekken.

🔎 Proposed fix
             } else if (
                     tokenizer_pre == "utu-vl") {
                 pre_type = LLAMA_VOCAB_PRE_TYPE_UTU_VL;
                 clean_spaces = false;
-                ignore_merges = true; 
+                ignore_merges = true;
tools/mtmd/models/utuvl.cpp (2)

160-160: Inconsistent epsilon value - uses hardcoded 1e-6 instead of member eps.

The RMS norm for the merger uses a hardcoded 1e-6 epsilon, while other norms in this file use the member variable eps. If this is intentional (perhaps the merger requires a different epsilon), consider adding a comment to clarify.

🔎 Proposed fix (if same eps is intended)
-    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, 1e-6, -1);
+    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, eps, -1);

173-185: Remove commented-out dead code.

This block was replaced by the build_ffn call above. Commented-out code should be removed to improve maintainability.

🔎 Proposed fix
     embeddings = build_ffn(embeddings,
                     model.mm_0_w, model.mm_0_b,
                     nullptr, nullptr,
                     model.mm_1_w, model.mm_1_b,
                     FFN_GELU,
                     -1);
-    // // 3. First linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
-    // cb(embeddings, "merger_fc1", -1);
-
-    // // 4. GELU activation
-    // embeddings = ggml_gelu(ctx0, embeddings);
-    // cb(embeddings, "merger_gelu", -1);
-
-    // // 5. Second linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_1_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_1_b);
 
     // build the graph
tools/mtmd/clip.cpp (3)

1165-1175: Consider aligning warning messages with QWEN25VL for consistency.

The warning for UTUVL is terser than the QWEN25VL one, which provides additional context about the --image-min-tokens flag and a reference link. Also, QWEN25VL loads n_wa_pattern from config while UTUVL hardcodes window attention behavior in the batch encoder.

If window attention pattern should be configurable, consider loading it from the model config similar to QWEN25VL.

🔎 Suggested improvement for warning message
                         const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
                         if (hparams.image_min_pixels < warn_min_pixels) {
                             LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+                            LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
                         }

2710-2760: Potential edge cases in scaling loop.

A few observations about the preprocessing logic:

  1. Loop termination: The while (scale > 0.0f) loop could theoretically reach very small or zero scale values for extremely large images. Consider adding a minimum scale threshold to prevent potential issues.

  2. Fallback value: When params.image_max_pixels is 0, max_num_patches becomes 0, which would always trigger the scale reduction.

  3. Comment accuracy: The comment mentions "binary search" but the implementation is a linear search with 0.02 decrements.

🔎 Suggested defensive bounds
-                // Binary search for optimal scale to fit within max_num_patches
+                // Linear search for optimal scale to fit within max_num_patches
                 float scale = 1.0f;
                 int target_height = original_size.height;
                 int target_width = original_size.width;
                 
                 auto get_scaled_image_size = [align_size](float scale, int size) -> int {
                     float scaled_size = size * scale;
                     // Round up to nearest multiple of align_size
                     int aligned = static_cast<int>(std::ceil(scaled_size / align_size)) * align_size;
                     // Ensure at least one patch
                     return std::max(align_size, aligned);
                 };
                 
-                // Binary search with 0.02 step size
-                while (scale > 0.0f) {
+                // Linear search with 0.02 step size
+                const float min_scale = 0.02f;
+                while (scale >= min_scale) {
                     target_height = get_scaled_image_size(scale, original_size.height);
                     target_width = get_scaled_image_size(scale, original_size.width);
                     
                     int num_patches_h = target_height / patch_size;
                     int num_patches_w = target_width / patch_size;
                     int num_patches = num_patches_h * num_patches_w;
                     
                     if (num_patches > max_num_patches) {
                         scale -= 0.02f;
                     } else {
                         break;
                     }
                 }

3426-3476: Dead code and hardcoded window attention.

  1. Dead code: Since use_window_attn is hardcoded to true (line 3428), the else branch at lines 3472-3475 will never execute.

  2. Consider making window attention configurable: Unlike QWEN25VL which reads n_wa_pattern from model config to determine window attention behavior, UTUVL hardcodes it. If this is intentional (UTUVL always requires window attention), consider removing the dead else branch or adding a comment explaining this design choice.

🔎 Remove dead code
         case PROJECTOR_TYPE_UTUVL:
             {
-                const bool use_window_attn = true;
                 const int merge_ratio = 2;
                 const int pw  = image_size_width  / patch_size / merge_ratio;  // patches after merger
                 const int ph  = image_size_height / patch_size / merge_ratio;
                 const int ipw = image_size_width  / patch_size;  // internal patches in ViT
                 const int iph = image_size_height / patch_size;
                 std::vector<int> idx    (ph * pw);
                 std::vector<int> inv_idx(ph * pw);
-                if (use_window_attn) {
+                // UTUVL always uses window attention
+                {
                     const int attn_window_size = patch_size * 2 * 8;
                     // ... rest of window attention code ...
-                } else {
-                    for (int i = 0; i < ph * pw; i++) {
-                        idx[i] = i;
-                    }
                 }
convert_hf_to_gguf.py (2)

7138-7237: DeepseekV2 MoE GGUF metadata wiring for UTUVL is reasonable but a bit implicit.

  • Registering "UTUVLForCausalLM" on DeepseekV2Model matches how other multimodal families register both text and mmproj variants under the same HF arch name; no issue there.
  • In set_gguf_parameters, the new MoE section:
    • Falls back from moe_intermediate_size to intermediate_size (or 0) for expert_feed_forward_length, which keeps GGUF consistent for both MoE and dense-only DeepseekV2‑style configs.
    • Writes expert_count, expert_shared_count (default 0), expert_weights_scale (default 1.0), and expert_weights_norm when norm_topk_prob is truthy. This mirrors how other MoE models in this file are handled and should be safe.
Two minor nits you might want to consider (not blockers):
  • If a model is not MoE but happens to have an intermediate_size, you’ll still emit expert_feed_forward_length with a non‑zero value; if that’s unintended, consider guarding all MoE metadata behind a single if hparams.get("n_routed_experts") is not None: block.
  • add_expert_weights_norm is called with hparams["norm_topk_prob"] which appears to be boolean in Deepseek configs; if any future variant uses this field differently, this call might need to be revisited.

Given this mirrors upstream behavior, I’d leave it as‑is unless you see non‑MoE DeepseekV2 models hitting this path.


10501-10541: UtuVLVisionModel implementation looks sound; small polish opportunities only.

Behavior:

  • Registration under both "UtuVLForConditionalGeneration" and "UTUVLForCausalLM" on an MmprojModel mirrors how other VL families (e.g. Qwen2VL) expose both text and vision variants; this is consistent.

  • set_gguf_parameters:

    • Calls super().set_gguf_parameters() so all generic vision/audio metadata is preserved.
    • Sets projector type to UTUVL.
    • Uses self.hparams["layer_norm_eps"] (inherited from MmprojModel, which rebinds self.hparams to the vision config) and a tolerant activation check:
      • Treats several GELU aliases (gelu_pytorch_tanh, etc.) as vision_use_gelu=True.
      • Supports silu via vision_use_silu=True.
      • Raises ValueError for anything else, which is an appropriate hard‑fail given unknown activations would break runtime.
    • Writes vision_spatial_merge_size with a default of 2 when missing, which is at least explicit and will not crash.
  • modify_tensors:

    • Cleanly drops obvious language‑model tensors (lm_head., model.layers., model.embed_tokens., model.norm.).
    • For everything else, delegates to self.map_tensor_name(name) and, on failure, logs a warning and silently drops the tensor. That’s in line with other vision subclasses that prefer “best‑effort + warning” over hard failure.

Optional polish (not required):

  • Ruff’s TRY003/TRY300 hints here are stylistic only:
    • The ValueError(f"Unsupported activation function for UTUVL: {hidden_act}") could be shortened slightly, but it’s already clear and localized.
    • The try/except ValueError in modify_tensors is fine as written; splitting the return [] into an else block would be marginally clearer but not necessary.

From a functional standpoint, this class fits well with the existing multimodal patterns in this file.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed75977 and 8733bf3.

📒 Files selected for processing (12)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/llama-vocab.cpp
  • src/llama-vocab.h
  • src/unicode.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • src/llama-vocab.h
  • tools/mtmd/models/models.h
  • src/llama-vocab.cpp
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
  • src/unicode.cpp
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • src/llama-vocab.h
  • gguf-py/gguf/constants.py
  • tools/mtmd/models/models.h
  • src/llama-vocab.cpp
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
  • src/unicode.cpp
  • tools/mtmd/clip.cpp
  • gguf-py/gguf/tensor_mapping.py
  • tools/mtmd/clip-impl.h
  • convert_hf_to_gguf.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/llama-vocab.cpp
  • src/unicode.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
🧬 Code graph analysis (3)
tools/mtmd/models/models.h (1)
tools/mtmd/models/utuvl.cpp (2)
  • build (3-190)
  • build (3-3)
tools/mtmd/models/utuvl.cpp (1)
ggml/src/ggml.c (9)
  • ggml_reshape_4d (3583-3601)
  • ggml_permute (3700-3752)
  • ggml_cont_4d (3492-3508)
  • ggml_mul_mat (3174-3189)
  • ggml_add (1969-1974)
  • ggml_reshape_2d (3546-3562)
  • ggml_set_name (1841-1848)
  • ggml_cast (3433-3445)
  • ggml_build_forward_expand (6793-6795)
convert_hf_to_gguf.py (3)
gguf-py/gguf/gguf_writer.py (10)
  • add_expert_feed_forward_length (714-715)
  • add_expert_count (800-801)
  • add_expert_shared_count (806-807)
  • add_expert_weights_scale (815-816)
  • add_expert_weights_norm (818-819)
  • add_clip_projector_type (1083-1084)
  • add_vision_attention_layernorm_eps (1104-1105)
  • add_vision_use_gelu (1122-1123)
  • add_vision_use_silu (1125-1126)
  • add_vision_spatial_merge_size (1119-1120)
convert_lora_to_gguf.py (1)
  • modify_tensors (448-473)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3414-3435)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

10521-10521: Avoid specifying long messages outside the exception class

(TRY003)


10536-10536: Consider moving this statement to an else block

(TRY300)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: ios-xcode-build
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: macOS-latest-cmake-arm64
🔇 Additional comments (19)
src/unicode.cpp (2)

967-971: LGTM! Unicode letter subcategories correctly added.

The addition of Lu, Ll, Lt, Lm, and Lo subcategories mapping to the LETTER flag is correct and follows the existing pattern. The inline comments clearly document each subcategory type.
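
For reference, the idea collapses to a category-to-flag map; this is a standalone sketch with assumed type and variable names, not the actual unicode.cpp symbols:

    // Minimal sketch: every letter subcategory resolves to the same LETTER flag.
    #include <cstdint>
    #include <map>
    #include <string>

    enum : uint16_t { FLAG_LETTER = 1 << 0 /* other flags omitted */ };

    static const std::map<std::string, uint16_t> k_ucat_to_flag = {
        { "L",  FLAG_LETTER }, // any letter (pre-existing entry)
        { "Lu", FLAG_LETTER }, // uppercase letter
        { "Ll", FLAG_LETTER }, // lowercase letter
        { "Lt", FLAG_LETTER }, // titlecase letter
        { "Lm", FLAG_LETTER }, // modifier letter
        { "Lo", FLAG_LETTER }, // other letter
    };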


1082-1103: LGTM! Dynamic \p{...} pattern resolution is well-implemented.

The replacement of fixed-length pattern matching with dynamic resolution correctly:

  • Searches for the closing brace dynamically (Line 1087)
  • Validates the pattern length with a reasonable limit (Line 1088)
  • Conditionally inserts brackets to avoid nesting when not inside a character class (Lines 1091-1098)
  • Advances the loop index correctly past the matched pattern (Line 1099)

The boundary checks are sound, and the logic handles varying-length Unicode property patterns correctly.
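
To make the described flow concrete, here is a standalone sketch of the same idea; the helper, the two-character length limit, and the error handling are illustrative assumptions rather than the actual unicode.cpp implementation:

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Assumed helper for this sketch: returns a bracket-expression body for a
    // category such as "Lu"; the real code resolves categories to codepoint flags.
    static std::string lookup_category_ranges(const std::string & cat) {
        (void) cat;
        return "A-Za-z"; // placeholder ranges, illustration only
    }

    static std::string expand_p_categories(const std::string & re) {
        std::string out;
        bool inside_class = false; // naive [...] tracking, good enough for illustration
        for (size_t i = 0; i < re.size(); ) {
            if (re[i] == '[') { inside_class = true;  }
            if (re[i] == ']') { inside_class = false; }
            if (re.compare(i, 3, "\\p{") == 0) {
                const size_t close = re.find('}', i + 3); // dynamic brace detection
                if (close == std::string::npos || close - (i + 3) > 2) {
                    throw std::runtime_error("unsupported \\p{...} pattern");
                }
                const std::string cat  = re.substr(i + 3, close - (i + 3)); // e.g. "Lu"
                const std::string body = lookup_category_ranges(cat);
                // only wrap in [...] when not already inside a character class,
                // so the expansion does not produce nested brackets
                out += inside_class ? body : "[" + body + "]";
                i = close + 1; // advance past the matched pattern
            } else {
                out += re[i++];
            }
        }
        return out;
    }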

gguf-py/gguf/constants.py (1)

3414-3435: UTUVL projector string is consistent with existing naming

VisionProjectorType.UTUVL = "utuvl" aligns with the new C++ projector type name and other projector strings; no issues here.

gguf-py/gguf/tensor_mapping.py (1)

1218-1458: UTUVL/Siglip2 tensor mappings look consistent; validate against real checkpoints

The added "merger.*" and "siglip2.vision_model.*" source keys for V_MMPROJ, V_ENC_* and related tensors are consistent with existing patterns for other vision models and correctly target the corresponding MODEL_TENSOR entries. Behavior remains purely additive.

It would still be good to run at least one UTUVL HF → GGUF conversion and load in mtmd to confirm all expected tensors are picked up and no additional Siglip2 variants are missing.

tools/mtmd/CMakeLists.txt (1)

5-31: UTUVL model source correctly wired into mtmd target

Adding models/utuvl.cpp to the mtmd library matches the pattern of other model sources and is sufficient to include the new graph implementation in the build.

tools/mtmd/clip-impl.h (1)

163-221: Enum and string mapping for UTUVL are correctly aligned

PROJECTOR_TYPE_UTUVL is added to projector_type and mapped to "utuvl" in PROJECTOR_TYPE_NAMES, matching the Python-side constant and gguf metadata string. No issues seen.
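
For context, the two additions have roughly this shape (existing entries elided; the exact container type in clip-impl.h is assumed):

    #include <map>
    #include <string>

    enum projector_type {
        // ... existing entries ...
        PROJECTOR_TYPE_UTUVL,
    };

    static const std::map<projector_type, std::string> PROJECTOR_TYPE_NAMES = {
        // ... existing entries ...
        { PROJECTOR_TYPE_UTUVL, "utuvl" }, // must match VisionProjectorType.UTUVL on the Python side
    };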

src/llama-vocab.h (1)

9-55: New UTU_VL pre-tokenizer enum value is fine

Appending LLAMA_VOCAB_PRE_TYPE_UTU_VL = 43 to llama_vocab_pre_type is consistent with existing entries and keeps ordinal values stable. Assuming the corresponding handling is implemented in llama-vocab.cpp, this looks good.

tools/mtmd/mtmd.cpp (1)

268-290: UTUVL reuses Qwen-style vision delimiters correctly

Including PROJECTOR_TYPE_UTUVL in the Qwen2VL/Qwen25VL/Qwen3VL branch makes UTUVL use <|vision_start|> / <|vision_end|> markers, which is consistent with how other Qwen-based VL projectors are handled.

Please double‑check that the UTUVL text tokenizer indeed defines these exact special tokens.
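
For illustration, the effect of the shared branch is roughly the following; the function and enum here are a standalone sketch, not the actual mtmd.cpp structure:

    #include <string>
    #include <utility>

    enum projector_type {
        PROJECTOR_TYPE_QWEN2VL,
        PROJECTOR_TYPE_QWEN25VL,
        PROJECTOR_TYPE_QWEN3VL,
        PROJECTOR_TYPE_UTUVL,
        PROJECTOR_TYPE_OTHER,
    };

    // UTUVL reuses the Qwen-style begin/end image delimiters.
    static std::pair<std::string, std::string> image_delimiters(projector_type proj) {
        switch (proj) {
            case PROJECTOR_TYPE_QWEN2VL:
            case PROJECTOR_TYPE_QWEN25VL:
            case PROJECTOR_TYPE_QWEN3VL:
            case PROJECTOR_TYPE_UTUVL:
                return { "<|vision_start|>", "<|vision_end|>" };
            default:
                return { "", "" };
        }
    }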

tools/mtmd/models/models.h (1)

25-28: UTUVL clip graph declaration matches existing pattern

clip_graph_utuvl follows the same inheritance and constructor pattern as other clip_graph_* types and declares a virtual build() override, which fits the documented implementation in utuvl.cpp.

src/llama-vocab.cpp (1)

317-322: LGTM - UTU_VL regex patterns are well-structured.

The regex expressions correctly handle Korean Hangul, CJK punctuation, Bopomofo, and Chinese/Japanese characters, followed by a comprehensive pattern for mixed-case words with contractions. This follows the established pattern for other multilingual tokenizers in the codebase.

tools/mtmd/models/utuvl.cpp (3)

3-14: LGTM - Clean initialization with appropriate assertions.

The setup correctly initializes window attention parameters. The batch_size = 1 constraint is properly enforced via GGML_ASSERT statements later in the code (lines 79, 146).


18-49: LGTM - Window partitioning logic for patch embeddings.

The sequence of reshape/permute/cont operations correctly reorganizes patches into a windowed format for efficient attention computation. The math appears consistent: patches are organized into 2×2 windows (m=2) and the final shape matches the expected (3*patch_size*patch_size, Hm*Wm*m*m, 1) before the patch embedding projection.
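
As a plain-index illustration of that grouping (a sketch of the ordering the reshape/permute/cont chain achieves, not the ggml code itself):

    #include <vector>

    // order[k] gives the original row-major patch index of the k-th patch after
    // grouping, so each m x m block of neighbouring patches becomes contiguous.
    static std::vector<int> window_order(int Hp, int Wp, int m = 2) {
        const int Hm = Hp / m;
        const int Wm = Wp / m;
        std::vector<int> order;
        order.reserve(Hp * Wp);
        for (int bh = 0; bh < Hm; ++bh) {           // window row
            for (int bw = 0; bw < Wm; ++bw) {       // window column
                for (int dh = 0; dh < m; ++dh) {    // row inside the window
                    for (int dw = 0; dw < m; ++dw) {
                        order.push_back((bh * m + dh) * Wp + (bw * m + dw));
                    }
                }
            }
        }
        return order;
    }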


86-138: LGTM - Standard transformer layer implementation with windowed attention.

The layer loop correctly implements:

  • Pre-norm architecture with residual connections
  • Q/K/V projections with RoPE positional encoding
  • Windowed attention with full attention every 8th layer and on the last layer ((il + 1) % 8 == 0 || il == n_layer - 1)
  • FFN with configurable activation
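
The alternation condition from the list above, isolated as a predicate for clarity (argument names are illustrative):

    // Full attention on every 8th layer and on the final layer; window attention otherwise.
    static bool use_full_attention(int il, int n_layer) {
        return (il + 1) % 8 == 0 || il == n_layer - 1;
    }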
tools/mtmd/clip.cpp (5)

848-851: LGTM!

The UTUVL case follows the established pattern for other projector types in the graph builder switch statement.


1511-1518: LGTM!

Tensor loading correctly loads UTUVL-specific tensors including the RMS normalization weight for the merger.


2993-2994: LGTM!

Token counting functions correctly include UTUVL alongside related projector types (QWEN2VL, QWEN25VL, QWEN3VL, GLM4V) that share the same merge-based output token calculation.

Also applies to: 3009-3010, 3070-3076
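
Rough shape of that shared calculation, as a standalone sketch (exact rounding/clamping in clip.cpp may differ; patch_size = 16 in the example is an arbitrary illustrative value):

    // Output tokens = patches per axis, each reduced by the spatial merge factor (2 for UTUVL).
    static int n_output_tokens(int img_w, int img_h, int patch_size, int n_merge) {
        const int pw = img_w / patch_size / n_merge; // merged patches per row
        const int ph = img_h / patch_size / n_merge; // merged patches per column
        return pw * ph;
    }
    // e.g. with patch_size = 16 and n_merge = 2, a 512 x 512 image yields (512/16/2)^2 = 256 tokens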


3477-3496: LGTM!

Position calculation for UTUVL correctly remaps positions based on window grouping, following the same pattern as QWEN25VL but with a parameterized merge_ratio, which is cleaner.


3668-3669: LGTM!

The embedding dimension correctly uses mm_1_b->ne[0], consistent with tensor loading and similar projector types.

convert_hf_to_gguf.py (1)

1176-1177: UTUVL tokenizer hash entry looks consistent; just ensure runtime support exists.

The new chkhsh == "9d7013..." → res = "utu-vl" branch follows the same pattern as other models and should work as long as the C++ side and convert_hf_to_gguf_update.py have been updated to recognize "utu-vl" as a valid tokenizer.ggml.pre value. Nothing functionally wrong here from this script’s perspective; just double‑check the upstream/runtime side is in sync so this identifier isn’t treated as “unknown pre‑tokenizer”.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tools/mtmd/models/utuvl.cpp (1)

41-41: Minor: Inconsistent spacing in dimension calculation.

Line 41 has inconsistent spacing: 3*patch_size* patch_size should be 3 * patch_size * patch_size for consistency with the rest of the codebase.

Proposed fix
-        3*patch_size* patch_size,  Hm * Wm * m * m, 1);
+        3 * patch_size * patch_size, Hm * Wm * m * m, 1);
tools/mtmd/clip.cpp (1)

1165-1175: Consider adding guidance to the warning message.

The warning at line 1173-1174 only states the minimum requirement but doesn't provide guidance on how to fix it, unlike the Qwen-VL warning (lines 1160-1162) which suggests --image-min-tokens 1024.

Proposed enhancement
                     if (hparams.image_min_pixels < warn_min_pixels) {
                         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+                        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
                     }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8733bf3 and 1600974.

📒 Files selected for processing (2)
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/models/utuvl.cpp (3)

64-83: LGTM!

The window attention setup correctly handles index remapping and mask creation, with appropriate F16 casting for flash attention compatibility.


86-138: LGTM!

The transformer layer implementation correctly handles:

  • Window/full attention alternation pattern (every 8th layer + final layer uses full attention)
  • Multi-head attention with separate Q/K/V projections
  • RoPE positional encoding via ggml_rope_multi
  • Proper residual connections and FFN processing

160-160: Verify the hardcoded epsilon value for merger RMS norm.

Line 160 uses a hardcoded 1e-6 for the RMS normalization epsilon, while other build_norm calls in this function use the eps member variable. This may be intentional if the VLPatchMerger requires a different epsilon, but it's worth confirming this matches the original model specification.

tools/mtmd/clip.cpp (5)

848-851: LGTM!

The graph builder selection for UTUVL follows the established pattern for other projector types.


1511-1518: LGTM!

The tensor loading correctly loads the merger normalization weight and MLP projection weights, matching the graph construction in utuvl.cpp.


2993-2994: LGTM!

The token counting logic for UTUVL correctly shares the same calculation as Qwen and GLM4V models, consistent with the 2x spatial merge configuration.

Also applies to: 3009-3010, 3070-3071


3597-3598: LGTM!

The embedding size resolution correctly uses mm_1_b->ne[0], consistent with the merger MLP output tensor loaded for UTUVL.


3345-3360: and

Comment on lines +2733 to +2746
while (scale > 0.0f) {
target_height = get_scaled_image_size(scale, original_size.height);
target_width = get_scaled_image_size(scale, original_size.width);

int num_patches_h = target_height / patch_size;
int num_patches_w = target_width / patch_size;
int num_patches = num_patches_h * num_patches_w;

if (num_patches > max_num_patches) {
scale -= 0.02f;
} else {
break;
}
}

⚠️ Potential issue | 🟡 Minor

Misleading comment: this is a linear search, not binary search.

The comment says "Binary search with 0.02 step size" but the implementation is actually a linear search that decrements the scale by 0.02 until the patch count fits within limits. While this works correctly, the comment is inaccurate.

Proposed fix
-                // Binary search with 0.02 step size
+                // Linear search: decrement scale by 0.02 until patches fit
                 while (scale > 0.0f) {

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tools/mtmd/clip.cpp around lines 2733 to 2746, the existing comment
incorrectly states "Binary search with 0.02 step size" while the code actually
performs a linear decrement of scale by 0.02 until the number of patches fits;
update the comment to accurately describe the behavior (e.g., "Linear search:
decrement scale by 0.02 until patch count <= max_num_patches") or, if you prefer
true binary search, replace the loop with a proper binary search over scale that
narrows high/low bounds until the target resolution yields <= max_num_patches
(ensure to preserve the 0.02 resolution or converge to a tolerance).
