[Mirror] Support Youtu-VL Model #67
Walkthrough

Adds end-to-end UTUVL vision model support (conversion, GGUF params, projector, graph, preprocessing, tokenization, tensor mappings) and tie-word-embeddings handling for DeepseekV2; also updates Unicode \p{...} parsing and build units to include UTUVL.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant Loader as ModelLoader
    participant Registry as ModelRegistry
    participant Projector as ProjectorBuilder
    participant Preproc as ImageProcessor
    participant Graph as VisionGraph
    participant Tensors as TensorOps
    User->>Registry: Request load UTUVL model
    Registry->>Loader: Instantiate UtuVLVisionModel
    Loader->>Projector: Recognize PROJECTOR_TYPE_UTUVL
    Projector->>Graph: Build clip_graph_utuvl
    User->>Preproc: Submit image
    Preproc->>Preproc: Compute merge-aware patch grid & normalize
    Preproc->>Graph: Provide image tensor
    Note over Graph: Windowed attention / transformer pass
    Graph->>Tensors: Apply RoPE & proj (Q/K/V)
    Graph->>Tensors: Apply window mask & attention
    Graph->>Tensors: Residuals & FFN per layer
    Graph->>Tensors: VLPatchMerger (RMS, reshape, FFN)
    Graph->>User: Return vision embeddings
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Warning: Review ran into problems. 🔥 Problems: Git: Failed to clone repository.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
convert_hf_to_gguf.py (1)
7247-7344: OUTPUT tensor bypass of quantization is intentional but undocumented; embedding naming assumption is reasonable for DeepseekV2.

The tied-embeddings implementation in `DeepseekV2Model` is functionally correct but has two points worth clarifying:

- Quantization bypass for OUTPUT is real: the code calls `self.gguf_writer.add_tensor(output_name, self._token_embd.numpy())` directly, which bypasses the quantization pipeline in `ModelBase.prepare_tensors()`. This means OUTPUT will be stored at the original tensor precision (likely F16) rather than following the model's `ftype` setting (e.g., MOSTLY_Q8_0). This appears intentional for tied-embedding models to preserve logit precision, but should be documented if that's the design goal.
- Embedding naming assumption is reasonable: the check for `"model.embed_tokens.weight"` is the correct naming for DeepseekV2 architecture variants. If future variants use different naming (e.g., without the `model.` prefix), the OUTPUT tensor would not be written, but this is unlikely for the DeepseekV2/UTUVL family.

If high-precision OUTPUT tensors are intended for quality, consider adding a comment explaining this trade-off. If OUTPUT should follow quantization settings instead, capture the quantized tensor from the parent `prepare_tensors()` call rather than the raw `_token_embd`.
🧹 Nitpick comments (8)
src/llama-vocab.cpp (1)
1869-1873: Minor: Trailing whitespace on line 1873.

The tokenizer configuration looks correct with `clean_spaces = false` and `ignore_merges = true`, consistent with similar pre-tokenizers like llama3 and tekken.

🔎 Proposed fix

```diff
         } else if (
                 tokenizer_pre == "utu-vl") {
             pre_type = LLAMA_VOCAB_PRE_TYPE_UTU_VL;
             clean_spaces = false;
-            ignore_merges = true; 
+            ignore_merges = true;
```

tools/mtmd/models/utuvl.cpp (2)
160-160: Inconsistent epsilon value - uses hardcoded 1e-6 instead of member `eps`.

The RMS norm for the merger uses a hardcoded `1e-6` epsilon, while other norms in this file use the member variable `eps`. If this is intentional (perhaps the merger requires a different epsilon), consider adding a comment to clarify.

🔎 Proposed fix (if same eps is intended)

```diff
-    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, 1e-6, -1);
+    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, eps, -1);
```
173-185: Remove commented-out dead code.

This block was replaced by the `build_ffn` call above. Commented-out code should be removed to improve maintainability.

🔎 Proposed fix

```diff
     embeddings = build_ffn(embeddings, model.mm_0_w, model.mm_0_b, nullptr, nullptr, model.mm_1_w, model.mm_1_b, FFN_GELU, -1);

-    // // 3. First linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
-    // cb(embeddings, "merger_fc1", -1);
-
-    // // 4. GELU activation
-    // embeddings = ggml_gelu(ctx0, embeddings);
-    // cb(embeddings, "merger_gelu", -1);
-
-    // // 5. Second linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_1_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_1_b);

     // build the graph
```

tools/mtmd/clip.cpp (3)
1165-1175: Consider aligning warning messages with QWEN25VL for consistency.

The warning for UTUVL is more terse compared to QWEN25VL, which provides additional context about the `--image-min-tokens` flag and a reference link. Also, QWEN25VL loads `n_wa_pattern` from config while UTUVL hardcodes window attention behavior in the batch encoder.

If window attention pattern should be configurable, consider loading it from the model config similar to QWEN25VL.

🔎 Suggested improvement for warning message

```diff
     const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
2710-2760: Potential edge cases in scaling loop.

A few observations about the preprocessing logic:

- Loop termination: the `while (scale > 0.0f)` loop could theoretically reach very small or zero scale values for extremely large images. Consider adding a minimum scale threshold to prevent potential issues.
- Fallback value: when `params.image_max_pixels` is 0, `max_num_patches` becomes 0, which would always trigger the scale reduction.
- Comment accuracy: the comment mentions "binary search" but the implementation is a linear search with 0.02 decrements.

🔎 Suggested defensive bounds

```diff
-    // Binary search for optimal scale to fit within max_num_patches
+    // Linear search for optimal scale to fit within max_num_patches
     float scale = 1.0f;
     int target_height = original_size.height;
     int target_width = original_size.width;

     auto get_scaled_image_size = [align_size](float scale, int size) -> int {
         float scaled_size = size * scale;
         // Round up to nearest multiple of align_size
         int aligned = static_cast<int>(std::ceil(scaled_size / align_size)) * align_size;
         // Ensure at least one patch
         return std::max(align_size, aligned);
     };

-    // Binary search with 0.02 step size
-    while (scale > 0.0f) {
+    // Linear search with 0.02 step size
+    const float min_scale = 0.02f;
+    while (scale >= min_scale) {
         target_height = get_scaled_image_size(scale, original_size.height);
         target_width = get_scaled_image_size(scale, original_size.width);

         int num_patches_h = target_height / patch_size;
         int num_patches_w = target_width / patch_size;
         int num_patches = num_patches_h * num_patches_w;

         if (num_patches > max_num_patches) {
             scale -= 0.02f;
         } else {
             break;
         }
     }
```
3426-3476: Dead code and hardcoded window attention.
- Dead code: since `use_window_attn` is hardcoded to `true` (line 3428), the `else` branch at lines 3472-3475 will never execute.
- Consider making window attention configurable: unlike QWEN25VL, which reads `n_wa_pattern` from model config to determine window attention behavior, UTUVL hardcodes it. If this is intentional (UTUVL always requires window attention), consider removing the dead else branch or adding a comment explaining this design choice.

🔎 Remove dead code

```diff
     case PROJECTOR_TYPE_UTUVL:
         {
-            const bool use_window_attn = true;
             const int merge_ratio = 2;
             const int pw = image_size_width / patch_size / merge_ratio; // patches after merger
             const int ph = image_size_height / patch_size / merge_ratio;
             const int ipw = image_size_width / patch_size; // internal patches in ViT
             const int iph = image_size_height / patch_size;

             std::vector<int> idx (ph * pw);
             std::vector<int> inv_idx(ph * pw);

-            if (use_window_attn) {
+            // UTUVL always uses window attention
+            {
                 const int attn_window_size = patch_size * 2 * 8;
                 // ... rest of window attention code ...
-            } else {
-                for (int i = 0; i < ph * pw; i++) {
-                    idx[i] = i;
-                }
             }
```

convert_hf_to_gguf.py (2)
7138-7237: DeepseekV2 MoE GGUF metadata wiring for UTUVL is reasonable but a bit implicit.

- Registering `"UTUVLForCausalLM"` on `DeepseekV2Model` matches how other multimodal families register both text and mmproj variants under the same HF arch name; no issue there.
- In `set_gguf_parameters`, the new MoE section:
  - Falls back from `moe_intermediate_size` to `intermediate_size` (or `0`) for `expert_feed_forward_length`, which keeps GGUF consistent for both MoE and dense-only DeepseekV2-style configs.
  - Writes `expert_count`, `expert_shared_count` (default `0`), `expert_weights_scale` (default `1.0`), and `expert_weights_norm` when `norm_topk_prob` is truthy. This mirrors how other MoE models in this file are handled and should be safe.

Two minor nits you might want to consider (not blockers):

- If a model is not MoE but happens to have an `intermediate_size`, you'll still emit `expert_feed_forward_length` with a non-zero value; if that's unintended, consider guarding all MoE metadata behind a single `if hparams.get("n_routed_experts") is not None:` block.
- `add_expert_weights_norm` is called with `hparams["norm_topk_prob"]`, which appears to be boolean in Deepseek configs; if any future variant uses this field differently, this call might need to be revisited.

Given this mirrors upstream behavior, I'd leave it as-is unless you see non-MoE DeepseekV2 models hitting this path.
10501-10541: UtuVLVisionModel implementation looks sound; small polish opportunities only.

Behavior:

- Registration under both `"UtuVLForConditionalGeneration"` and `"UTUVLForCausalLM"` on an `MmprojModel` mirrors how other VL families (e.g. Qwen2VL) expose both text and vision variants; this is consistent.
- `set_gguf_parameters`:
  - Calls `super().set_gguf_parameters()` so all generic vision/audio metadata is preserved.
  - Sets projector type to `UTUVL`.
  - Uses `self.hparams["layer_norm_eps"]` (inherited from `MmprojModel`, which rebinds `self.hparams` to the vision config) and a tolerant activation check:
    - Treats several GELU aliases (`gelu_pytorch_tanh`, etc.) as `vision_use_gelu=True`.
    - Supports `silu` via `vision_use_silu=True`.
    - Raises `ValueError` for anything else, which is an appropriate hard-fail given unknown activations would break runtime.
  - Writes `vision_spatial_merge_size` with a default of `2` when missing, which is at least explicit and will not crash.
- `modify_tensors`:
  - Cleanly drops obvious language-model tensors (`lm_head.`, `model.layers.`, `model.embed_tokens.`, `model.norm.`).
  - For everything else, delegates to `self.map_tensor_name(name)` and, on failure, logs a warning and silently drops the tensor. That's in line with other vision subclasses that prefer "best-effort + warning" over hard failure.

Optional polish (not required):

- Ruff's TRY003/TRY300 hints here are stylistic only:
  - The `ValueError(f"Unsupported activation function for UTUVL: {hidden_act}")` could be shortened slightly, but it's already clear and localized.
  - The `try/except ValueError` in `modify_tensors` is fine as written; splitting the `return []` into an `else` block would be marginally clearer but not necessary.

From a functional standpoint, this class fits well with the existing multimodal patterns in this file.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/llama-vocab.cpp
- src/llama-vocab.h
- src/unicode.cpp
- tools/mtmd/CMakeLists.txt
- tools/mtmd/clip-impl.h
- tools/mtmd/clip.cpp
- tools/mtmd/models/models.h
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
`**/*.{cpp,h,hpp}`: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

- src/llama-vocab.h
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- src/llama-vocab.h
- gguf-py/gguf/constants.py
- tools/mtmd/models/models.h
- src/llama-vocab.cpp
- tools/mtmd/models/utuvl.cpp
- tools/mtmd/mtmd.cpp
- src/unicode.cpp
- tools/mtmd/clip.cpp
- gguf-py/gguf/tensor_mapping.py
- tools/mtmd/clip-impl.h
- convert_hf_to_gguf.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
`**/*.py`: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`

Files:

- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- convert_hf_to_gguf.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library

Files:

- src/llama-vocab.cpp
- src/unicode.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
- tools/mtmd/CMakeLists.txt
- tools/mtmd/mtmd.cpp
- tools/mtmd/clip.cpp
- tools/mtmd/clip-impl.h
🧬 Code graph analysis (3)
tools/mtmd/models/models.h (1)
tools/mtmd/models/utuvl.cpp (2)
- `build` (3-190)
- `build` (3-3)
tools/mtmd/models/utuvl.cpp (1)
ggml/src/ggml.c (9)
- `ggml_reshape_4d` (3583-3601)
- `ggml_permute` (3700-3752)
- `ggml_cont_4d` (3492-3508)
- `ggml_mul_mat` (3174-3189)
- `ggml_add` (1969-1974)
- `ggml_reshape_2d` (3546-3562)
- `ggml_set_name` (1841-1848)
- `ggml_cast` (3433-3445)
- `ggml_build_forward_expand` (6793-6795)
convert_hf_to_gguf.py (3)
gguf-py/gguf/gguf_writer.py (10)
- `add_expert_feed_forward_length` (714-715)
- `add_expert_count` (800-801)
- `add_expert_shared_count` (806-807)
- `add_expert_weights_scale` (815-816)
- `add_expert_weights_norm` (818-819)
- `add_clip_projector_type` (1083-1084)
- `add_vision_attention_layernorm_eps` (1104-1105)
- `add_vision_use_gelu` (1122-1123)
- `add_vision_use_silu` (1125-1126)
- `add_vision_spatial_merge_size` (1119-1120)

convert_lora_to_gguf.py (1)

- `modify_tensors` (448-473)

gguf-py/gguf/constants.py (1)

- `VisionProjectorType` (3414-3435)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
10521-10521: Avoid specifying long messages outside the exception class
(TRY003)
10536-10536: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
- GitHub Check: build-cmake-pkg / linux
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: ubuntu-24-cmake-vulkan-deb
- GitHub Check: ubuntu-24-cmake-webgpu
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: ios-xcode-build
- GitHub Check: macOS-latest-cmake-ios
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: macOS-latest-cmake-arm64
🔇 Additional comments (19)
src/unicode.cpp (2)
967-971: LGTM! Unicode letter subcategories correctly added.

The addition of Lu, Ll, Lt, Lm, and Lo subcategories mapping to the LETTER flag is correct and follows the existing pattern. The inline comments clearly document each subcategory type.
1082-1103: LGTM! Dynamic \p{...} pattern resolution is well-implemented.

The replacement of fixed-length pattern matching with dynamic resolution correctly:
- Searches for the closing brace dynamically (Line 1087)
- Validates the pattern length with a reasonable limit (Line 1088)
- Conditionally inserts brackets to avoid nesting when not inside a character class (Lines 1091-1098)
- Advances the loop index correctly past the matched pattern (Line 1099)
The boundary checks are sound, and the logic handles varying-length Unicode property patterns correctly.
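To make the described bracket-wrapping concrete, here is a minimal standalone sketch of the same parsing idea. It is not the actual src/unicode.cpp implementation: the property table, its replacement bodies, and the 32-character length limit are placeholder assumptions for illustration only.

```cpp
// Illustrative sketch only, not the real unicode.cpp code.
// Resolves \p{NAME} occurrences against a hypothetical property table,
// wrapping the replacement in [...] unless already inside a character class.
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

static std::string resolve_p_patterns(const std::string & regex) {
    // placeholder bodies; the real code maps categories to collapsed codepoint flags
    static const std::map<std::string, std::string> props = {
        { "L", "A-Za-z" },
        { "N", "0-9"    },
    };
    std::string out;
    bool inside_class = false;
    for (size_t i = 0; i < regex.size(); ++i) {
        if (regex[i] == '[') inside_class = true;
        if (regex[i] == ']') inside_class = false;
        if (regex[i] == '\\' && i + 2 < regex.size() && regex[i + 1] == 'p' && regex[i + 2] == '{') {
            const size_t close = regex.find('}', i + 3);              // search for the closing brace dynamically
            if (close == std::string::npos || close - (i + 3) > 32) { // validate pattern length (assumed limit)
                throw std::runtime_error("malformed \\p{...} pattern");
            }
            const std::string name = regex.substr(i + 3, close - (i + 3));
            const auto it = props.find(name);
            if (it == props.end()) {
                throw std::runtime_error("unknown property: " + name);
            }
            // only insert brackets when not already inside a character class
            out += inside_class ? it->second : "[" + it->second + "]";
            i = close;                                                // advance past the matched pattern
            continue;
        }
        out += regex[i];
    }
    return out;
}

int main() {
    // "[\p{L}\p{N}]+|\p{L}+" -> "[A-Za-z0-9]+|[A-Za-z]+"
    return resolve_p_patterns("[\\p{L}\\p{N}]+|\\p{L}+").empty() ? 1 : 0;
}
```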
gguf-py/gguf/constants.py (1)
3414-3435: UTUVL projector string is consistent with existing naming.

`VisionProjectorType.UTUVL = "utuvl"` aligns with the new C++ projector type name and other projector strings; no issues here.

gguf-py/gguf/tensor_mapping.py (1)
1218-1458: UTUVL/Siglip2 tensor mappings look consistent; validate against real checkpoints.

The added `"merger.*"` and `"siglip2.vision_model.*"` source keys for V_MMPROJ, V_ENC_* and related tensors are consistent with existing patterns for other vision models and correctly target the corresponding `MODEL_TENSOR` entries. Behavior remains purely additive.

It would still be good to run at least one UTUVL HF → GGUF conversion and load in mtmd to confirm all expected tensors are picked up and no additional Siglip2 variants are missing.
tools/mtmd/CMakeLists.txt (1)
5-31: UTUVL model source correctly wired into mtmd target.

Adding `models/utuvl.cpp` to the `mtmd` library matches the pattern of other model sources and is sufficient to include the new graph implementation in the build.
163-221: Enum and string mapping for UTUVL are correctly aligned.

`PROJECTOR_TYPE_UTUVL` is added to `projector_type` and mapped to `"utuvl"` in `PROJECTOR_TYPE_NAMES`, matching the Python-side constant and gguf metadata string. No issues seen.
9-55: New UTU_VL pre-tokenizer enum value is fine.

Appending `LLAMA_VOCAB_PRE_TYPE_UTU_VL = 43` to `llama_vocab_pre_type` is consistent with existing entries and keeps ordinal values stable. Assuming the corresponding handling is implemented in `llama-vocab.cpp`, this looks good.
268-290: UTUVL reuses Qwen-style vision delimiters correctly.

Including `PROJECTOR_TYPE_UTUVL` in the Qwen2VL/Qwen25VL/Qwen3VL branch makes UTUVL use `<|vision_start|>` / `<|vision_end|>` markers, which is consistent with how other Qwen-based VL projectors are handled.

Please double-check that the UTUVL text tokenizer indeed defines these exact special tokens.
tools/mtmd/models/models.h (1)
25-28: UTUVL clip graph declaration matches existing pattern.

`clip_graph_utuvl` follows the same inheritance and constructor pattern as other `clip_graph_*` types and declares a virtual `build()` override, which fits the documented implementation in `utuvl.cpp`.
317-322: LGTM - UTU_VL regex patterns are well-structured.

The regex expressions correctly handle Korean Hangul, CJK punctuation, Bopomofo, and Chinese/Japanese characters, followed by a comprehensive pattern for mixed-case words with contractions. This follows the established pattern for other multilingual tokenizers in the codebase.
tools/mtmd/models/utuvl.cpp (3)
3-14: LGTM - Clean initialization with appropriate assertions.

The setup correctly initializes window attention parameters. The `batch_size = 1` constraint is properly enforced via `GGML_ASSERT` statements later in the code (lines 79, 146).
18-49: LGTM - Window partitioning logic for patch embeddings.

The sequence of reshape/permute/cont operations correctly reorganizes patches into a windowed format for efficient attention computation. The math appears consistent: patches are organized into 2×2 windows (m=2) and the final shape matches the expected `(3*patch_size*patch_size, Hm*Wm*m*m, 1)` before the patch embedding projection.
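For intuition about the reordering this achieves, here is a tiny standalone sketch that prints raw row-major patch indices grouped into 2×2 windows; the 4×6 patch grid is an arbitrary example size, not a UTUVL parameter.

```cpp
// Toy sketch of the 2x2 window reordering the reshape/permute sequence
// produces: patches are emitted block by block rather than row by row.
#include <cstdio>

int main() {
    const int m = 2;                 // merge window side
    const int H = 4, W = 6;          // patch grid (example only)
    const int Hm = H / m, Wm = W / m;
    for (int hm = 0; hm < Hm; ++hm) {
        for (int wm = 0; wm < Wm; ++wm) {
            for (int dy = 0; dy < m; ++dy) {
                for (int dx = 0; dx < m; ++dx) {
                    const int row = hm * m + dy;
                    const int col = wm * m + dx;
                    std::printf("%2d ", row * W + col); // raw row-major patch index
                }
            }
            std::printf("| ");       // one 2x2 window per group
        }
        std::printf("\n");
    }
    return 0;
}
```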
86-138: LGTM - Standard transformer layer implementation with windowed attention.

The layer loop correctly implements:

- Pre-norm architecture with residual connections
- Q/K/V projections with RoPE positional encoding
- Windowed attention with full attention every 8th layer and on the last layer (`(il + 1) % 8 == 0 || il == n_layer - 1`); see the small schedule sketch after this list
- FFN with configurable activation
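A minimal sketch of that alternation schedule, purely to visualize which layers would use full attention; the layer count of 27 is an example value, not UTUVL's actual depth.

```cpp
// Prints which layers get full vs. windowed attention under the rule
// ((il + 1) % 8 == 0) || (il == n_layer - 1).
#include <cstdio>

int main() {
    const int n_layer = 27; // example depth only
    for (int il = 0; il < n_layer; ++il) {
        const bool full_attn = ((il + 1) % 8 == 0) || (il == n_layer - 1);
        std::printf("layer %2d: %s\n", il, full_attn ? "full attention" : "windowed attention");
    }
    return 0;
}
```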
tools/mtmd/clip.cpp (5)
848-851: LGTM! The UTUVL case follows the established pattern for other projector types in the graph builder switch statement.
1511-1518: LGTM! Tensor loading correctly loads UTUVL-specific tensors including the RMS normalization weight for the merger.
2993-2994: LGTM! Token counting functions correctly include UTUVL alongside related projector types (QWEN2VL, QWEN25VL, QWEN3VL, GLM4V) that share the same merge-based output token calculation.
Also applies to: 3009-3010, 3070-3076
3477-3496: LGTM! Position calculation for UTUVL correctly remaps positions based on window grouping, following the same pattern as QWEN25VL but with parameterized `merge_ratio`, which is cleaner.
3668-3669: LGTM! The embedding dimension correctly uses `mm_1_b->ne[0]`, consistent with tensor loading and similar projector types.
1176-1177: UTUVL tokenizer hash entry looks consistent; just ensure runtime support exists.

The new `chkhsh == "9d7013..."` → `res = "utu-vl"` branch follows the same pattern as other models and should work as long as the C++ side and `convert_hf_to_gguf_update.py` have been updated to recognize `"utu-vl"` as a valid `tokenizer.ggml.pre` value. Nothing functionally wrong here from this script's perspective; just double-check the upstream/runtime side is in sync so this identifier isn't treated as an "unknown pre-tokenizer".
Actionable comments posted: 1
🧹 Nitpick comments (2)
tools/mtmd/models/utuvl.cpp (1)
41-41: Minor: Inconsistent spacing in dimension calculation.

Line 41 has inconsistent spacing: `3*patch_size* patch_size` should be `3 * patch_size * patch_size` for consistency with the rest of the codebase.

Proposed fix

```diff
-        3*patch_size* patch_size, Hm * Wm * m * m, 1);
+        3 * patch_size * patch_size, Hm * Wm * m * m, 1);
```

tools/mtmd/clip.cpp (1)
1165-1175: Consider adding guidance to the warning message.

The warning at lines 1173-1174 only states the minimum requirement but doesn't provide guidance on how to fix it, unlike the Qwen-VL warning (lines 1160-1162), which suggests `--image-min-tokens 1024`.

Proposed enhancement

```diff
     if (hparams.image_min_pixels < warn_min_pixels) {
         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
     }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
`**/*.{cpp,h,hpp}`: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
- tools/mtmd/clip.cpp
- tools/mtmd/models/utuvl.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/models/utuvl.cpp (3)
64-83: LGTM! The window attention setup correctly handles index remapping and mask creation, with appropriate F16 casting for flash attention compatibility.
86-138: LGTM! The transformer layer implementation correctly handles:
- Window/full attention alternation pattern (every 8th layer + final layer uses full attention)
- Multi-head attention with separate Q/K/V projections
- RoPE positional encoding via `ggml_rope_multi`
- Proper residual connections and FFN processing
160-160: Verify the hardcoded epsilon value for merger RMS norm.

Line 160 uses a hardcoded `1e-6` for the RMS normalization epsilon, while other `build_norm` calls in this function use the `eps` member variable. This may be intentional if the VLPatchMerger requires a different epsilon, but it's worth confirming this matches the original model specification.

tools/mtmd/clip.cpp (5)
848-851: LGTM! The graph builder selection for UTUVL follows the established pattern for other projector types.
1511-1518: LGTM! The tensor loading correctly loads the merger normalization weight and MLP projection weights, matching the graph construction in `utuvl.cpp`.
2993-2994: LGTM! The token counting logic for UTUVL correctly shares the same calculation as Qwen and GLM4V models, consistent with the 2x spatial merge configuration.
Also applies to: 3009-3010, 3070-3071
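A quick arithmetic sketch of that shared merge-based count; the patch size of 16 and the 1024×768 input are example values only, not UTUVL's actual configuration.

```cpp
// Output image tokens = (W / patch / merge) * (H / patch / merge).
#include <cstdio>

int main() {
    const int patch_size = 16;    // example value
    const int merge      = 2;     // 2x spatial merge
    const int width = 1024, height = 768;
    const int tokens_w = width  / patch_size / merge;  // 32
    const int tokens_h = height / patch_size / merge;  // 24
    std::printf("output image tokens: %d\n", tokens_w * tokens_h); // 768
    return 0;
}
```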
3597-3598: LGTM! The embedding size resolution correctly uses `mm_1_b->ne[0]`, consistent with the merger MLP output tensor loaded for UTUVL.
3345-3360:

```cpp
    while (scale > 0.0f) {
        target_height = get_scaled_image_size(scale, original_size.height);
        target_width = get_scaled_image_size(scale, original_size.width);

        int num_patches_h = target_height / patch_size;
        int num_patches_w = target_width / patch_size;
        int num_patches = num_patches_h * num_patches_w;

        if (num_patches > max_num_patches) {
            scale -= 0.02f;
        } else {
            break;
        }
    }
```
Misleading comment: this is a linear search, not binary search.
The comment says "Binary search with 0.02 step size" but the implementation is actually a linear search that decrements the scale by 0.02 until the patch count fits within limits. While this works correctly, the comment is inaccurate.
Proposed fix
```diff
-    // Binary search with 0.02 step size
+    // Linear search: decrement scale by 0.02 until patches fit
     while (scale > 0.0f) {
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/mtmd/clip.cpp around lines 2733 to 2746, the existing comment
incorrectly states "Binary search with 0.02 step size" while the code actually
performs a linear decrement of scale by 0.02 until the number of patches fits;
update the comment to accurately describe the behavior (e.g., "Linear search:
decrement scale by 0.02 until patch count <= max_num_patches") or, if you prefer
true binary search, replace the loop with a proper binary search over scale that
narrows high/low bounds until the target resolution yields <= max_num_patches
(ensure to preserve the 0.02 resolution or converge to a tolerance).
Mirror from upstream PR: ggml-org#18315
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.