[Mirror] mtmd: Add DeepSeekOCR Support #66
Conversation
init commit
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
…ut in deepseek2 model
…e image decoding fails
…rge_#17965 # Conflicts: # src/llama-kv-cache.cpp # tools/mtmd/clip.cpp
python test script for deepseek-ocr testing OCR on text-1.jpeg newspaper image checking against expected reference model output for Free-OCR and Markdown
# Conflicts: # convert_hf_to_gguf.py
# Conflicts: # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # tools/mtmd/clip-impl.h # tools/mtmd/clip.cpp # tools/mtmd/models/models.h
# Conflicts: # src/llama-arch.cpp
Walkthrough
Introduces comprehensive DeepSeek-OCR model support across the inference engine and tooling. Adds architecture definitions, model conversion logic with vision model handler, GGUF serialization support, CLIP vision integration with SAM attention mechanism, chat template, and testing infrastructure for OCR-capable vision models.
Changes
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
The changes introduce a new model family across heterogeneous subsystems (conversion tooling, inference engine, vision models) with multiple interacting components: DeepseekOCRVisionModel for conversion, OCR-specific tensor handling and LoRA/MLA skipping in model loading, SAM windowed attention mechanism with relative position embeddings, image preprocessing pipelines with bicubic resampling, and extensive GGUF serialization support. While individual file changes are often straightforward, the cumulative scope spans vision model integration, attention mechanism variants, and architecture-specific branching in multiple critical paths, requiring careful verification of tensor mapping consistency, attention computation correctness, and image preprocessing alignment with reference implementations.
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment
Actionable comments posted: 9
🧹 Nitpick comments (14)
ggml/src/ggml-cuda/upscale.cu (1)
292-293: Good defensive programming; consider a more descriptive error message.
The fallback abort is appropriate and prevents undefined behavior when an unrecognized scale mode is encountered. However, the error message could be more helpful for debugging by including the actual mode value.
🔎 Suggested improvement for error message

```diff
 } else {
-    GGML_ABORT("fatal error");
+    GGML_ABORT("unsupported upscale mode: %d", mode);
 }
```

ggml/src/ggml.c (2)
4906-4920: New F32-only precondition in `ggml_interpolate_impl` may reject non‑F32 callers
`GGML_ASSERT(a->type == GGML_TYPE_F32);` tightens the contract for all `ggml_upscale*` / `ggml_interpolate*` paths to F32 input. That is consistent with most existing kernels (which are F32‑centric), but it will now abort in debug builds if any existing or third‑party code feeds F16/BF16 tensors here. Given this is a mirrored upstream change, it's probably fine, but it's worth double‑checking that:
- No current llama.cpp callsites rely on non‑F32 interpolate/upscale.
- External bindings (Rust, Python, etc.) aren't exposing `ggml_interpolate` against F16/BF16 tensors.

If any such usage exists, consider either inserting an explicit cast to F32 at callsites or relaxing the assert (e.g., allowing F16/BF16 and handling conversion in the backend).
Based on learnings, this should be relayed upstream if it breaks callers.
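A minimal sketch of the first option (casting at the callsite), assuming the caller already has a ggml context and a possibly-F16 tensor; `maybe_f16_tensor` and the `ne0`..`ne3` target sizes are illustrative placeholders, not names from the PR:

```c
// hypothetical callsite guard: up-convert to F32 before interpolate,
// since ggml_interpolate_impl now asserts an F32 input
struct ggml_tensor * src = maybe_f16_tensor;
if (src->type != GGML_TYPE_F32) {
    src = ggml_cast(ctx, src, GGML_TYPE_F32);
}
struct ggml_tensor * out = ggml_interpolate(ctx, src, ne0, ne1, ne2, ne3, GGML_SCALE_MODE_BILINEAR);
```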
5256-5269: Flash-attn mask is now hard‑required to be F16 – confirm all callsites complyThe new
GGML_ASSERT(mask->type == GGML_TYPE_F16);inggml_flash_attn_extaligns with CUDA/Metal flash‑attn kernels, which already require F16 masks, but it’s a behavior change at the graph‑building layer:
- Previously, masks could be F32 in some codepaths (and are still allowed as F16/F32 for generic softmax masks).
- With this assert, any remaining F32 masks will now abort early instead of failing deeper in the backend.
Please verify that:
- All internal llama.cpp users of
ggml_flash_attn_extnow construct F16 masks (especially any older text‑only attention paths and non‑CUDA backends).- External bindings or downstream forks are not passing F32 masks.
If there is still mixed usage, one option would be to accept F32 masks here and explicitly cast them to F16 before invoking the flash‑attn backend; otherwise this assert is a good guard but may be a subtle breaking change for some integrations.
As per coding guidelines for mirrored PRs, this is likely something to bounce to the upstream authors if you see failures.
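A minimal sketch of that cast-at-the-callsite option, assuming `mask` may still arrive as F32 (the variable names are illustrative):

```c
// hypothetical guard before building the flash-attn node:
// the flash-attn path expects an F16 mask, so convert up front if needed
if (mask && mask->type != GGML_TYPE_F16) {
    mask = ggml_cast(ctx, mask, GGML_TYPE_F16);
}
// mask is then passed to ggml_flash_attn_ext as usual
```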
src/llama-kv-cache.cpp (1)
1403-1409: Commented YaRN attn_factor override is inert and safe
The added DeepSeek2/DEEPSEEK2OCR-specific YaRN `attn_factor` snippet is fully commented out and only documents a potential future override; it does not affect current behavior of `build_rope_shift()`. Keeping it as reference is fine in this mirrored PR.
tools/mtmd/clip-impl.h (1)
481-505: Harden debug helpers against null sources and unexpected tensor types
The new debugging utilities are useful, but there are a couple of edge cases worth guarding:
- `print_tensor_info` assumes `t->src[0]` is always non-null; if it's ever called on a leaf tensor, this will dereference a null pointer.
- `print_tensor_sum`, `print_tensor_data`, and `save_tensor_to_file` `GGML_ABORT` on non-{F16,F32,I{8,16,32}} types; accidentally calling them on quantized tensors (common in this codebase) would crash the process, which is a bit harsh for diagnostics.

A small defensive tweak keeps behavior robust while preserving current usage.
Proposed defensive changes for debug helpers

```diff
 static std::string to_ne_string(const ggml_tensor * t) {
 @@
-static void print_tensor_info(ggml_tensor * t) {
-    const struct ggml_tensor * src0 = t->src[0];
-    const struct ggml_tensor * src1 = t->src[1];
+static void print_tensor_info(ggml_tensor * t) {
+    const struct ggml_tensor * src0 = t->src[0];
+    const struct ggml_tensor * src1 = t->src[1];
+
+    if (!src0) {
+        printf("%s: %s (no src0)\n", t->name, ggml_type_name(t->type));
+        return;
+    }
 @@
-    printf("%s: %s = %s(%s{%s}, %s)\n",
-           t->name, ggml_type_name(t->type), ggml_op_desc(t),
-           src0->name, to_ne_string(src0).c_str(),
-           src1 ? src1_str : "");
 }
 @@
-    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
 @@
-    } else {
-        GGML_ABORT("fatal error");
-    }
+    } else {
+        // unsupported debug type, skip rather than abort
+        continue;
+    }
 @@
-    printf("%s.sum = %.6f\n", t->name, sum);
 }
 @@
-    } else {
-        GGML_ABORT("fatal error");
-    }
+    } else {
+        // unsupported debug type, skip rather than abort
+        continue;
+    }
 @@
-    const uint8_t * data = (data_ptr) ? data_ptr : (uint8_t *) tensor->data;
+    const uint8_t * data = (data_ptr) ? data_ptr : (const uint8_t *) tensor->data;
 @@
-    } else {
-        GGML_ABORT("fatal error");
-    }
+    } else {
+        // unsupported debug type, skip rather than abort
+        continue;
+    }
```

If you prefer to stay byte-for-byte with upstream in this mirror PR, consider at least the null-check for `src0` and revisiting the `GGML_ABORT` paths the next time these helpers are touched. Based on learnings, this can then be relayed back upstream.
Also applies to: 518-549, 551-604, 606-719
tools/mtmd/tests/tests-requirements.txt (1)
1-5: Consider pinning dependency versions for reproducible tests
The added dependencies are appropriate for OCR testing with embedding similarity evaluation. However, consider pinning versions (e.g., `sentence-transformers>=2.0.0,<3.0.0`) to ensure reproducible test environments across different setups and prevent unexpected breakage from upstream changes.
src/llama-model.cpp (1)
6951-6962: Optional: consider clarifying print_info for OCR/lite DeepSeek2 variants
Currently `print_info()` logs `n_lora_q` and `n_lora_kv` for both `LLM_ARCH_DEEPSEEK2` and `LLM_ARCH_DEEPSEEK2OCR`, even though OCR and "lite" variants intentionally skip loading some or all of these ranks (they'll stay at 0). That's correct but slightly opaque; if you want to be extra clear for users debugging models, you could either skip logging them for OCR, or add a brief comment/note that zero here is expected for OCR and lite configs.
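A minimal sketch of the first option (skipping the log lines for OCR); the surrounding `print_info()` structure and exact formatting are assumed, only the field names come from the comment above:

```cpp
// hypothetical: only log the MLA LoRA ranks for non-OCR DeepSeek2 variants,
// since DEEPSEEK2OCR intentionally leaves them at 0
if (arch != LLM_ARCH_DEEPSEEK2OCR) {
    LLAMA_LOG_INFO("%s: n_lora_q  = %u\n", __func__, hparams.n_lora_q);
    LLAMA_LOG_INFO("%s: n_lora_kv = %u\n", __func__, hparams.n_lora_kv);
}
```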
tools/mtmd/models/deepseekocr.cpp (1)
98-102: Consider adding an assertion for head dimension divisibility.
The calculation `d_heads = n_embd / n_heads` assumes even divisibility. While this is typically guaranteed by model configuration, an assertion could catch misconfiguration early.
🔎 Optional defensive assertion

```diff
 const int n_embd = hparams.sam_n_embd;
 const int n_layer = hparams.sam_n_layer;
 const int n_heads = hparams.sam_n_head;
+GGML_ASSERT(n_embd % n_heads == 0 && "n_embd must be divisible by n_heads");
 const int d_heads = n_embd / n_heads;
 const int window = hparams.attn_window_size;
```

tools/mtmd/tests/test-deepseek-ocr.py (1)
126-126: Misleading step numbering: "[2/3]" without a "[1/3]".
The print statement indicates step 2 of 3, but there's no "[1/3]" step visible. Consider either adding a "[1/3] Validating paths..." step or adjusting the numbering.
convert_hf_to_gguf.py (1)
1698-1698: Annotate `MmprojModel.n_block_keys` as `ClassVar` (RUF012)
`n_block_keys` is a mutable class attribute and Ruff is flagging it. Declaring it as a `ClassVar` (or making it an immutable tuple) will address the lint and better reflect intent.
Proposed minimal change

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
 @@
 class MmprojModel(ModelBase):
 @@
-    n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
+    n_block_keys: ClassVar[list[str]] = [
+        "n_layers",
+        "num_hidden_layers",
+        "n_layer",
+        "num_layers",
+        "depth",
+        "layers",
+        "encoder_layers",
+    ]
```

tools/mtmd/clip.cpp (4)
1200-1210: Minor formatting: inconsistent indentation at closing brace.
Line 1210 has an extra leading space before `} break;` compared to other case blocks in this switch statement.
Proposed fix

```diff
         get_u32(KEY_ATTN_WINDOW_SIZE, hparams.attn_window_size, true);
-     } break;
+    } break;
```
2399-2419: Consider adding input validation for edge cases.
The `resize_bicubic_pillow` function doesn't validate that `target_width` and `target_height` are positive before proceeding. While the caller (`resize()`) should provide valid dimensions, defensive checks would prevent potential division-by-zero or empty buffer issues.
Proposed validation

```diff
 static bool resize_bicubic_pillow(const clip_image_u8 & img, clip_image_u8 & dst, int target_width, int target_height) {
+    if (target_width <= 0 || target_height <= 0 || img.nx <= 0 || img.ny <= 0) {
+        return false;
+    }
+
     // Fixed-point precision: 22 bits = 32 (int32_t) - 8 (uint8_t pixels) - 2 (headroom for accumulation)
```
2990-2998: Float equality comparison may have precision issues.
At line 2993, `else if (diff == best_diff)` compares floating-point values for exact equality. Due to floating-point precision, this comparison might not trigger when mathematically expected, or might trigger unexpectedly. Consider using a small epsilon for comparison.
However, if this is intentionally mirroring Python's behavior for compatibility, this may be acceptable.
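As a minimal sketch of the epsilon variant (the tolerance value and helper name are assumptions, not from the PR):

```cpp
#include <cmath>

// hypothetical tolerance-based comparison instead of exact float equality
static bool approx_equal(float a, float b, float eps = 1e-6f) {
    return std::fabs(a - b) < eps;
}

// usage at the flagged comparison site:
//   } else if (approx_equal(diff, best_diff)) { ... }
```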
3268-3295: Consider using existing `img_tool` helpers for padding and compositing.
The manual padding fill loop (lines 3275-3280) and image copy loop (lines 3287-3295) duplicate functionality already available in `img_tool::fill()` and `img_tool::composite()`. Using these helpers would improve consistency and reduce code duplication.
Proposed refactor

```diff
     clip_image_u8_ptr padded_img(clip_image_u8_init());
     padded_img->nx = image_size;
     padded_img->ny = image_size;
-    padded_img->buf.resize(image_size * image_size * 3); // black padding
+    padded_img->buf.resize(image_size * image_size * 3);

-    // Fill with mean color
-    for (int i = 0; i < image_size * image_size; ++i)
-    {
-        padded_img->buf[i * 3 + 0] = pad_r;
-        padded_img->buf[i * 3 + 1] = pad_g;
-        padded_img->buf[i * 3 + 2] = pad_b;
-    }
+    // Fill with mean color
+    img_tool::fill(*padded_img, {pad_r, pad_g, pad_b});

     // Calculate padding offsets (center the image)
     int pad_x = (image_size - new_w) / 2;
     int pad_y = (image_size - new_h) / 2;

-    // Copy scaled image into padded canvas
-    for (int y = 0; y < new_h; ++y){
-        for (int x = 0; x < new_w; ++x){
-            int src_idx = (y * new_w + x) * 3;
-            int dst_idx = ((y + pad_y) * image_size + (x + pad_x)) * 3;
-            padded_img->buf[dst_idx + 0] = scaled_img->buf[src_idx + 0];
-            padded_img->buf[dst_idx + 1] = scaled_img->buf[src_idx + 1];
-            padded_img->buf[dst_idx + 2] = scaled_img->buf[src_idx + 2];
-        }
-    }
+    // Copy scaled image into padded canvas
+    img_tool::composite(*padded_img, *scaled_img, pad_x, pad_y);
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (30)
convert_hf_to_gguf.py, ggml/src/ggml-cuda/upscale.cu, ggml/src/ggml.c, gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py, gguf-py/gguf/tensor_mapping.py, src/llama-arch.cpp, src/llama-arch.h, src/llama-chat.cpp, src/llama-chat.h, src/llama-graph.cpp, src/llama-kv-cache.cpp, src/llama-model.cpp, src/llama-vocab.cpp, src/models/deepseek2.cpp, tools/mtmd/CMakeLists.txt, tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/clip.h, tools/mtmd/models/deepseekocr.cpp, tools/mtmd/models/glm4v.cpp, tools/mtmd/models/models.h, tools/mtmd/models/siglip.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/tests.sh, tools/mtmd/tests/test-1-extracted.md, tools/mtmd/tests/test-1-extracted.txt, tools/mtmd/tests/test-deepseek-ocr.py, tools/mtmd/tests/tests-requirements.txt
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
src/llama-arch.h, tools/mtmd/clip.h, src/llama-chat.h, tools/mtmd/models/siglip.cpp, src/llama-vocab.cpp, src/models/deepseek2.cpp, tools/mtmd/models/models.h, src/llama-chat.cpp, tools/mtmd/models/deepseekocr.cpp, src/llama-kv-cache.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip.cpp, src/llama-graph.cpp, tools/mtmd/models/glm4v.cpp, tools/mtmd/clip-model.h, src/llama-arch.cpp, src/llama-model.cpp, tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
src/llama-arch.h, tools/mtmd/clip.h, src/llama-chat.h, tools/mtmd/models/siglip.cpp, src/llama-vocab.cpp, src/models/deepseek2.cpp, tools/mtmd/models/models.h, gguf-py/gguf/tensor_mapping.py, tools/mtmd/tests/test-deepseek-ocr.py, src/llama-chat.cpp, tools/mtmd/models/deepseekocr.cpp, src/llama-kv-cache.cpp, gguf-py/gguf/gguf_writer.py, tools/mtmd/mtmd.cpp, tools/mtmd/clip.cpp, gguf-py/gguf/constants.py, src/llama-graph.cpp, tools/mtmd/models/glm4v.cpp, tools/mtmd/clip-model.h, src/llama-arch.cpp, src/llama-model.cpp, convert_hf_to_gguf.py, tools/mtmd/clip-impl.h
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
src/llama-vocab.cpp, src/models/deepseek2.cpp, src/llama-chat.cpp, src/llama-kv-cache.cpp, src/llama-graph.cpp, src/llama-arch.cpp, src/llama-model.cpp
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
gguf-py/gguf/tensor_mapping.py, tools/mtmd/tests/test-deepseek-ocr.py, gguf-py/gguf/gguf_writer.py, gguf-py/gguf/constants.py, convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
src/llama-arch.h, src/llama-kv-cache.cpp, src/llama-graph.cpp, src/llama-arch.cpp, src/llama-model.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
src/llama-arch.h
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/tensor_mapping.py, tools/mtmd/mtmd.cpp, tools/mtmd/clip.cpp, tools/mtmd/CMakeLists.txt, tools/mtmd/clip-impl.h
🧬 Code graph analysis (8)
tools/mtmd/clip.h (1)
  tools/mtmd/clip.cpp (4)
    clip_is_deepseekocr (4080-4082), clip_is_deepseekocr (4080-4080), ctx (2690-2793), ctx (2690-2690)
tools/mtmd/models/siglip.cpp (2)
  ggml/src/ggml.c (1)
    ggml_mul_mat (3174-3189)
  tools/mtmd/clip.cpp (5)
    model (218-220), model (937-1266), model (937-937), model (1982-1995), model (1982-1982)
tools/mtmd/models/models.h (2)
  tools/mtmd/clip.cpp (14)
    clip_graph (227-254), clip_ctx (165-208), clip_ctx (210-215), ctx (2690-2793), ctx (2690-2690), img (1786-1786), img (2283-2289), img (2283-2283), img (2329-2390), img (2329-2329), img (2399-2639), img (2399-2399), img (2795-2827), img (2795-2795)
  tools/mtmd/models/deepseekocr.cpp (2)
    build (91-324), build (91-91)
tools/mtmd/models/deepseekocr.cpp (1)
  ggml/src/ggml.c (9)
    ggml_reshape_4d (3583-3601), ggml_cont (3461-3465), ggml_permute (3700-3752), ggml_view_4d (3676-3696), ggml_reshape_3d (3564-3581), ggml_interpolate (4949-4958), ggml_get_rows (3776-3797), ggml_cast (3433-3445), ggml_concat (2517-2544)
gguf-py/gguf/gguf_writer.py (1)
  gguf-py/gguf/constants.py (3)
    Keys (20-332), ClipVision (282-309), SAM (306-309)
src/llama-model.cpp (1)
  src/llama-model-loader.cpp (2)
    create_tensor (794-815), create_tensor (794-794)
convert_hf_to_gguf.py (2)
  gguf-py/gguf/gguf_writer.py (8)
    add_clip_projector_type (1076-1077), add_vision_attention_layernorm_eps (1097-1098), add_vision_use_gelu (1118-1119), add_vision_projector_scale_factor (1124-1125), add_vision_window_size (1115-1116), add_vision_sam_layers_count (1133-1134), add_vision_sam_embedding_length (1136-1137), add_vision_sam_head_count (1139-1140)
  gguf-py/gguf/constants.py (3)
    VisionProjectorType (3491-3512), GGMLQuantizationType (3366-3398), MODEL_ARCH (347-456)
tools/mtmd/clip-impl.h (1)
  ggml/src/ggml.c (3)
    ggml_type_name (1282-1284), ggml_op_desc (1306-1316), ggml_nelements (1224-1228)
🪛 LanguageTool
tools/mtmd/tests/test-1-extracted.md
[grammar] ~48-~48: Use a hyphen to join words.
Context: ...det|> APOLLO CONTROL: The most major sky to sky will be for the 23 event, that is...
(QB_NEW_EN_HYPHEN)
[grammar] ~48-~48: Use a hyphen to join words.
Context: ...|> APOLLO CONTROL: The most major sky to sky will be for the 23 event, that is at...
(QB_NEW_EN_HYPHEN)
🪛 Ruff (0.14.10)
tools/mtmd/tests/test-deepseek-ocr.py
41-41: subprocess call: check for execution of untrusted input
(S603)
51-51: Avoid specifying long messages outside the exception class
(TRY003)
convert_hf_to_gguf.py
1698-1698: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
5998-5998: Avoid specifying long messages outside the exception class
(TRY003)
6008-6008: Unused method argument: new_name
(ARG002)
6008-6008: Unused method argument: bid
(ARG002)
6008-6008: Unused method argument: n_dims
(ARG002)
6015-6015: Unused method argument: bid
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: windows-2022-cmake-cuda (12.4)
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: windows-latest-cmake-hip
- GitHub Check: macOS-latest-cmake-visionos
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-24-cmake-vulkan
- GitHub Check: ubuntu-22-cmake-hip
- GitHub Check: ubuntu-24-wasm-webgpu
- GitHub Check: macOS-latest-cmake-arm64-webgpu
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: server-windows
- GitHub Check: pyright type-check
🔇 Additional comments (41)
tools/mtmd/tests.sh (3)
156-156: LGTM! The flash attention parameter is correctly forwarded to the test command with proper quoting.
111-112: Verify that test validation logic works for OCR-specific tests.
These new OCR models use prompts like "Free OCR." and "extract all texts from this image", but the test validation logic (lines 169-175) still checks for "new york" or ("men" AND "walk"). This validation assumes the test image contains specific content. Ensure that the default test image (`test-1.jpeg`) actually contains text that would make OCR extraction meaningful and that the validation criteria remain appropriate.
111-111: Change `--chat-template deepseek-ocr` to `--chat-template deepseek`.
The official llama.cpp DeepSeek-OCR support PR uses `--chat-template deepseek`, not `deepseek-ocr`. Update the test command accordingly.
Likely an incorrect or invalid review comment.
tools/mtmd/CMakeLists.txt (1)
30-30: DeepSeekOCR model correctly added to mtmd target
Including `models/deepseekocr.cpp` in the `mtmd` library sources looks consistent with how other vision models are wired; no further build changes needed here.
tools/mtmd/models/models.h (1)
60-63: clip_graph_deepseekocr declaration matches existing clip_graph pattern
Constructor and `build()` override follow the same convention as other `clip_graph_*` structs; this is safe and consistent.
tools/mtmd/clip.h (1)
105-111: DeepSeekOCR predicate integrates cleanly with existing clip_is_* helpers
`clip_is_deepseekocr(const struct clip_ctx * ctx)` matches the existing pattern (e.g., `clip_is_llava`, `clip_is_gemma3`) and its implementation uses `ctx->proj_type() == PROJECTOR_TYPE_DEEPSEEKOCR`, which is consistent with the new projector type.
src/llama-vocab.cpp (1)
2353-2375: DeepSeek OCR end-of-sentence token correctly marked as EOG
Adding `"<|end▁of▁sentence|>"` to the EOG-text list keeps `special_eog_ids` consistent with the earlier `special_eot_id` auto-detection for the same token, so generation will now stop on it as intended for DeepSeek OCR-style models.
tools/mtmd/tests/test-1-extracted.txt (1)
1-42: OCR test fixture content looks fine
The added text file is pure test data and is appropriate as a moderately complex OCR sample; no code or tooling concerns here.
tools/mtmd/clip-impl.h (1)
56-58: DeepSeek OCR keys, tensor names, and projector type wiring look consistent
- SAM hyperparameter keys (`KEY_SAM_N_HEAD`, `KEY_SAM_N_BLOCK`, `KEY_SAM_N_EMBD`) follow the existing `clip.vision.*` naming scheme.
- Updated `TN_IMAGE_NEWLINE` to `"v.image_newline"` and new `TN_IMAGE_SEPERATOR` / `"v.view_seperator"` align with the other `v.*` vision tensor prefixes and the strings used in the DeepSeek-OCR graph / converter.
- `TN_MM_PROJECTOR` changed to `"mm.model.fc.%s"` is compatible with models that expose both weight and bias under the same prefix, and is used alongside the `PROJECTOR_TYPE_IDEFICS3` / `PROJECTOR_TYPE_DEEPSEEKOCR` paths.
- `TN_SAM_*` macros for SAM blocks (pos embedding, qkv, neck, net, etc.) match the tensor names consumed in `models/deepseekocr.cpp` and give a clear, consistent naming pattern for SAM layers.
- `PROJECTOR_TYPE_DEEPSEEKOCR` is added to the `projector_type` enum and registered in `PROJECTOR_TYPE_NAMES` as `"deepseekocr"`, which keeps `clip_projector_type_from_string()` and `ctx->proj_type()` dispatch coherent.

Given this is a mirror of upstream wiring, these additions look correct and low-risk. Based on learnings, please surface any upstream inconsistencies to ngxson if you notice them during integration.
Also applies to: 100-107, 145-157, 180-238
tools/mtmd/mtmd.cpp (2)
177-177: LGTM: Proper flash attention type derivation
This change correctly derives the CLIP flash attention type from the user-provided parameter instead of hardcoding `AUTO`, allowing proper control over flash attention behavior.
834-835: LGTM: DeepSeekOCR encoding path
Appropriately adds DeepSeekOCR to the condition for non-batched per-image encoding, mirroring the existing GLM behavior. This aligns with the broader DeepSeekOCR integration.
tools/mtmd/models/siglip.cpp (1)
46-46: LGTM: Consistent projection weight renaming
The change from `model.projection` to `model.fc_w` aligns with the broader refactor to standardize projection weight naming across the codebase. This is consistent with similar changes in glm4v.cpp and other model files.
src/llama-chat.h (1)
31-31: LGTM: DeepSeekOCR chat template enumeration
The new `LLM_CHAT_TEMPLATE_DEEPSEEK_OCR` enum value is appropriately positioned after the other DeepSeek variants and supports the broader DeepSeekOCR integration.
src/llama-graph.cpp (1)
1169-1172: LGTM: Correct tensor reported to callback
This fixes the callback to report the actually weighted tensor (`experts`) instead of the pre-weighted tensor (`cur`) when `weight_before_ffn` is false. This makes the callback consistent with line 1091, which reports `cur` when weights are applied before FFN.
tools/mtmd/tests/test-1-extracted.md (1)
1-85: LGTM: OCR test data with expected artifacts
This test data file provides realistic OCR output including structured markers and coordinates. The grammar warnings flagged by static analysis (e.g., "sky to sky" hyphenation at line 48) are likely intentional artifacts of real OCR output and should be preserved to ensure tests validate against realistic scenarios.
src/llama-arch.h (1)
74-74: LGTM: DeepSeekOCR architecture enumeration
The new `LLM_ARCH_DEEPSEEK2OCR` enum value is appropriately positioned after `LLM_ARCH_DEEPSEEK2`, following logical naming conventions and enabling OCR-specific architecture handling throughout the codebase.
tools/mtmd/models/glm4v.cpp (1)
98-98: LGTM: Consistent projection weight renaming
The change from `model.projection` to `model.fc_w` in the FC projector is consistent with the broader refactor to standardize projection weight naming across model implementations.
src/models/deepseek2.cpp (3)
7-7: LGTM: OCR flag initialization follows existing pattern.
The `is_ocr` flag is initialized consistently with the existing `is_lite` pattern.
85-88: Verify temperature scaling support for OCR path.
The OCR attention path doesn't apply `inp_attn_scale` (temperature scaling) before calling `build_attn`, unlike both the MLA path (lines 173-177) and non-MLA path (lines 211-215). If `hparams.f_attn_temp_scale` is set for an OCR model, this configuration would be silently ignored. Confirm whether OCR models intentionally don't support temperature scaling or if this should be added for consistency.
Based on learnings, please relay this concern to the upstream contributor.
79-83: Reconsider hardcoded freq_base assertion for OCR rope configuration.
The OCR path asserts `freq_base == 10000.0` and applies simplified rope parameters (freq_scale=1, ext_factor=0, attn_factor=1, beta_fast=0, beta_slow=0), while the non-OCR path uses dynamic rope scaling parameters. The assertion prevents invalid models from running but will fail if any DeepSeek-OCR variant uses a different freq_base. Consider using the model's `freq_base` variable instead of asserting a fixed value, or document why all DeepSeek-OCR models must use freq_base=10000.0. The simplified rope parameters appear intentional for fixed-token OCR (unlike the variable-length context extension in DeepSeek-V2), but this design choice deserves a code comment explaining the rationale.
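A minimal sketch of the "use the model's freq_base" option; the surrounding variable names (`ctx0`, `q_rope`, `inp_pos`, `n_rot`, `rope_type`, `n_ctx_orig`) follow common deepseek2 builder conventions and are assumptions here, not taken from the PR:

```cpp
// hypothetical: drop the hardcoded check and feed the model-provided value
// into the rope call, keeping the other OCR-simplified parameters as-is
// GGML_ASSERT(freq_base == 10000.0f);   // <- removed
q_rope = ggml_rope_ext(ctx0, q_rope, inp_pos, nullptr,
                       n_rot, rope_type, n_ctx_orig,
                       freq_base, /*freq_scale*/ 1.0f,
                       /*ext_factor*/ 0.0f, /*attn_factor*/ 1.0f,
                       /*beta_fast*/ 0.0f, /*beta_slow*/ 0.0f);
```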
src/llama-chat.cpp (2)
52-52: LGTM: Chat template mapping added correctly.
The "deepseek-ocr" mapping follows the existing pattern.
545-549: LGTM: Passthrough implementation is appropriate for OCR.
The implementation correctly concatenates message content without applying template formatting, which is appropriate for OCR use cases where the content is likely image data or pre-formatted text.
src/llama-arch.cpp (2)
src/llama-arch.cpp (2)
70-70: LGTM: Architecture name added correctly.The architecture name mapping follows the existing pattern and uses a descriptive name.
1466-1497: Verify tensor list completeness for OCR variant.The DEEPSEEK2OCR tensor list includes both simple attention tensors (
LLM_TENSOR_ATTN_Q,LLM_TENSOR_ATTN_K,LLM_TENSOR_ATTN_Vat lines 1474-1476) and MLA-style tensors (LLM_TENSOR_ATTN_Q_A,LLM_TENSOR_ATTN_Q_B,LLM_TENSOR_ATTN_KV_A_MQA, etc., at lines 1477-1483). The OCR implementation insrc/models/deepseek2.cpp(lines 68-70) only uses the simplewq/wk/wvtensors.Confirm whether:
- The MLA tensors are included for future flexibility or model variants.
- All tensors in this list are actually used/loaded for OCR models, or if some are unused.
Including unused tensors in the list doesn't break functionality but could cause unnecessary memory allocation attempts during model loading.
Based on learnings, please relay this concern to the upstream contributor.
src/llama-model.cpp (2)
1619-1663: DeepSeek2OCR hparams/type and RoPE classification look consistent with DeepSeek2
- Treating `LLM_ARCH_DEEPSEEK2OCR` as a sibling of `LLM_ARCH_DEEPSEEK2` in `load_hparams()` and adding `is_ocr` to gate `Q_LORA_RANK` / `KV_LORA_RANK` reads makes sense and avoids requiring absent keys for OCR GGUFs.
- The added `case 12: type = LLM_TYPE_3B;` in the DeepSeek2* block is reasonable for the 12‑layer 3B OCR LM and also harmless for any 12‑layer non‑OCR DeepSeek2 variants.
- Mapping `LLM_ARCH_DEEPSEEK2OCR` to `LLAMA_ROPE_TYPE_NORM` alongside `LLM_ARCH_DEEPSEEK2` aligns with the upstream "Fix RoPE type for DeepSeek‑OCR LM" intent and keeps RoPE handling consistent.

Also applies to: 7834-7835
4685-4713: is_ocr tensor layout and graph builder reuse for DeepSeek2OCR are coherent
- The new `is_ocr` flag in the `LLM_ARCH_DEEPSEEK2`/`DEEPSEEK2OCR` tensor‑creation block cleanly splits the OCR path: OCR layers build standard `{n_embd, n_embd}` Q/K/V/O and the usual dense/MoE FFN + shared‑expert tensors, then `continue`, so MLA‑specific and LoRA‑specific tensors are never expected for OCR.
- Shapes in the OCR path mirror the existing DeepSeek MoE layout (including `ffn_gate_inp`, `ffn_gate_exps`, `ffn_down_exps`, `ffn_up_exps`, and shared‑expert tensors), which should match the DeepSeek‑OCR GGUF weights as exported in the upstream PR.
- Routing both `LLM_ARCH_DEEPSEEK2` and `LLM_ARCH_DEEPSEEK2OCR` through `llm_build_deepseek2` in `build_graph()` keeps the execution path unified while still allowing the builder to branch based on which tensors were instantiated.

Also applies to: 4715-4742, 7482-7485
gguf-py/gguf/gguf_writer.py (1)
1115-1140: LGTM!
The new vision/SAM writer methods follow the established patterns in this file. They correctly use `add_uint32` for integer metadata and reference the appropriate keys from `Keys.ClipVision.WINDOW_SIZE` and `Keys.ClipVision.SAM.*` defined in `constants.py`.
gguf-py/gguf/tensor_mapping.py (2)
1528-1578: LGTM!
The SAM tensor mappings are comprehensive and follow the established pattern. The tensor paths (`model.sam_model.*`) align with the usage in deepseekocr.cpp.
1272-1274: The tensor name uses `model.view_seperator` (note: "seperator", not "separator"), which matches DeepSeek-OCR's actual model weights. Since this exact spelling appears identically in both `tensor_mapping.py` and `constants.py`, it is intentional and necessary for correct tensor lookup during model conversion. No change needed.
tools/mtmd/models/deepseekocr.cpp (5)
5-24: LGTM!
The `window_partition` function correctly implements sliding window partitioning with proper padding calculation and dimension reshaping. The implementation follows the referenced approach from Acly's comment.
28-51: LGTM!
The `window_unpartition` function correctly reverses the window partition operation, properly handling the removal of padding via `ggml_view_4d`.
53-89: LGTM!
The `get_rel_pos` function correctly handles relative position embedding extraction with optional bilinear interpolation for size mismatches. The assertions provide good input validation.
294-324: LGTM!
The final section correctly combines SAM and CLIP outputs, applies the projection layer, and adds the special tokens (image newline and view separator). The square dimension assumption aligns with the expected output format.
276-276: This review comment flagged a non-existent scope issue. The `n_embd` variable declared at line 98 in the SAM building block is properly accessible at line 276, which is within a nested conditional inside the same block. In C++, variables declared in an outer scope are accessible in all nested inner scopes.
tools/mtmd/tests/test-deepseek-ocr.py (1)
16-55: LGTM!
The `run_mtmd_deepseek_ocr` function properly handles subprocess execution with timeout, captures output, and provides error handling. The static analysis hint (S603) about untrusted input is a false positive in this test context where arguments come from controlled CLI defaults.
715-721: DeepSeek-OCR:language_config→text_configremap looks correctThe new
language_confighandling matches the released DeepSeek-OCRconfig.jsonstructure and ensuresTextModelsees the language side astext_config. This should let the DeepseekV2 text decoder reuse the existing DeepseekV2 path without special casing. I don’t see issues here.gguf-py/gguf/constants.py (1)
293-309: LGTM - DeepSeek-OCR constants are well-structured.The new architecture, tensor definitions, and mappings follow existing conventions consistently. The SAM-related keys and tensor names are properly organized under the ClipVision namespace, and the DEEPSEEK2OCR architecture mirrors DEEPSEEK2 appropriately for an OCR variant.
Also applies to: 413-413, 698-712, 823-823, 1107-1121, 2441-2472, 3317-3320, 3510-3510
tools/mtmd/clip-model.h (2)
65-119: LGTM - SAM hyperparameters and attention helper methods.The SAM-specific fields (
sam_n_layer,sam_n_head,sam_n_embd) follow existing patterns with sensible defaults. Theglobal_attn_indices()andis_global_attn()helper methods correctly implement SAM's global attention layer selection pattern.
166-168: LGTM - SAM layer and encoder fields are properly structured.The relative position tensors in
clip_layerand SAM encoder fields inclip_modelare correctly initialized and follow existing conventions. Thesam_layersvector appropriately mirrors the existinglayerspattern for the SAM encoder.Also applies to: 340-343, 354-356
tools/mtmd/clip.cpp (2)
4080-4082: LGTM!The
clip_is_deepseekocrpredicate function follows the established pattern of other projector type checks in this file.
839-843: LGTM - DEEPSEEKOCR graph builder integration.The new case for
PROJECTOR_TYPE_DEEPSEEKOCRfollows the established pattern for adding new projector types to the graph building switch statement.
```python
@ModelBase.register("DeepseekOCRForCausalLM")
class DeepseekOCRVisionModel(MmprojModel):
    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        hparams = self.hparams
        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
        # default values below are taken from HF tranformers code
        self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
        self.gguf_writer.add_vision_use_gelu(True)
        # calculate proj_scale_factor (used by tinygemma3 test model)
        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
        n_per_side = int(image_seq_length ** 0.5)
        image_size = self.hparams["image_size"]
        patch_size = self.hparams["patch_size"]
        proj_scale_factor = (image_size // patch_size) // n_per_side
        if proj_scale_factor > 0 and proj_scale_factor != 4:
            # we only need to write this if it's not the default value
            # in this case, we are converting a test model
            self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)
        # @bluebread: there's no window_size in config but just add it here anyway
        self.gguf_writer.add_vision_window_size(self.hparams.get("window_size", 14))

        # SAM configuration
        sam_hparams = hparams['sam']
        self.gguf_writer.add_vision_sam_layers_count(sam_hparams['layers'])
        self.gguf_writer.add_vision_sam_embedding_length(sam_hparams['width'])
        self.gguf_writer.add_vision_sam_head_count(sam_hparams['heads'])

    def get_vision_config(self) -> dict[str, Any]:
        vision_config: dict[str, Any] | None = self.global_config.get("vision_config")

        if not vision_config:
            raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")

        vision_config['sam'] = vision_config['width']['sam_vit_b']
        vision_config.update(vision_config['width']['clip-l-14-224'])
        vision_config['hidden_size'] = vision_config['width']
        vision_config['num_heads'] = vision_config['heads']
        vision_config['intermediate_size'] = vision_config['heads'] * 4

        return vision_config
```
🛠️ Refactor suggestion | 🟠 Major
DeepSeek-OCR: intermediate_size derivation in get_vision_config looks incorrect
get_vision_config() currently sets:
vision_config['hidden_size'] = vision_config['width']
vision_config['num_heads'] = vision_config['heads']
vision_config['intermediate_size'] = vision_config['heads'] * 4

For the released DeepSeek-OCR config, CLIP-L has width = 1024, heads = 16, and mlp_ratio ≈ 3.7362, so the MLP/FFN dim should be on the order of hidden_size * mlp_ratio (~3.8k), not heads * 4 (64). That gives nonsense FFN metadata for the vision encoder and may break consumers that rely on GGUF metadata instead of inferring from weights.
I’d strongly recommend computing intermediate_size from the CLIP config instead of heads * 4, e.g.:
Proposed fix for DeepSeek-OCR vision config
def get_vision_config(self) -> dict[str, Any]:
vision_config: dict[str, Any] | None = self.global_config.get("vision_config")
if not vision_config:
raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
- vision_config['sam'] = vision_config['width']['sam_vit_b']
- vision_config.update(vision_config['width']['clip-l-14-224'])
- vision_config['hidden_size'] = vision_config['width']
- vision_config['num_heads'] = vision_config['heads']
- vision_config['intermediate_size'] = vision_config['heads'] * 4
+ # Extract SAM and CLIP-L configs from nested vision_config["width"]
+ width_cfg = vision_config["width"]
+ sam_cfg = width_cfg["sam_vit_b"]
+ clip_cfg = width_cfg["clip-l-14-224"]
+
+ # Promote SAM config and CLIP-L config to top level
+ vision_config["sam"] = sam_cfg
+ vision_config.update(clip_cfg)
+
+ # Normalize fields expected by MmprojModel / gguf_writer
+ vision_config["hidden_size"] = clip_cfg["width"]
+ vision_config["num_heads"] = clip_cfg["heads"]
+
+ mlp_ratio = vision_config.get("mlp_ratio")
+ if mlp_ratio is not None:
+ vision_config["intermediate_size"] = int(round(clip_cfg["width"] * mlp_ratio))
+ else:
+ # Fallback: standard ViT MLP ratio
+ vision_config["intermediate_size"] = clip_cfg["width"] * 4
return vision_configStatic-analysis nits in this class
- `tensor_force_quant(self, name, new_name, bid, n_dims)` doesn't use `new_name`, `bid`, or `n_dims`.
- `modify_tensors(self, data_torch, name, bid)` doesn't use `bid`.
To keep the signature compatible with the base class while satisfying linters, consider explicitly deleting the unused arguments:
Suggested cleanups for unused parameters
def tensor_force_quant(self, name, new_name, bid, n_dims):
- if ".embeddings." in name or 'pos_embed' in name:
+ del new_name, bid, n_dims # unused, required by base signature
+ if ".embeddings." in name or 'pos_embed' in name:
return gguf.GGMLQuantizationType.F32
if ".rel_pos_h" in name or '.rel_pos_w' in name:
return gguf.GGMLQuantizationType.F32
return gguf.GGMLQuantizationType.F16
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
- # Only process vision-related tensors, skip language model tensors
+ del bid # unused
+ # Only process vision-related tensors, skip language model tensors🧰 Tools
🪛 Ruff (0.14.10)
5998-5998: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 5966 to 6006, the
DeepseekOCRVisionModel.get_vision_config sets intermediate_size incorrectly as
heads * 4 and leaves some method parameters unused; change intermediate_size to
compute from the model hidden size and an mlp_ratio (e.g. intermediate_size =
int(vision_config.get("mlp_ratio", 4.0) * vision_config["hidden_size"])) so it
matches released CLIP configs, ensure hidden_size and num_heads are set from the
correct keys before computing mlp_ratio, and fix unused-parameter lint warnings
in tensor_force_quant and modify_tensors by either deleting the unused args
inside the methods (e.g. del new_name, del bid, del n_dims) or renaming them to
_new_name, _bid, _n_dims to keep signatures compatible.
```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    hparams: dict = ModelBase.load_hparams(self.dir_model, is_mistral_format=False)
    self.origin_hf_arch = hparams.get('architectures', [None])[0]

    if self.origin_hf_arch == "DeepseekOCRForCausalLM":
        self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
        self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
        self.gguf_writer.add_architecture()
```
DeepseekV2 OCR path: minor rope metadata duplication; rest looks good
The OCR-specific wiring in DeepseekV2Model looks consistent with the DeepSeek-OCR config:
- `__init__` detecting `origin_hf_arch == "DeepseekOCRForCausalLM"` and switching `model_arch` to `DEEPSEEK2OCR` (plus updating `gguf_writer.arch` and adding the architecture) matches `config.json` and is necessary so OCR decoders are distinguishable from standard DeepSeek-V2.
- In `set_gguf_parameters`:
  - For OCR (`is_ocr`), you avoid forcing `num_key_value_heads = 1` and skip MLA key/value fields, which is correct given `use_mla: false` and the `qk_*` dims being zero in the language config.
  - For non-OCR, the MLA-specific `kv_lora_rank` defaults and `add_key_length*_mla` / `add_value_length*_mla` calls look aligned with existing DeepSeek-V2 handling.
  - New `routed_scaling_factor` and `norm_topk_prob` defaults wired into `add_expert_weights_scale` / `add_expert_weights_norm` make sense.

One small issue: for non-OCR models, `add_rope_dimension_count(hparams["qk_rope_head_dim"])` is now called twice — once inside the `if not is_ocr:` block and again unconditionally at the end of the method. It's harmless but redundant and will produce duplicate GGUF fields.
Proposed deduplication of `add_rope_dimension_count`
def set_gguf_parameters(self):
is_ocr = (self.model_arch == gguf.MODEL_ARCH.DEEPSEEK2OCR)
@@
# note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA
- if not is_ocr:
- self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"])
- self.gguf_writer.add_value_length(kv_lora_rank)
- self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
- self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
- self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])
+ if not is_ocr:
+ self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"])
+ self.gguf_writer.add_value_length(kv_lora_rank)
+ self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"])
+ self.gguf_writer.add_value_length_mla(hparams["v_head_dim"])
@@
- self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
- self.gguf_writer.add_expert_weights_scale(routed_scaling_factor)
- self.gguf_writer.add_expert_weights_norm(norm_topk_prob)
-
- self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])
+ self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
+ self.gguf_writer.add_expert_weights_scale(routed_scaling_factor)
+ self.gguf_writer.add_expert_weights_norm(norm_topk_prob)
+
+ # Common RoPE dim for both OCR and non-OCR variants
+ self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])The additions in modify_tensors to also skip sam_model, image_newline, and view_seperator tensors look correct for DeepSeek-OCR: those should be handled by the mmproj/vision stream rather than the text model.
Also applies to: 7259-7274, 7279-7288, 7292-7295, 7307-7313
```python
MODEL_TENSOR.V_SAM_NET_2:      "v.sam.net_2",
MODEL_TENSOR.V_SAM_NET_3:      "v.sam.net_3",
MODEL_TENSOR.V_ENC_EMBD_IMGNL: "v.image_newline",   # Deepseek-OCR
MODEL_TENSOR.V_ENC_EMBD_VSEP:  "v.view_seperator",  # Deepseek-OCR
```
Typo: "seperator" should be "separator".
The string "v.view_seperator" contains a spelling error. This should be "v.view_separator" for correctness.
Note: This typo also appears in tools/mtmd/clip-model.h at line 244 (view_seperator). Both should be corrected together to maintain consistency.
🔎 Proposed fix
- MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_seperator", # Deepseek-OCR
+ MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_separator", # Deepseek-OCR🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 1122 (and also in
tools/mtmd/clip-model.h at line 244), there is a spelling mistake
"v.view_seperator" that should be "v.view_separator"; update both occurrences to
the correct spelling to keep the keys consistent across code and any serialized
formats (perform a targeted rename in both files and run any tests/linters to
ensure no other references remain).
```cpp
ggml_tensor * mm_2_b = nullptr;

ggml_tensor * image_newline  = nullptr;
ggml_tensor * view_seperator = nullptr;
```
Typo: "view_seperator" should be "view_separator".
This matches the typo flagged in gguf-py/gguf/constants.py. Both should be corrected together.
🔎 Proposed fix
- ggml_tensor * view_seperator = nullptr;
+ ggml_tensor * view_separator = nullptr;Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/mtmd/clip-model.h around line 244, the variable name "view_seperator"
contains a typo; rename it to "view_separator" and update all references in this
file to the new name, ensuring declarations, initializations, and any
uses/assignments (including nullptr) are consistently changed; also update the
corresponding typo in gguf-py/gguf/constants.py in the same commit so both
places match.
```cpp
ggml_tensor * neck_0_w;
ggml_tensor * neck_1_w;
ggml_tensor * neck_1_b;
ggml_tensor * neck_2_w;
ggml_tensor * neck_3_w;
ggml_tensor * neck_3_b;
ggml_tensor * net_2;
ggml_tensor * net_3;
```
Missing nullptr initialization for tensor pointers.
These tensor pointers are not initialized to nullptr, unlike most other tensor pointers in this struct. This inconsistency could lead to undefined behavior if the pointers are checked or accessed before being explicitly set.
🔎 Proposed fix
- ggml_tensor * neck_0_w;
- ggml_tensor * neck_1_w;
- ggml_tensor * neck_1_b;
- ggml_tensor * neck_2_w;
- ggml_tensor * neck_3_w;
- ggml_tensor * neck_3_b;
- ggml_tensor * net_2;
- ggml_tensor * net_3;
+ ggml_tensor * neck_0_w = nullptr;
+ ggml_tensor * neck_1_w = nullptr;
+ ggml_tensor * neck_1_b = nullptr;
+ ggml_tensor * neck_2_w = nullptr;
+ ggml_tensor * neck_3_w = nullptr;
+ ggml_tensor * neck_3_b = nullptr;
+ ggml_tensor * net_2 = nullptr;
+ ggml_tensor * net_3 = nullptr;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ggml_tensor * neck_0_w; | |
| ggml_tensor * neck_1_w; | |
| ggml_tensor * neck_1_b; | |
| ggml_tensor * neck_2_w; | |
| ggml_tensor * neck_3_w; | |
| ggml_tensor * neck_3_b; | |
| ggml_tensor * net_2; | |
| ggml_tensor * net_3; | |
| ggml_tensor * neck_0_w = nullptr; | |
| ggml_tensor * neck_1_w = nullptr; | |
| ggml_tensor * neck_1_b = nullptr; | |
| ggml_tensor * neck_2_w = nullptr; | |
| ggml_tensor * neck_3_w = nullptr; | |
| ggml_tensor * neck_3_b = nullptr; | |
| ggml_tensor * net_2 = nullptr; | |
| ggml_tensor * net_3 = nullptr; |
🤖 Prompt for AI Agents
In tools/mtmd/clip-model.h around lines 345 to 352, the tensor pointer fields
(neck_0_w, neck_1_w, neck_1_b, neck_2_w, neck_3_w, neck_3_b, net_2, net_3) are
not initialized like other tensor pointers; initialize each of these pointers to
nullptr at declaration to match the struct's initialization pattern and avoid
undefined behavior, and also verify whether net_2/net_3 are typos (should they
be named neck_2/neck_3) and rename them if appropriate.
```cpp
model.sam_layers.resize(model.n_sam_layers);
for (int il = 0; il < model.n_sam_layers; ++il) {
    auto & layer = model.sam_layers[il];
    layer.qkv_w     = get_tensor(string_format(TN_SAM_ATTN_QKV, il, "weight"));
    layer.qkv_b     = get_tensor(string_format(TN_SAM_ATTN_QKV, il, "bias"));
    layer.o_w       = get_tensor(string_format(TN_SAM_ATTN_OUT, il, "weight"));
    layer.o_b       = get_tensor(string_format(TN_SAM_ATTN_OUT, il, "bias"));
    layer.ln_1_w    = get_tensor(string_format(TN_SAM_PRE_NORM, il, "weight"));
    layer.ln_1_b    = get_tensor(string_format(TN_SAM_PRE_NORM, il, "bias"));
    layer.ln_2_w    = get_tensor(string_format(TN_SAM_POST_NORM, il, "weight"));
    layer.ln_2_b    = get_tensor(string_format(TN_SAM_POST_NORM, il, "bias"));
    layer.rel_pos_h = get_tensor(string_format(TN_SAM_ATTN_POS_H, il));
    layer.rel_pos_w = get_tensor(string_format(TN_SAM_ATTN_POS_W, il));
    layer.ff_up_w   = get_tensor(string_format(TN_SAM_FFN_UP, il, "weight"));
    layer.ff_up_b   = get_tensor(string_format(TN_SAM_FFN_UP, il, "bias"));
    layer.ff_down_w = get_tensor(string_format(TN_SAM_FFN_DOWN, il, "weight"));
    layer.ff_down_b = get_tensor(string_format(TN_SAM_FFN_DOWN, il, "bias"));
}
model.neck_0_w = get_tensor(string_format(TN_SAM_NECK, 0, "weight"));
model.neck_1_b = get_tensor(string_format(TN_SAM_NECK, 1, "bias"));
model.neck_1_w = get_tensor(string_format(TN_SAM_NECK, 1, "weight"));
model.neck_2_w = get_tensor(string_format(TN_SAM_NECK, 2, "weight"));
model.neck_3_b = get_tensor(string_format(TN_SAM_NECK, 3, "bias"));
model.neck_3_w = get_tensor(string_format(TN_SAM_NECK, 3, "weight"));
model.net_2 = get_tensor(string_format(TN_SAM_NET, 2, "weight"));
model.net_3 = get_tensor(string_format(TN_SAM_NET, 3, "weight"));
model.image_newline = get_tensor(TN_IMAGE_NEWLINE);
model.view_seperator = get_tensor(TN_IMAGE_SEPERATOR);
model.fc_w = get_tensor(string_format(TN_MM_PROJECTOR, "weight"));
model.fc_b = get_tensor(string_format(TN_MM_PROJECTOR, "bias"));
} break;
```
🧩 Analysis chain
🏁 Script executed:
# Search for n_sam_layers in the codebase
rg -n "n_sam_layers" tools/mtmd/clip.cppRepository: ngxson/llama.cpp
Length of output: 204
🏁 Script executed:
# Also search for where n_sam_layers is assigned more broadly
rg -n "n_sam_layers\s*=" --type cppRepository: ngxson/llama.cpp
Length of output: 154
🏁 Script executed:
# Check the hparams loading section mentioned at lines 1200-1210
sed -n '1200,1210p' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 633
🏁 Script executed:
# Examine lines around 1650 and 1679 to see context
sed -n '1640,1680p' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 3451
🏁 Script executed:
# Search for where model.n_sam_layers is assigned in the entire codebase
rg -n "\.n_sam_layers\s*=" tools/mtmd/ --type cppRepository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Also check clip-model.h to see the complete struct definition and any initialization
grep -n "n_sam_layers" tools/mtmd/clip-model.h -B 2 -A 2Repository: ngxson/llama.cpp
Length of output: 213
🏁 Script executed:
# Look for hparams.sam_n_layer usage to see where it's assigned to model
rg -n "hparams\.sam_n_layer" tools/mtmd/ --type cppRepository: ngxson/llama.cpp
Length of output: 247
🏁 Script executed:
# Check exact indentation of line 1679 by looking at raw bytes
sed -n '1679p' tools/mtmd/clip.cpp | od -c | head -20
Repository: ngxson/llama.cpp
Length of output: 100
🏁 Script executed:
# Check indentation with cat -A to show all whitespace
sed -n '1679p' tools/mtmd/clip.cpp | cat -A
Repository: ngxson/llama.cpp
Length of output: 86
🏁 Script executed:
# Check if hparams.sam_n_layer is ever assigned to model.n_sam_layers
rg -n "model\.n_sam_layers\s*=" tools/mtmd/ --type cpp -A 2 -B 2Repository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Let's look at the broader context around line 1679 to understand the structure better
sed -n '1670,1680p' tools/mtmd/clip.cpp | cat -n
Repository: ngxson/llama.cpp
Length of output: 1017
🏁 Script executed:
# Check how other projector types handle similar assignments to understand the pattern
grep -n "hparams\." tools/mtmd/clip.cpp | grep "model\." -A 1 | head -30Repository: ngxson/llama.cpp
Length of output: 1826
🏁 Script executed:
# Look for where hparams values are transferred to model for other projector types
sed -n '1600,1660p' tools/mtmd/clip.cpp | grep -E "(hparams|model\.)" | head -20Repository: ngxson/llama.cpp
Length of output: 1908
🏁 Script executed:
# Check the exact indentation context - view more lines around 1679
sed -n '1675,1685p' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 766
Assign hparams.sam_n_layer to model.n_sam_layers for DEEPSEEKOCR projector type.
The hparams loading at line 1206 reads sam_n_layer from the GGUF file, but this value is never assigned to model.n_sam_layers. The code at line 1650 then uses model.n_sam_layers (hardcoded default of 12) instead of the actual value from hparams, causing the loop to always resize to 12 layers regardless of what's in the model file.
Also, line 1679 has inconsistent indentation (extra leading space before the closing brace).
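A minimal sketch of the missing assignment, to be placed in the DEEPSEEKOCR hparams-loading case of `clip.cpp` (exact placement is assumed):

```cpp
// hypothetical: propagate the GGUF value to the model so the SAM layer loop
// resizes to the real layer count instead of the hardcoded default of 12
model.n_sam_layers = hparams.sam_n_layer;
```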
```sh
# Check if the second argument is "flash", then enable flash attention
# This is useful to test if flash attention off works correctly
FLASH_ATTN="on"
if [ "${2:-}" = "flash_off" ] || [ "${1:-}" = "flash_off" ]; then
    FLASH_ATTN="off"
    echo "Flash attention disabled..."
fi
```
Comment is inconsistent with implementation.
The comment on line 31 states "Check if the second argument is 'flash'" but the code on line 34 checks both the first AND second arguments. This inconsistency could confuse future maintainers.
🔎 Proposed fix to align comment with implementation
-# Check if the second argument is "flash", then enable flash attention
+# Check if either argument is "flash_off", then disable flash attention
# This is useful to test if flash attention off works correctly🤖 Prompt for AI Agents
In tools/mtmd/tests.sh around lines 31 to 37, the comment incorrectly says it
checks only the second argument but the code checks both first and second
arguments for "flash_off"; update the comment to accurately describe that both
the first and second positional arguments are examined (or adjust the
conditional to match the original intent), and keep wording concise e.g. "Check
if the first or second argument is 'flash_off', then disable flash attention" so
the comment and implementation align.
| add_test_vision "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M" | ||
| # add_test_vision "ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M" # does not work on my mac M3 Ultra | ||
| # add_test_vision "ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF:Q4_K_M" # not always working | ||
| add_test_vision "sabafallah/DeepSeek-OCR-GGUF:q8_0" -p "Free OCR." --chat-template deepseek-ocr |
🧩 Analysis chain
🏁 Script executed:
# Check if the model with lowercase q8_0 exists via HuggingFace API
curl -s "https://huggingface.co/api/models/sabafallah/DeepSeek-OCR-GGUF" | jq -r '.siblings[].rn' 2>/dev/null | head -20Repository: ngxson/llama.cpp
Length of output: 99
🏁 Script executed:
# Search for deepseek-ocr chat template in the codebase
fd -e sh -e py -e json | xargs grep -l "deepseek-ocr" 2>/dev/null | head -20
Repository: ngxson/llama.cpp
Length of output: 117
🏁 Script executed:
# Find and check test validation logic in tests.sh around the OCR tests
sed -n '169,175p' tools/mtmd/tests.sh
Repository: ngxson/llama.cpp
Length of output: 315
🏁 Script executed:
# Check all quantization formats used in tests.sh to confirm the pattern
grep -o ":[Qq][0-9]_[0-9]" tools/mtmd/tests.sh | sort | uniq -cRepository: ngxson/llama.cpp
Length of output: 101
Use uppercase Q8_0 for consistency with other quantization formats.
The DeepSeek-OCR model uses lowercase q8_0 while all other models in the test suite use uppercase quantization formats (Q8_0, Q4_0). Repository indexing or model matching may be case-sensitive, causing consistency issues. Change to uppercase Q8_0 to match the pattern used throughout the test suite.
Since this is a mirror PR, please relay this formatting concern to the upstream contributor.
🤖 Prompt for AI Agents
In tools/mtmd/tests.sh around line 111, the model tag uses lowercase "q8_0"
while the test suite standard is uppercase quantization labels; update the tag
to "Q8_0" for consistency (replace "sabafallah/DeepSeek-OCR-GGUF:q8_0" with
"sabafallah/DeepSeek-OCR-GGUF:Q8_0") and also add a short note in the PR
description requesting the upstream contributor to use uppercase quantization
labels in their model naming to avoid case-sensitive matching issues.
| print(f"Result: {'PASS' if free_ocr_similarity >= args.threshold else 'FAIL'}") | ||
| print("=" * 60) | ||
|
|
||
| # Markdown OCR results | ||
| print(f"\nReference Model Markdown output:\n{'-' * 40}") | ||
| print(expected_md_ocr) | ||
| print(f"\nDeepSeek-OCR Markdown output:\n{'-' * 40}") | ||
| print(llama_md_ocr) | ||
| print(f"\n{'=' * 60}") | ||
| print(f"Cosine Similarity (Markdown): {md_ocr_similarity:.4f}") | ||
| print(f"Threshold: {args.threshold}") | ||
| print(f"Result: {'PASS' if md_ocr_similarity >= args.threshold else 'FAIL'}") | ||
| print("=" * 60) |
Test doesn't exit with non-zero code on failure.
For CI integration, the script should exit with a non-zero code when similarity falls below the threshold. Currently, it prints "FAIL" but exits successfully.
🔎 Proposed fix
print(f"Result: {'PASS' if md_ocr_similarity >= args.threshold else 'FAIL'}")
print("=" * 60)
+
+ # Exit with non-zero code if any test fails
+ if free_ocr_similarity < args.threshold or md_ocr_similarity < args.threshold:
+ sys.exit(1)
if __name__ == "__main__":
main()Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tools/mtmd/tests/test-deepseek-ocr.py around lines 170 to 182, the script
prints "PASS"/"FAIL" but always exits with code 0; update the end of this
section to exit non‑zero on failure by evaluating both free_ocr_similarity and
md_ocr_similarity against args.threshold and calling sys.exit(1) if either is
below the threshold (sys.exit(0) or no explicit exit if both pass). Ensure sys
is imported at the top of the file if not already, and place the exit call after
the printed summary so CI receives a failing exit code when similarity checks
fail.
Mirror from upstream PR: ggml-org#17400
Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests