[Mirror] mtmd: Add DeepSeekOCR Support #66

ngxson · 2025-12-23T00:01:51Z

Mirror from upstream PR: ggml-org#17400

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

Release Notes

New Features
- Added DeepSeek-OCR model architecture with vision and optical character recognition capabilities
- Expanded vision/image processing support with SAM (Segment Anything Model) components and advanced tensor handling
- Added new vision-related parameters for enhanced image encoding and processing
Bug Fixes
- Added validation for tensor types in interpolation and attention operations
- Added error handling for unrecognized image scaling modes
Tests
- Added test cases and examples for DeepSeek-OCR model functionality

_{✏️ Tip: You can customize this high-level summary in your review settings.}

init commit

mtmd: fix vision model processing

…f/deepseek-ocr

testing Vision model loading

mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)

…ut in deepseek2 model

…f/deepseek-ocr

…e image decoding fails

…rge_#17965 # Conflicts: # src/llama-kv-cache.cpp # tools/mtmd/clip.cpp

Merged with PR ggml-org#17965

python test script for deepseek-ocr testing OCR on text-1.jpeg newspaper image checking against expected reference model output for Free-OCR and Markdown

# Conflicts: # convert_hf_to_gguf.py

# Conflicts: # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # tools/mtmd/clip-impl.h # tools/mtmd/clip.cpp # tools/mtmd/models/models.h

# Conflicts: # src/llama-arch.cpp

- added GLM-4.6V to big tests - added missing deps for python test

coderabbitai · 2025-12-23T00:02:11Z

Walkthrough

Introduces comprehensive DeepSeek-OCR model support across the inference engine and tooling. Adds architecture definitions, model conversion logic with vision model handler, GGUF serialization support, CLIP vision integration with SAM attention mechanism, chat template, and testing infrastructure for OCR-capable vision models.

Changes

Cohort / File(s)	Summary
DeepSeek-OCR Architecture Definitions `src/llama-arch.h`, `src/llama-arch.cpp`, `src/llama-chat.h`, `src/llama-chat.cpp`	Added `LLM_ARCH_DEEPSEEK2OCR` enum value and architecture name mapping to "deepseek2-ocr"; added `LLM_CHAT_TEMPLATE_DEEPSEEK_OCR` chat template with no-template passthrough implementation
GGUF Constants and Serialization `gguf-py/gguf/constants.py`, `gguf-py/gguf/gguf_writer.py`	Added `MODEL_ARCH.DEEPSEEK2OCR`, new tensor identifiers for SAM components (V_SAM_POS_EMBD, V_SAM_PATCH_EMBD, V_SAM_ATTN_, V_SAM_MLP_, V_SAM_NECK, V_SAM_NET_*), ClipVision configuration keys (WINDOW_SIZE, SAM.BLOCK_COUNT, SAM.EMBEDDING_LENGTH, SAM.HEAD_COUNT), and writer methods for vision metadata
Tensor Mapping and Conversion `gguf-py/gguf/tensor_mapping.py`, `convert_hf_to_gguf.py`	Expanded tensor name mappings for DeepSeek-OCR SAM and vision components; introduced `DeepseekOCRVisionModel` class with vision config, tensor quantization, and specialized tensor mapping logic
DeepSeek2 Model Inference `src/llama-model.cpp`, `src/models/deepseek2.cpp`	Added OCR-variant branching (is_ocr flag) to skip LoRA/MLA weights for OCR models; introduced OCR-specific attention path using Neox-style RoPE in deepseek2.cpp
GGML Core Operations `ggml/src/ggml-cuda/upscale.cu`, `ggml/src/ggml.c`	Added abort fallback for unrecognized upscale mode; added type assertions for upscaling (F32) and flash attention mask (F16)
CLIP Vision Integration (MTMD) `tools/mtmd/clip-impl.h`, `tools/mtmd/clip-model.h`, `tools/mtmd/clip.h`, `tools/mtmd/clip.cpp`	Added SAM-related configuration keys (KEY_SAM_N_*), projector type `PROJECTOR_TYPE_DEEPSEEKOCR`, structure fields for SAM layers and positional embeddings, image preprocessing with bicubic resampling, batch encoding with SAM attention and relative position handling, and clip_is_deepseekocr predicate
DeepSeek-OCR Graph Construction `tools/mtmd/models/models.h`, `tools/mtmd/models/deepseekocr.cpp`	Added `clip_graph_deepseekocr` class with windowed attention, relative position embeddings, SAM layer processing, and DS-OCR CLIP pipeline construction
MTMD Tool Updates `tools/mtmd/CMakeLists.txt`, `tools/mtmd/mtmd.cpp`, `tools/mtmd/models/glm4v.cpp`, `tools/mtmd/models/siglip.cpp`	Added deepseekocr.cpp to build sources; updated flash attention type derivation; extended encoding to use non-batched path for DeepSeekOCR; refactored projection parameter from projection to fc_w in GLM4V and SigLIP
Utility and Auxiliary Support `src/llama-vocab.cpp`, `src/llama-graph.cpp`, `src/llama-kv-cache.cpp`	Added EOG token recognition for "<\|end▁of▁sentence\|>"; updated MOE FFN callback to pass weighted tensor; added commented-out YaRN RoPE override logic
Testing and Data `tools/mtmd/tests.sh`, `tools/mtmd/tests/tests-requirements.txt`, `tools/mtmd/tests/test-deepseek-ocr.py`, `tools/mtmd/tests/test-1-extracted.md`, `tools/mtmd/tests/test-1-extracted.txt`	Added flash attention toggle in test script, new DeepSeek-OCR test cases, Python test script with HuggingFace reference comparison via embedding similarity, test data files, and new Python dependencies (sentence-transformers, transformers, torch, torchvision)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The changes introduce a new model family across heterogeneous subsystems (conversion tooling, inference engine, vision models) with multiple interacting components: DeepseekOCRVisionModel for conversion, OCR-specific tensor handling and LoRA/MLA skipping in model loading, SAM windowed attention mechanism with relative position embeddings, image preprocessing pipelines with bicubic resampling, and extensive GGUF serialization support. While individual file changes are often straightforward, the cumulative scope spans vision model integration, attention mechanism variants, and architecture-specific branching in multiple critical paths, requiring careful verification of tensor mapping consistency, attention computation correctness, and image preprocessing alignment with reference implementations.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.57% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	❓ Inconclusive	The PR description is minimal, providing only a reference to the upstream PR URL and a note about using a 'Mirror PR' preset. It lacks substantive details about changes, testing, and context required by the template.	Expand the description to include a summary of the key changes introduced (e.g., DeepSeekOCR model support, SAM integration, vision processing), testing details, and any migration notes. Even for mirror PRs, a brief summary aids downstream reviewers.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding DeepSeekOCR support to the mtmd codepath, which is the primary objective of this mirror PR.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (14)

ggml/src/ggml-cuda/upscale.cu (1)
292-293: Good defensive programming; consider a more descriptive error message.

The fallback abort is appropriate and prevents undefined behavior when an unrecognized scale mode is encountered.

However, the error message could be more helpful for debugging by including the actual mode value.
🔎 Suggested improvement for error message
     } else {
-        GGML_ABORT("fatal error");
+        GGML_ABORT("unsupported upscale mode: %d", mode);
     }
ggml/src/ggml.c (2)

4906-4920: New F32-only precondition in ggml_interpolate_impl may reject non‑F32 callers

GGML_ASSERT(a->type == GGML_TYPE_F32); tightens the contract for all ggml_upscale* / ggml_interpolate* paths to F32 input. That is consistent with most existing kernels (which are F32‑centric), but it will now abort in debug builds if any existing or third‑party code feeds F16/BF16 tensors here.

Given this is a mirrored upstream change, it’s probably fine, but it’s worth double‑checking that:

No current llama.cpp callsites rely on non‑F32 interpolate/upscale.

External bindings (Rust, Python, etc.) aren’t exposing ggml_interpolate against F16/BF16 tensors.

If any such usage exists, consider either inserting an explicit cast to F32 at callsites or relaxing the assert (e.g., allowing F16/BF16 and handling conversion in the backend).

Based on learnings, this should be relayed upstream if it breaks callers.

5256-5269: Flash-attn mask is now hard‑required to be F16 – confirm all callsites comply

The new GGML_ASSERT(mask->type == GGML_TYPE_F16); in ggml_flash_attn_ext aligns with CUDA/Metal flash‑attn kernels, which already require F16 masks, but it’s a behavior change at the graph‑building layer:

Previously, masks could be F32 in some codepaths (and are still allowed as F16/F32 for generic softmax masks).

With this assert, any remaining F32 masks will now abort early instead of failing deeper in the backend.

Please verify that:

All internal llama.cpp users of ggml_flash_attn_ext now construct F16 masks (especially any older text‑only attention paths and non‑CUDA backends).

External bindings or downstream forks are not passing F32 masks.

If there is still mixed usage, one option would be to accept F32 masks here and explicitly cast them to F16 before invoking the flash‑attn backend; otherwise this assert is a good guard but may be a subtle breaking change for some integrations.

As per coding guidelines for mirrored PRs, this is likely something to bounce to the upstream authors if you see failures.

src/llama-kv-cache.cpp (1)

1403-1409: Commented YaRN attn_factor override is inert and safe

The added DeepSeek2/DEEPSEEK2OCR-specific YaRN attn_factor snippet is fully commented out and only documents a potential future override; it does not affect current behavior of build_rope_shift(). Keeping it as reference is fine in this mirrored PR.
tools/mtmd/clip-impl.h (1)
481-505: Harden debug helpers against null sources and unexpected tensor types

The new debugging utilities are useful, but there are a couple of edge cases worth guarding:

print_tensor_info assumes t->src[0] is always non-null; if it’s ever called on a leaf tensor, this will dereference a null pointer.

print_tensor_sum, print_tensor_data, and save_tensor_to_file GGML_ABORT on non-{F16,F32,I{8,16,32}} types; accidentally calling them on quantized tensors (common in this codebase) would crash the process, which is a bit harsh for diagnostics.

A small defensive tweak keeps behavior robust while preserving current usage.
Proposed defensive changes for debug helpers
 static std::string to_ne_string(const ggml_tensor * t) {
@@
-static void print_tensor_info(ggml_tensor * t) {
-    const struct ggml_tensor * src0 = t->src[0];
-    const struct ggml_tensor * src1 = t->src[1];
+static void print_tensor_info(ggml_tensor * t) {
+    const struct ggml_tensor * src0 = t->src[0];
+    const struct ggml_tensor * src1 = t->src[1];
+
+    if (!src0) {
+        printf("%s: %s (no src0)\n", t->name, ggml_type_name(t->type));
+        return;
+    }
@@
-    printf("%s: %s = %s(%s{%s}, %s)\n",
-        t->name, ggml_type_name(t->type), ggml_op_desc(t),
-        src0->name, to_ne_string(src0).c_str(),
-        src1 ? src1_str : "");
+    printf("%s: %s = %s(%s{%s}, %s)\n",
+        t->name, ggml_type_name(t->type), ggml_op_desc(t),
+        src0->name, to_ne_string(src0).c_str(),
+        src1 ? src1_str : "");
 }
@@
-    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
+    for (int64_t i3 = 0; i3 < ne[3]; i3++) {
@@
-                    } else {
-                        GGML_ABORT("fatal error");
-                    }
+                    } else {
+                        // unsupported debug type, skip rather than abort
+                        continue;
+                    }
@@
-    printf("%s.sum = %.6f\n", t->name, sum);
+    printf("%s.sum = %.6f\n", t->name, sum);
 }
@@
-                    } else {
-                        GGML_ABORT("fatal error");
-                    }
+                    } else {
+                        // unsupported debug type, skip rather than abort
+                        continue;
+                    }
@@
-    const uint8_t * data = (data_ptr) ? data_ptr : (uint8_t *) tensor->data;
+    const uint8_t * data = (data_ptr) ? data_ptr : (const uint8_t *) tensor->data;
@@
-                    } else {
-                        GGML_ABORT("fatal error");
-                    }
+                    } else {
+                        // unsupported debug type, skip rather than abort
+                        continue;
+                    }
If you prefer to stay byte-for-byte with upstream in this mirror PR, consider at least the null-check for src0 and revisiting the GGML_ABORT paths the next time these helpers are touched. Based on learnings, this can then be relayed back upstream.

Also applies to: 518-549, 551-604, 606-719
tools/mtmd/tests/tests-requirements.txt (1)

1-5: Consider pinning dependency versions for reproducible tests

The added dependencies are appropriate for OCR testing with embedding similarity evaluation. However, consider pinning versions (e.g., sentence-transformers>=2.0.0,<3.0.0) to ensure reproducible test environments across different setups and prevent unexpected breakage from upstream changes.

src/llama-model.cpp (1)

6951-6962: Optional: consider clarifying print_info for OCR/lite DeepSeek2 variants

Currently print_info() logs n_lora_q and n_lora_kv for both LLM_ARCH_DEEPSEEK2 and LLM_ARCH_DEEPSEEK2OCR, even though OCR and “lite” variants intentionally skip loading some or all of these ranks (they’ll stay at 0). That’s correct but slightly opaque; if you want to be extra clear for users debugging models, you could either skip logging them for OCR, or add a brief comment/note that zero here is expected for OCR and lite configs.
tools/mtmd/models/deepseekocr.cpp (1)
98-102: Consider adding an assertion for head dimension divisibility.

The calculation d_heads = n_embd / n_heads assumes even divisibility. While this is typically guaranteed by model configuration, an assertion could catch misconfiguration early.
🔎 Optional defensive assertion
         const int n_embd  = hparams.sam_n_embd;
         const int n_layer = hparams.sam_n_layer;
         const int n_heads = hparams.sam_n_head;
+        GGML_ASSERT(n_embd % n_heads == 0 && "n_embd must be divisible by n_heads");
         const int d_heads = n_embd / n_heads;
         const int window  = hparams.attn_window_size;
tools/mtmd/tests/test-deepseek-ocr.py (1)

126-126: Misleading step numbering: "[2/3]" without a "[1/3]".

The print statement indicates step 2 of 3, but there's no "[1/3]" step visible. Consider either adding a "[1/3] Validating paths..." step or adjusting the numbering.
convert_hf_to_gguf.py (1)
1698-1698: Annotate MmprojModel.n_block_keys as ClassVar (RUF012)

n_block_keys is a mutable class attribute and Ruff is flagging it. Declaring it as a ClassVar (or making it an immutable tuple) will address the lint and better reflect intent.
Proposed minimal change
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
@@
 class MmprojModel(ModelBase):
@@
-    n_block_keys = ["n_layers", "num_hidden_layers", "n_layer", "num_layers", "depth", "layers", "encoder_layers"]
+    n_block_keys: ClassVar[list[str]] = [
+        "n_layers",
+        "num_hidden_layers",
+        "n_layer",
+        "num_layers",
+        "depth",
+        "layers",
+        "encoder_layers",
+    ]
tools/mtmd/clip.cpp (4)
1200-1210: Minor formatting: inconsistent indentation at closing brace.

Line 1210 has an extra leading space before } break; compared to other case blocks in this switch statement.
Proposed fix
                     get_u32(KEY_ATTN_WINDOW_SIZE, hparams.attn_window_size, true);
-                 } break;
+                } break;
2399-2419: Consider adding input validation for edge cases.

The resize_bicubic_pillow function doesn't validate that target_width and target_height are positive before proceeding. While the caller (resize()) should provide valid dimensions, defensive checks would prevent potential division-by-zero or empty buffer issues.
Proposed validation
     static bool resize_bicubic_pillow(const clip_image_u8 & img, clip_image_u8 & dst, int target_width, int target_height) {
+        if (target_width <= 0 || target_height <= 0 || img.nx <= 0 || img.ny <= 0) {
+            return false;
+        }
+
         // Fixed-point precision: 22 bits = 32 (int32_t) - 8 (uint8_t pixels) - 2 (headroom for accumulation)
2990-2998: Float equality comparison may have precision issues.

At line 2993, else if (diff == best_diff) compares floating-point values for exact equality. Due to floating-point precision, this comparison might not trigger when mathematically expected, or might trigger unexpectedly. Consider using a small epsilon for comparison.

However, if this is intentionally mirroring Python's behavior for compatibility, this may be acceptable.

3268-3295: Consider using existing img_tool helpers for padding and compositing.

The manual padding fill loop (lines 3275-3280) and image copy loop (lines 3287-3295) duplicate functionality already available in img_tool::fill() and img_tool::composite(). Using these helpers would improve consistency and reduce code duplication.
Proposed refactor
                 clip_image_u8_ptr padded_img(clip_image_u8_init());
                 padded_img->nx = image_size;
                 padded_img->ny = image_size;
-                padded_img->buf.resize(image_size * image_size * 3); // black padding
+                padded_img->buf.resize(image_size * image_size * 3);

-                // Fill with mean color
-                for (int i = 0; i < image_size * image_size; ++i)
-                {
-                    padded_img->buf[i * 3 + 0] = pad_r;
-                    padded_img->buf[i * 3 + 1] = pad_g;
-                    padded_img->buf[i * 3 + 2] = pad_b;
-                }
+                // Fill with mean color
+                img_tool::fill(*padded_img, {pad_r, pad_g, pad_b});

                 // Calculate padding offsets (center the image)
                 int pad_x = (image_size - new_w) / 2;
                 int pad_y = (image_size - new_h) / 2;

-                // Copy scaled image into padded canvas
-                for (int y = 0; y < new_h; ++y){
-                    for (int x = 0; x < new_w; ++x){
-                        int src_idx = (y * new_w + x) * 3;
-                        int dst_idx = ((y + pad_y) * image_size + (x + pad_x)) * 3;
-                        padded_img->buf[dst_idx + 0] = scaled_img->buf[src_idx + 0];
-                        padded_img->buf[dst_idx + 1] = scaled_img->buf[src_idx + 1];
-                        padded_img->buf[dst_idx + 2] = scaled_img->buf[src_idx + 2];
-                    }
-                }
+                // Copy scaled image into padded canvas
+                img_tool::composite(*padded_img, *scaled_img, pad_x, pad_y);

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfc959b and 4d91711.

📒 Files selected for processing (30)

convert_hf_to_gguf.py
ggml/src/ggml-cuda/upscale.cu
ggml/src/ggml.c
gguf-py/gguf/constants.py
gguf-py/gguf/gguf_writer.py
gguf-py/gguf/tensor_mapping.py
src/llama-arch.cpp
src/llama-arch.h
src/llama-chat.cpp
src/llama-chat.h
src/llama-graph.cpp
src/llama-kv-cache.cpp
src/llama-model.cpp
src/llama-vocab.cpp
src/models/deepseek2.cpp
tools/mtmd/CMakeLists.txt
tools/mtmd/clip-impl.h
tools/mtmd/clip-model.h
tools/mtmd/clip.cpp
tools/mtmd/clip.h
tools/mtmd/models/deepseekocr.cpp
tools/mtmd/models/glm4v.cpp
tools/mtmd/models/models.h
tools/mtmd/models/siglip.cpp
tools/mtmd/mtmd.cpp
tools/mtmd/tests.sh
tools/mtmd/tests/test-1-extracted.md
tools/mtmd/tests/test-1-extracted.txt
tools/mtmd/tests/test-deepseek-ocr.py
tools/mtmd/tests/tests-requirements.txt

🧰 Additional context used

📓 Path-based instructions (4)

**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

src/llama-arch.h
tools/mtmd/clip.h
src/llama-chat.h
tools/mtmd/models/siglip.cpp
src/llama-vocab.cpp
src/models/deepseek2.cpp
tools/mtmd/models/models.h
src/llama-chat.cpp
tools/mtmd/models/deepseekocr.cpp
src/llama-kv-cache.cpp
tools/mtmd/mtmd.cpp
tools/mtmd/clip.cpp
src/llama-graph.cpp
tools/mtmd/models/glm4v.cpp
tools/mtmd/clip-model.h
src/llama-arch.cpp
src/llama-model.cpp
tools/mtmd/clip-impl.h

**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

src/llama-arch.h
tools/mtmd/clip.h
src/llama-chat.h
tools/mtmd/models/siglip.cpp
src/llama-vocab.cpp
src/models/deepseek2.cpp
tools/mtmd/models/models.h
gguf-py/gguf/tensor_mapping.py
tools/mtmd/tests/test-deepseek-ocr.py
src/llama-chat.cpp
tools/mtmd/models/deepseekocr.cpp
src/llama-kv-cache.cpp
gguf-py/gguf/gguf_writer.py
tools/mtmd/mtmd.cpp
tools/mtmd/clip.cpp
gguf-py/gguf/constants.py
src/llama-graph.cpp
tools/mtmd/models/glm4v.cpp
tools/mtmd/clip-model.h
src/llama-arch.cpp
src/llama-model.cpp
convert_hf_to_gguf.py
tools/mtmd/clip-impl.h

src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

src/llama-vocab.cpp
src/models/deepseek2.cpp
src/llama-chat.cpp
src/llama-kv-cache.cpp
src/llama-graph.cpp
src/llama-arch.cpp
src/llama-model.cpp

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

gguf-py/gguf/tensor_mapping.py
tools/mtmd/tests/test-deepseek-ocr.py
gguf-py/gguf/gguf_writer.py
gguf-py/gguf/constants.py
convert_hf_to_gguf.py

🧠 Learnings (4)

📓 Common learnings

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

📚 Learning: 2025-11-29T22:55:53.865Z

Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

src/llama-arch.h
src/llama-kv-cache.cpp
src/llama-graph.cpp
src/llama-arch.cpp
src/llama-model.cpp

📚 Learning: 2025-11-29T22:55:53.865Z

Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical

Applied to files:

src/llama-arch.h

📚 Learning: 2025-05-26T09:45:20.653Z

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

gguf-py/gguf/tensor_mapping.py
tools/mtmd/mtmd.cpp
tools/mtmd/clip.cpp
tools/mtmd/CMakeLists.txt
tools/mtmd/clip-impl.h

🧬 Code graph analysis (8)

tools/mtmd/clip.h (1)

tools/mtmd/clip.cpp (4)

clip_is_deepseekocr (4080-4082)

clip_is_deepseekocr (4080-4080)

ctx (2690-2793)

ctx (2690-2690)

tools/mtmd/models/siglip.cpp (2)

ggml/src/ggml.c (1)

ggml_mul_mat (3174-3189)

tools/mtmd/clip.cpp (5)

model (218-220)

model (937-1266)

model (937-937)

model (1982-1995)

model (1982-1982)

tools/mtmd/models/models.h (2)

tools/mtmd/clip.cpp (14)

clip_graph (227-254)

clip_ctx (165-208)

clip_ctx (210-215)

ctx (2690-2793)

ctx (2690-2690)

img (1786-1786)

img (2283-2289)

img (2283-2283)

img (2329-2390)

img (2329-2329)

img (2399-2639)

img (2399-2399)

img (2795-2827)

img (2795-2795)

tools/mtmd/models/deepseekocr.cpp (2)

build (91-324)

build (91-91)

tools/mtmd/models/deepseekocr.cpp (1)

ggml/src/ggml.c (9)

ggml_reshape_4d (3583-3601)

ggml_cont (3461-3465)

ggml_permute (3700-3752)

ggml_view_4d (3676-3696)

ggml_reshape_3d (3564-3581)

ggml_interpolate (4949-4958)

ggml_get_rows (3776-3797)

ggml_cast (3433-3445)

ggml_concat (2517-2544)

gguf-py/gguf/gguf_writer.py (1)

gguf-py/gguf/constants.py (3)

Keys (20-332)

ClipVision (282-309)

SAM (306-309)

src/llama-model.cpp (1)

src/llama-model-loader.cpp (2)

create_tensor (794-815)

create_tensor (794-794)

convert_hf_to_gguf.py (2)

gguf-py/gguf/gguf_writer.py (8)

add_clip_projector_type (1076-1077)

add_vision_attention_layernorm_eps (1097-1098)

add_vision_use_gelu (1118-1119)

add_vision_projector_scale_factor (1124-1125)

add_vision_window_size (1115-1116)

add_vision_sam_layers_count (1133-1134)

add_vision_sam_embedding_length (1136-1137)

add_vision_sam_head_count (1139-1140)

gguf-py/gguf/constants.py (3)

VisionProjectorType (3491-3512)

GGMLQuantizationType (3366-3398)

MODEL_ARCH (347-456)

tools/mtmd/clip-impl.h (1)

ggml/src/ggml.c (3)

ggml_type_name (1282-1284)

ggml_op_desc (1306-1316)

ggml_nelements (1224-1228)

🪛 LanguageTool

tools/mtmd/tests/test-1-extracted.md

[grammar] ~48-~48: Use a hyphen to join words.
Context: ...det|> APOLLO CONTROL: The most major sky to sky will be for the 23 event, that is...

(QB_NEW_EN_HYPHEN)

[grammar] ~48-~48: Use a hyphen to join words.
Context: ...|> APOLLO CONTROL: The most major sky to sky will be for the 23 event, that is at...

(QB_NEW_EN_HYPHEN)

🪛 Ruff (0.14.10)

tools/mtmd/tests/test-deepseek-ocr.py

41-41: subprocess call: check for execution of untrusted input

(S603)

51-51: Avoid specifying long messages outside the exception class

(TRY003)

convert_hf_to_gguf.py

1698-1698: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

5998-5998: Avoid specifying long messages outside the exception class

(TRY003)

6008-6008: Unused method argument: new_name

(ARG002)

6008-6008: Unused method argument: bid

(ARG002)

6008-6008: Unused method argument: n_dims

(ARG002)

6015-6015: Unused method argument: bid

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: windows-2022-cmake-cuda (12.4)
GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
GitHub Check: windows-latest-cmake-hip
GitHub Check: macOS-latest-cmake-visionos
GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
GitHub Check: ubuntu-24-cmake-vulkan
GitHub Check: ubuntu-22-cmake-hip
GitHub Check: ubuntu-24-wasm-webgpu
GitHub Check: macOS-latest-cmake-arm64-webgpu
GitHub Check: macOS-latest-cmake-arm64
GitHub Check: server-windows
GitHub Check: pyright type-check

🔇 Additional comments (41)

tools/mtmd/tests.sh (3)

156-156: LGTM!

The flash attention parameter is correctly forwarded to the test command with proper quoting.

111-112: Verify that test validation logic works for OCR-specific tests.

These new OCR models use prompts like "Free OCR." and "extract all texts from this image", but the test validation logic (lines 169-175) still checks for "new york" or ("men" AND "walk"). This validation assumes the test image contains specific content. Ensure that the default test image (test-1.jpeg) actually contains text that would make OCR extraction meaningful and that the validation criteria remain appropriate.

111-111: Change --chat-template deepseek-ocr to --chat-template deepseek.

The official llama.cpp DeepSeek-OCR support PR uses --chat-template deepseek, not deepseek-ocr. Update the test command accordingly.

Likely an incorrect or invalid review comment.

tools/mtmd/CMakeLists.txt (1)

30-30: DeepSeekOCR model correctly added to mtmd target

Including models/deepseekocr.cpp in the mtmd library sources looks consistent with how other vision models are wired; no further build changes needed here.

tools/mtmd/models/models.h (1)

60-63: clip_graph_deepseekocr declaration matches existing clip_graph pattern

Constructor and build() override follow the same convention as other clip_graph_* structs; this is safe and consistent.

tools/mtmd/clip.h (1)

105-111: DeepSeekOCR predicate integrates cleanly with existing clip_is_ helpers*

clip_is_deepseekocr(const struct clip_ctx * ctx) matches the existing pattern (e.g., clip_is_llava, clip_is_gemma3) and its implementation uses ctx->proj_type() == PROJECTOR_TYPE_DEEPSEEKOCR, which is consistent with the new projector type.

src/llama-vocab.cpp (1)

2353-2375: DeepSeek OCR end-of-sentence token correctly marked as EOG

Adding "<｜end▁of▁sentence｜>" to the EOG-text list keeps special_eog_ids consistent with the earlier special_eot_id auto-detection for the same token, so generation will now stop on it as intended for DeepSeek OCR-style models.

tools/mtmd/tests/test-1-extracted.txt (1)

1-42: OCR test fixture content looks fine

The added text file is pure test data and is appropriate as a moderately complex OCR sample; no code or tooling concerns here.

tools/mtmd/clip-impl.h (1)

56-58: DeepSeek OCR keys, tensor names, and projector type wiring look consistent

SAM hyperparameter keys (KEY_SAM_N_HEAD, KEY_SAM_N_BLOCK, KEY_SAM_N_EMBD) follow the existing clip.vision.* naming scheme.

Updated TN_IMAGE_NEWLINE to "v.image_newline" and new TN_IMAGE_SEPERATOR/"v.view_seperator" align with the other v.* vision tensor prefixes and the strings used in the DeepSeek-OCR graph / converter.

TN_MM_PROJECTOR changed to "mm.model.fc.%s" is compatible with models that expose both weight and bias under the same prefix, and is used alongside PROJECTOR_TYPE_IDEFICS3 / PROJECTOR_TYPE_DEEPSEEKOCR paths.

TN_SAM_* macros for SAM blocks (pos embedding, qkv, neck, net, etc.) match the tensor names consumed in models/deepseekocr.cpp and give a clear, consistent naming pattern for SAM layers.

PROJECTOR_TYPE_DEEPSEEKOCR is added to the projector_type enum and registered in PROJECTOR_TYPE_NAMES as "deepseekocr", which keeps clip_projector_type_from_string() and ctx->proj_type() dispatch coherent.

Given this is a mirror of upstream wiring, these additions look correct and low-risk. Based on learnings, please surface any upstream inconsistencies to ngxson if you notice them during integration.

Also applies to: 100-107, 145-157, 180-238

tools/mtmd/mtmd.cpp (2)

177-177: LGTM: Proper flash attention type derivation

This change correctly derives the CLIP flash attention type from the user-provided parameter instead of hardcoding AUTO, allowing proper control over flash attention behavior.

834-835: LGTM: DeepSeekOCR encoding path

Appropriately adds DeepSeekOCR to the condition for non-batched per-image encoding, mirroring the existing GLM behavior. This aligns with the broader DeepSeekOCR integration.

tools/mtmd/models/siglip.cpp (1)

46-46: LGTM: Consistent projection weight renaming

The change from model.projection to model.fc_w aligns with the broader refactor to standardize projection weight naming across the codebase. This is consistent with similar changes in glm4v.cpp and other model files.

src/llama-chat.h (1)

31-31: LGTM: DeepSeekOCR chat template enumeration

The new LLM_CHAT_TEMPLATE_DEEPSEEK_OCR enum value is appropriately positioned after the other DeepSeek variants and supports the broader DeepSeekOCR integration.

src/llama-graph.cpp (1)

1169-1172: LGTM: Correct tensor reported to callback

This fixes the callback to report the actually weighted tensor (experts) instead of the pre-weighted tensor (cur) when weight_before_ffn is false. This makes the callback consistent with Line 1091, which reports cur when weights are applied before FFN.

tools/mtmd/tests/test-1-extracted.md (1)

1-85: LGTM: OCR test data with expected artifacts

This test data file provides realistic OCR output including structured markers and coordinates. The grammar warnings flagged by static analysis (e.g., "sky to sky" hyphenation at line 48) are likely intentional artifacts of real OCR output and should be preserved to ensure tests validate against realistic scenarios.

src/llama-arch.h (1)

74-74: LGTM: DeepSeekOCR architecture enumeration

The new LLM_ARCH_DEEPSEEK2OCR enum value is appropriately positioned after LLM_ARCH_DEEPSEEK2, following logical naming conventions and enabling OCR-specific architecture handling throughout the codebase.

tools/mtmd/models/glm4v.cpp (1)

98-98: LGTM: Consistent projection weight renaming

The change from model.projection to model.fc_w in the FC projector is consistent with the broader refactor to standardize projection weight naming across model implementations.

src/models/deepseek2.cpp (3)

7-7: LGTM: OCR flag initialization follows existing pattern.

The is_ocr flag is initialized consistently with the existing is_lite pattern.

85-88: Verify temperature scaling support for OCR path.

The OCR attention path doesn't apply inp_attn_scale (temperature scaling) before calling build_attn, unlike both the MLA path (lines 173-177) and non-MLA path (lines 211-215). If hparams.f_attn_temp_scale is set for an OCR model, this configuration would be silently ignored.

Confirm whether OCR models intentionally don't support temperature scaling or if this should be added for consistency.

Based on learnings, please relay this concern to the upstream contributor.

79-83: Reconsider hardcoded freq_base assertion for OCR rope configuration.

The OCR path asserts freq_base=10000.0 and applies simplified rope parameters (freq_scale=1, ext_factor=0, attn_factor=1, beta_fast=0, beta_slow=0), while the non-OCR path uses dynamic rope scaling parameters. The assertion prevents invalid models from running but will fail if any DeepSeek-OCR variant uses a different freq_base. Consider using the model's freq_base variable instead of asserting a fixed value, or document why all DeepSeek-OCR models must use freq_base=10000.0. The simplified rope parameters appear intentional for fixed-token OCR (unlike the variable-length context extension in DeepSeek-V2), but this design choice deserves a code comment explaining the rationale.

src/llama-chat.cpp (2)

52-52: LGTM: Chat template mapping added correctly.

The "deepseek-ocr" mapping follows the existing pattern.

545-549: LGTM: Passthrough implementation is appropriate for OCR.

The implementation correctly concatenates message content without applying template formatting, which is appropriate for OCR use cases where the content is likely image data or pre-formatted text.

src/llama-arch.cpp (2)

70-70: LGTM: Architecture name added correctly.

The architecture name mapping follows the existing pattern and uses a descriptive name.

1466-1497: Verify tensor list completeness for OCR variant.

The DEEPSEEK2OCR tensor list includes both simple attention tensors (LLM_TENSOR_ATTN_Q, LLM_TENSOR_ATTN_K, LLM_TENSOR_ATTN_V at lines 1474-1476) and MLA-style tensors (LLM_TENSOR_ATTN_Q_A, LLM_TENSOR_ATTN_Q_B, LLM_TENSOR_ATTN_KV_A_MQA, etc., at lines 1477-1483). The OCR implementation in src/models/deepseek2.cpp (lines 68-70) only uses the simple wq/wk/wv tensors.

Confirm whether:

The MLA tensors are included for future flexibility or model variants.

All tensors in this list are actually used/loaded for OCR models, or if some are unused.

Including unused tensors in the list doesn't break functionality but could cause unnecessary memory allocation attempts during model loading.

Based on learnings, please relay this concern to the upstream contributor.

src/llama-model.cpp (2)

1619-1663: DeepSeek2OCR hparams/type and RoPE classification look consistent with DeepSeek2

Treating LLM_ARCH_DEEPSEEK2OCR as a sibling of LLM_ARCH_DEEPSEEK2 in load_hparams() and adding is_ocr to gate Q_LORA_RANK / KV_LORA_RANK reads makes sense and avoids requiring absent keys for OCR GGUFs.

The added case 12: type = LLM_TYPE_3B; in the DeepSeek2* block is reasonable for the 12‑layer 3B OCR LM and also harmless for any 12‑layer non‑OCR DeepSeek2 variants.

Mapping LLM_ARCH_DEEPSEEK2OCR to LLAMA_ROPE_TYPE_NORM alongside LLM_ARCH_DEEPSEEK2 aligns with the upstream “Fix RoPE type for DeepSeek‑OCR LM” intent and keeps RoPE handling consistent.

Also applies to: 7834-7835

4685-4713: is_ocr tensor layout and graph builder reuse for DeepSeek2OCR are coherent

The new is_ocr flag in the LLM_ARCH_DEEPSEEK2/DEEPSEEK2OCR tensor‑creation block cleanly splits the OCR path: OCR layers build standard {n_embd, n_embd} Q/K/V/O and the usual dense/MoE FFN + shared‑expert tensors, then continue, so MLA‑specific and LoRA‑specific tensors are never expected for OCR.

Shapes in the OCR path mirror the existing DeepSeek MoE layout (including ffn_gate_inp, ffn_gate_exps, ffn_down_exps, ffn_up_exps, and shared‑expert tensors), which should match the DeepSeek‑OCR GGUF weights as exported in the upstream PR.

Routing both LLM_ARCH_DEEPSEEK2 and LLM_ARCH_DEEPSEEK2OCR through llm_build_deepseek2 in build_graph() keeps the execution path unified while still allowing the builder to branch based on which tensors were instantiated.

Also applies to: 4715-4742, 7482-7485

gguf-py/gguf/gguf_writer.py (1)

1115-1140: LGTM!

The new vision/SAM writer methods follow the established patterns in this file. They correctly use add_uint32 for integer metadata and reference the appropriate keys from Keys.ClipVision.WINDOW_SIZE and Keys.ClipVision.SAM.* defined in constants.py.

gguf-py/gguf/tensor_mapping.py (2)

1528-1578: LGTM!

The SAM tensor mappings are comprehensive and follow the established pattern. The tensor paths (model.sam_model.*) align with the usage in deepseekocr.cpp.

1272-1274: The tensor name uses model.view_seperator (note: "seperator" not "separator"), which matches DeepSeek-OCR's actual model weights. Since this exact spelling appears identically in both tensor_mapping.py and constants.py, it is intentional and necessary for correct tensor lookup during model conversion. No change needed.

tools/mtmd/models/deepseekocr.cpp (5)

5-24: LGTM!

The window_partition function correctly implements sliding window partitioning with proper padding calculation and dimension reshaping. The implementation follows the referenced approach from Acly's comment.

28-51: LGTM!

The window_unpartition function correctly reverses the window partition operation, properly handling the removal of padding via ggml_view_4d.

53-89: LGTM!

The get_rel_pos function correctly handles relative position embedding extraction with optional bilinear interpolation for size mismatches. The assertions provide good input validation.

294-324: LGTM!

The final section correctly combines SAM and CLIP outputs, applies the projection layer, and adds the special tokens (image newline and view separator). The square dimension assumption aligns with the expected output format.

276-276: This review comment flagged a non-existent scope issue. The n_embd variable declared at line 98 in the SAM building block is properly accessible at line 276, which is within a nested conditional inside the same block. In C++, variables declared in an outer scope are accessible in all nested inner scopes.

tools/mtmd/tests/test-deepseek-ocr.py (1)

16-55: LGTM!

The run_mtmd_deepseek_ocr function properly handles subprocess execution with timeout, captures output, and provides error handling. The static analysis hint (S603) about untrusted input is a false positive in this test context where arguments come from controlled CLI defaults.

convert_hf_to_gguf.py (1)

715-721: DeepSeek-OCR: language_config → text_config remap looks correct

The new language_config handling matches the released DeepSeek-OCR config.json structure and ensures TextModel sees the language side as text_config. This should let the DeepseekV2 text decoder reuse the existing DeepseekV2 path without special casing. I don’t see issues here.

gguf-py/gguf/constants.py (1)

293-309: LGTM - DeepSeek-OCR constants are well-structured.

The new architecture, tensor definitions, and mappings follow existing conventions consistently. The SAM-related keys and tensor names are properly organized under the ClipVision namespace, and the DEEPSEEK2OCR architecture mirrors DEEPSEEK2 appropriately for an OCR variant.

Also applies to: 413-413, 698-712, 823-823, 1107-1121, 2441-2472, 3317-3320, 3510-3510

tools/mtmd/clip-model.h (2)

65-119: LGTM - SAM hyperparameters and attention helper methods.

The SAM-specific fields (sam_n_layer, sam_n_head, sam_n_embd) follow existing patterns with sensible defaults. The global_attn_indices() and is_global_attn() helper methods correctly implement SAM's global attention layer selection pattern.

166-168: LGTM - SAM layer and encoder fields are properly structured.

The relative position tensors in clip_layer and SAM encoder fields in clip_model are correctly initialized and follow existing conventions. The sam_layers vector appropriately mirrors the existing layers pattern for the SAM encoder.

Also applies to: 340-343, 354-356

tools/mtmd/clip.cpp (2)

4080-4082: LGTM!

The clip_is_deepseekocr predicate function follows the established pattern of other projector type checks in this file.

839-843: LGTM - DEEPSEEKOCR graph builder integration.

The new case for PROJECTOR_TYPE_DEEPSEEKOCR follows the established pattern for adding new projector types to the graph building switch statement.

coderabbitai · 2025-12-23T00:09:22Z

convert_hf_to_gguf.py

+@ModelBase.register("DeepseekOCRForCausalLM")
+class DeepseekOCRVisionModel(MmprojModel):
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        hparams = self.hparams
+        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.DEEPSEEKOCR)
+        # default values below are taken from HF tranformers code
+        self.gguf_writer.add_vision_attention_layernorm_eps(hparams.get("layer_norm_eps", 1e-6))
+        self.gguf_writer.add_vision_use_gelu(True)
+        # calculate proj_scale_factor (used by tinygemma3 test model)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        n_per_side = int(image_seq_length ** 0.5)
+        image_size = self.hparams["image_size"]
+        patch_size = self.hparams["patch_size"]
+        proj_scale_factor = (image_size // patch_size) // n_per_side
+        if proj_scale_factor > 0 and proj_scale_factor != 4:
+            # we only need to write this if it's not the default value
+            # in this case, we are converting a test model
+            self.gguf_writer.add_vision_projector_scale_factor(proj_scale_factor)
+        # @bluebread: there's no window_size in config but just add it here anyway
+        self.gguf_writer.add_vision_window_size(self.hparams.get("window_size", 14))
+
+        # SAM configuration
+        sam_hparams = hparams['sam']
+        self.gguf_writer.add_vision_sam_layers_count(sam_hparams['layers'])
+        self.gguf_writer.add_vision_sam_embedding_length(sam_hparams['width'])
+        self.gguf_writer.add_vision_sam_head_count(sam_hparams['heads'])
+
+    def get_vision_config(self) -> dict[str, Any]:
+        vision_config: dict[str, Any] | None = self.global_config.get("vision_config")
+
+        if not vision_config:
+            raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found")
+
+        vision_config['sam'] = vision_config['width']['sam_vit_b']
+        vision_config.update(vision_config['width']['clip-l-14-224'])
+        vision_config['hidden_size'] = vision_config['width']
+        vision_config['num_heads'] = vision_config['heads']
+        vision_config['intermediate_size'] = vision_config['heads'] * 4
+
+        return vision_config


🛠️ Refactor suggestion | 🟠 Major

DeepSeek-OCR: intermediate_size derivation in get_vision_config looks incorrect

get_vision_config() currently sets:

vision_config['hidden_size'] = vision_config['width'] vision_config['num_heads'] = vision_config['heads'] vision_config['intermediate_size'] = vision_config['heads'] * 4

For the released DeepSeek-OCR config, CLIP-L has width = 1024, heads = 16, and mlp_ratio ≈ 3.7362, so the MLP/FFN dim should be on the order of hidden_size * mlp_ratio (~3.8k), not heads * 4 (64). That gives nonsense FFN metadata for the vision encoder and may break consumers that rely on GGUF metadata instead of inferring from weights.

I’d strongly recommend computing intermediate_size from the CLIP config instead of heads * 4, e.g.:

Proposed fix for DeepSeek-OCR vision config

def get_vision_config(self) -> dict[str, Any]: vision_config: dict[str, Any] | None = self.global_config.get("vision_config") if not vision_config: raise ValueError("DeepseekOCR model requires 'vision_config' in the model configuration, but it was not found") - vision_config['sam'] = vision_config['width']['sam_vit_b'] - vision_config.update(vision_config['width']['clip-l-14-224']) - vision_config['hidden_size'] = vision_config['width'] - vision_config['num_heads'] = vision_config['heads'] - vision_config['intermediate_size'] = vision_config['heads'] * 4 + # Extract SAM and CLIP-L configs from nested vision_config["width"] + width_cfg = vision_config["width"] + sam_cfg = width_cfg["sam_vit_b"] + clip_cfg = width_cfg["clip-l-14-224"] + + # Promote SAM config and CLIP-L config to top level + vision_config["sam"] = sam_cfg + vision_config.update(clip_cfg) + + # Normalize fields expected by MmprojModel / gguf_writer + vision_config["hidden_size"] = clip_cfg["width"] + vision_config["num_heads"] = clip_cfg["heads"] + + mlp_ratio = vision_config.get("mlp_ratio") + if mlp_ratio is not None: + vision_config["intermediate_size"] = int(round(clip_cfg["width"] * mlp_ratio)) + else: + # Fallback: standard ViT MLP ratio + vision_config["intermediate_size"] = clip_cfg["width"] * 4 return vision_config

Static-analysis nits in this class

tensor_force_quant(self, name, new_name, bid, n_dims) doesn’t use new_name, bid, or n_dims.

modify_tensors(self, data_torch, name, bid) doesn’t use bid.

To keep the signature compatible with the base class while satisfying linters, consider explicitly deleting the unused arguments:

Suggested cleanups for unused parameters

def tensor_force_quant(self, name, new_name, bid, n_dims): - if ".embeddings." in name or 'pos_embed' in name: + del new_name, bid, n_dims # unused, required by base signature + if ".embeddings." in name or 'pos_embed' in name: return gguf.GGMLQuantizationType.F32 if ".rel_pos_h" in name or '.rel_pos_w' in name: return gguf.GGMLQuantizationType.F32 return gguf.GGMLQuantizationType.F16 def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]: - # Only process vision-related tensors, skip language model tensors + del bid # unused + # Only process vision-related tensors, skip language model tensors

🧰 Tools

🪛 Ruff (0.14.10)

5998-5998: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In convert_hf_to_gguf.py around lines 5966 to 6006, the DeepseekOCRVisionModel.get_vision_config sets intermediate_size incorrectly as heads * 4 and leaves some method parameters unused; change intermediate_size to compute from the model hidden size and an mlp_ratio (e.g. intermediate_size = int(vision_config.get("mlp_ratio", 4.0) * vision_config["hidden_size"])) so it matches released CLIP configs, ensure hidden_size and num_heads are set from the correct keys before computing mlp_ratio, and fix unused-parameter lint warnings in tensor_force_quant and modify_tensors by either deleting the unused args inside the methods (e.g. del new_name, del bid, del n_dims) or renaming them to _new_name, _bid, _n_dims to keep signatures compatible.

coderabbitai · 2025-12-23T00:09:22Z

convert_hf_to_gguf.py

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        hparams: dict = ModelBase.load_hparams(self.dir_model, is_mistral_format=False)
+        self.origin_hf_arch = hparams.get('architectures', [None])[0]
+
+        if self.origin_hf_arch == "DeepseekOCRForCausalLM":
+            self.model_arch = gguf.MODEL_ARCH.DEEPSEEK2OCR
+            self.gguf_writer.arch = gguf.MODEL_ARCH_NAMES[self.model_arch]
+            self.gguf_writer.add_architecture()
+


⚠️ Potential issue | 🟡 Minor

DeepseekV2 OCR path: minor rope metadata duplication; rest looks good

The OCR-specific wiring in DeepseekV2Model looks consistent with the DeepSeek-OCR config:

__init__ detecting origin_hf_arch == "DeepseekOCRForCausalLM" and switching model_arch to DEEPSEEK2OCR (plus updating gguf_writer.arch and adding the architecture) matches config.json and is necessary so OCR decoders are distinguishable from standard DeepSeek-V2.

In set_gguf_parameters:

For OCR (is_ocr), you avoid forcing num_key_value_heads = 1 and skip MLA key/value fields, which is correct given use_mla: false and the qk_* dims being zero in the language config.

For non-OCR, the MLA-specific kv_lora_rank defaults and add_key_length*_mla / add_value_length*_mla calls look aligned with existing DeepSeek-V2 handling.

New routed_scaling_factor and norm_topk_prob defaults wired into add_expert_weights_scale / add_expert_weights_norm make sense.

One small issue: for non-OCR models, add_rope_dimension_count(hparams["qk_rope_head_dim"]) is now called twice — once inside the if not is_ocr: block and again unconditionally at the end of the method. It’s harmless but redundant and will produce duplicate GGUF fields.

Proposed deduplication of `add_rope_dimension_count`

def set_gguf_parameters(self): is_ocr = (self.model_arch == gguf.MODEL_ARCH.DEEPSEEK2OCR) @@ # note: deepseek2 using MLA converts into MQA with larger heads, then decompresses to MHA - if not is_ocr: - self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"]) - self.gguf_writer.add_value_length(kv_lora_rank) - self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"]) - self.gguf_writer.add_value_length_mla(hparams["v_head_dim"]) - self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"]) + if not is_ocr: + self.gguf_writer.add_key_length(kv_lora_rank + hparams["qk_rope_head_dim"]) + self.gguf_writer.add_value_length(kv_lora_rank) + self.gguf_writer.add_key_length_mla(hparams["qk_nope_head_dim"] + hparams["qk_rope_head_dim"]) + self.gguf_writer.add_value_length_mla(hparams["v_head_dim"]) @@ - self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"]) - self.gguf_writer.add_expert_weights_scale(routed_scaling_factor) - self.gguf_writer.add_expert_weights_norm(norm_topk_prob) - - self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"]) + self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"]) + self.gguf_writer.add_expert_weights_scale(routed_scaling_factor) + self.gguf_writer.add_expert_weights_norm(norm_topk_prob) + + # Common RoPE dim for both OCR and non-OCR variants + self.gguf_writer.add_rope_dimension_count(hparams["qk_rope_head_dim"])

The additions in modify_tensors to also skip sam_model, image_newline, and view_seperator tensors look correct for DeepSeek-OCR: those should be handled by the mmproj/vision stream rather than the text model.

Also applies to: 7259-7274, 7279-7288, 7292-7295, 7307-7313

coderabbitai · 2025-12-23T00:09:22Z

gguf-py/gguf/constants.py

+    MODEL_TENSOR.V_SAM_NET_2:               "v.sam.net_2",
+    MODEL_TENSOR.V_SAM_NET_3:               "v.sam.net_3",
+    MODEL_TENSOR.V_ENC_EMBD_IMGNL:          "v.image_newline", # Deepseek-OCR
+    MODEL_TENSOR.V_ENC_EMBD_VSEP:           "v.view_seperator", # Deepseek-OCR


⚠️ Potential issue | 🟡 Minor

Typo: "seperator" should be "separator".

The string "v.view_seperator" contains a spelling error. This should be "v.view_separator" for correctness.

Note: This typo also appears in tools/mtmd/clip-model.h at line 244 (view_seperator). Both should be corrected together to maintain consistency.

🔎 Proposed fix

- MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_seperator", # Deepseek-OCR + MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_separator", # Deepseek-OCR

🤖 Prompt for AI Agents

In gguf-py/gguf/constants.py around line 1122 (and also in tools/mtmd/clip-model.h at line 244), there is a spelling mistake "v.view_seperator" that should be "v.view_separator"; update both occurrences to the correct spelling to keep the keys consistent across code and any serialized formats (perform a targeted rename in both files and run any tests/linters to ensure no other references remain).

coderabbitai · 2025-12-23T00:09:22Z

tools/mtmd/clip-model.h

    ggml_tensor * mm_2_b = nullptr;

    ggml_tensor * image_newline = nullptr;
+    ggml_tensor * view_seperator = nullptr;


⚠️ Potential issue | 🟡 Minor

Typo: "view_seperator" should be "view_separator".

This matches the typo flagged in gguf-py/gguf/constants.py. Both should be corrected together.

🔎 Proposed fix

- ggml_tensor * view_seperator = nullptr; + ggml_tensor * view_separator = nullptr;

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In tools/mtmd/clip-model.h around line 244, the variable name "view_seperator" contains a typo; rename it to "view_separator" and update all references in this file to the new name, ensuring declarations, initializations, and any uses/assignments (including nullptr) are consistently changed; also update the corresponding typo in gguf-py/gguf/constants.py in the same commit so both places match.

coderabbitai · 2025-12-23T00:09:22Z

tools/mtmd/clip-model.h

+    ggml_tensor * neck_0_w;
+    ggml_tensor * neck_1_w;
+    ggml_tensor * neck_1_b;
+    ggml_tensor * neck_2_w;
+    ggml_tensor * neck_3_w;
+    ggml_tensor * neck_3_b;
+    ggml_tensor * net_2;
+    ggml_tensor * net_3;


⚠️ Potential issue | 🟠 Major

Missing nullptr initialization for tensor pointers.

These tensor pointers are not initialized to nullptr, unlike most other tensor pointers in this struct. This inconsistency could lead to undefined behavior if the pointers are checked or accessed before being explicitly set.

🔎 Proposed fix

- ggml_tensor * neck_0_w; - ggml_tensor * neck_1_w; - ggml_tensor * neck_1_b; - ggml_tensor * neck_2_w; - ggml_tensor * neck_3_w; - ggml_tensor * neck_3_b; - ggml_tensor * net_2; - ggml_tensor * net_3; + ggml_tensor * neck_0_w = nullptr; + ggml_tensor * neck_1_w = nullptr; + ggml_tensor * neck_1_b = nullptr; + ggml_tensor * neck_2_w = nullptr; + ggml_tensor * neck_3_w = nullptr; + ggml_tensor * neck_3_b = nullptr; + ggml_tensor * net_2 = nullptr; + ggml_tensor * net_3 = nullptr;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ggml_tensor * neck_0_w;

ggml_tensor * neck_1_w;

ggml_tensor * neck_1_b;

ggml_tensor * neck_2_w;

ggml_tensor * neck_3_w;

ggml_tensor * neck_3_b;

ggml_tensor * net_2;

ggml_tensor * net_3;

ggml_tensor * neck_0_w = nullptr;

ggml_tensor * neck_1_w = nullptr;

ggml_tensor * neck_1_b = nullptr;

ggml_tensor * neck_2_w = nullptr;

ggml_tensor * neck_3_w = nullptr;

ggml_tensor * neck_3_b = nullptr;

ggml_tensor * net_2 = nullptr;

ggml_tensor * net_3 = nullptr;

🤖 Prompt for AI Agents

In tools/mtmd/clip-model.h around lines 345 to 352, the tensor pointer fields (neck_0_w, neck_1_w, neck_1_b, neck_2_w, neck_3_w, neck_3_b, net_2, net_3) are not initialized like other tensor pointers; initialize each of these pointers to nullptr at declaration to match the struct's initialization pattern and avoid undefined behavior, and also verify whether net_2/net_3 are typos (should they be named neck_2/neck_3) and rename them if appropriate.

coderabbitai · 2025-12-23T00:09:23Z

tools/mtmd/clip.cpp

+                    model.sam_layers.resize(model.n_sam_layers);
+                    for (int il = 0; il < model.n_sam_layers; ++il) {
+                        auto & layer    = model.sam_layers[il];
+                        layer.qkv_w     = get_tensor(string_format(TN_SAM_ATTN_QKV, il, "weight"));
+                        layer.qkv_b     = get_tensor(string_format(TN_SAM_ATTN_QKV, il, "bias"));
+                        layer.o_w       = get_tensor(string_format(TN_SAM_ATTN_OUT, il, "weight"));
+                        layer.o_b       = get_tensor(string_format(TN_SAM_ATTN_OUT, il, "bias"));
+                        layer.ln_1_w    = get_tensor(string_format(TN_SAM_PRE_NORM, il, "weight"));
+                        layer.ln_1_b    = get_tensor(string_format(TN_SAM_PRE_NORM, il, "bias"));
+                        layer.ln_2_w    = get_tensor(string_format(TN_SAM_POST_NORM, il, "weight"));
+                        layer.ln_2_b    = get_tensor(string_format(TN_SAM_POST_NORM, il, "bias"));
+                        layer.rel_pos_h = get_tensor(string_format(TN_SAM_ATTN_POS_H, il));
+                        layer.rel_pos_w = get_tensor(string_format(TN_SAM_ATTN_POS_W, il));
+                        layer.ff_up_w   = get_tensor(string_format(TN_SAM_FFN_UP, il, "weight"));
+                        layer.ff_up_b   = get_tensor(string_format(TN_SAM_FFN_UP, il, "bias"));
+                        layer.ff_down_w = get_tensor(string_format(TN_SAM_FFN_DOWN, il, "weight"));
+                        layer.ff_down_b = get_tensor(string_format(TN_SAM_FFN_DOWN, il, "bias"));
+                    }
+                    model.neck_0_w       = get_tensor(string_format(TN_SAM_NECK, 0, "weight"));
+                    model.neck_1_b       = get_tensor(string_format(TN_SAM_NECK, 1, "bias"));
+                    model.neck_1_w       = get_tensor(string_format(TN_SAM_NECK, 1, "weight"));
+                    model.neck_2_w       = get_tensor(string_format(TN_SAM_NECK, 2, "weight"));
+                    model.neck_3_b       = get_tensor(string_format(TN_SAM_NECK, 3, "bias"));
+                    model.neck_3_w       = get_tensor(string_format(TN_SAM_NECK, 3, "weight"));
+                    model.net_2          = get_tensor(string_format(TN_SAM_NET, 2, "weight"));
+                    model.net_3          = get_tensor(string_format(TN_SAM_NET, 3, "weight"));
+                    model.image_newline  = get_tensor(TN_IMAGE_NEWLINE);
+                    model.view_seperator = get_tensor(TN_IMAGE_SEPERATOR);
+                    model.fc_w           = get_tensor(string_format(TN_MM_PROJECTOR, "weight"));
+                    model.fc_b           = get_tensor(string_format(TN_MM_PROJECTOR, "bias"));
+                 } break;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for n_sam_layers in the codebase rg -n "n_sam_layers" tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 204

🏁 Script executed:

# Also search for where n_sam_layers is assigned more broadly rg -n "n_sam_layers\s*=" --type cpp

Repository: ngxson/llama.cpp

Length of output: 154

🏁 Script executed:

# Check the hparams loading section mentioned at lines 1200-1210 sed -n '1200,1210p' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 633

🏁 Script executed:

# Examine lines around 1650 and 1679 to see context sed -n '1640,1680p' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 3451

🏁 Script executed:

# Search for where model.n_sam_layers is assigned in the entire codebase rg -n "\.n_sam_layers\s*=" tools/mtmd/ --type cpp

Repository: ngxson/llama.cpp

Length of output: 42

🏁 Script executed:

# Also check clip-model.h to see the complete struct definition and any initialization grep -n "n_sam_layers" tools/mtmd/clip-model.h -B 2 -A 2

Repository: ngxson/llama.cpp

Length of output: 213

🏁 Script executed:

# Look for hparams.sam_n_layer usage to see where it's assigned to model rg -n "hparams\.sam_n_layer" tools/mtmd/ --type cpp

Repository: ngxson/llama.cpp

Length of output: 247

🏁 Script executed:

# Check exact indentation of line 1679 by looking at raw bytes sed -n '1679p' tools/mtmd/clip.cpp | od -c | head -20

Repository: ngxson/llama.cpp

Length of output: 100

🏁 Script executed:

# Check indentation with cat -A to show all whitespace sed -n '1679p' tools/mtmd/clip.cpp | cat -A

Repository: ngxson/llama.cpp

Length of output: 86

🏁 Script executed:

# Check if hparams.sam_n_layer is ever assigned to model.n_sam_layers rg -n "model\.n_sam_layers\s*=" tools/mtmd/ --type cpp -A 2 -B 2

Repository: ngxson/llama.cpp

Length of output: 42

🏁 Script executed:

# Let's look at the broader context around line 1679 to understand the structure better sed -n '1670,1680p' tools/mtmd/clip.cpp | cat -n

Repository: ngxson/llama.cpp

Length of output: 1017

🏁 Script executed:

# Check how other projector types handle similar assignments to understand the pattern grep -n "hparams\." tools/mtmd/clip.cpp | grep "model\." -A 1 | head -30

Repository: ngxson/llama.cpp

Length of output: 1826

🏁 Script executed:

# Look for where hparams values are transferred to model for other projector types sed -n '1600,1660p' tools/mtmd/clip.cpp | grep -E "(hparams|model\.)" | head -20

Repository: ngxson/llama.cpp

Length of output: 1908

🏁 Script executed:

# Check the exact indentation context - view more lines around 1679 sed -n '1675,1685p' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 766

Assign hparams.sam_n_layer to model.n_sam_layers for DEEPSEEKOCR projector type.

The hparams loading at line 1206 reads sam_n_layer from the GGUF file, but this value is never assigned to model.n_sam_layers. The code at line 1650 then uses model.n_sam_layers (hardcoded default of 12) instead of the actual value from hparams, causing the loop to always resize to 12 layers regardless of what's in the model file.

Also, line 1679 has inconsistent indentation (extra leading space before the closing brace).

coderabbitai · 2025-12-23T00:09:23Z

tools/mtmd/tests.sh

+# Check if the second argument is "flash", then enable flash attention
+# This is useful to test if flash attention off works correctly
+FLASH_ATTN="on"
+if [ "${2:-}" = "flash_off" ] || [ "${1:-}" = "flash_off" ]; then
+    FLASH_ATTN="off"
+    echo "Flash attention disabled..."
+fi


⚠️ Potential issue | 🟡 Minor

Comment is inconsistent with implementation.

The comment on line 31 states "Check if the second argument is 'flash'" but the code on line 34 checks both the first AND second arguments. This inconsistency could confuse future maintainers.

🔎 Proposed fix to align comment with implementation

-# Check if the second argument is "flash", then enable flash attention +# Check if either argument is "flash_off", then disable flash attention # This is useful to test if flash attention off works correctly

🤖 Prompt for AI Agents

In tools/mtmd/tests.sh around lines 31 to 37, the comment incorrectly says it checks only the second argument but the code checks both first and second arguments for "flash_off"; update the comment to accurately describe that both the first and second positional arguments are examined (or adjust the conditional to match the original intent), and keep wording concise e.g. "Check if the first or second argument is 'flash_off', then disable flash attention" so the comment and implementation align.

coderabbitai · 2025-12-23T00:09:23Z

tools/mtmd/tests.sh

    add_test_vision "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M"
    # add_test_vision "ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M" # does not work on my mac M3 Ultra
    # add_test_vision "ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF:Q4_K_M" # not always working
+    add_test_vision "sabafallah/DeepSeek-OCR-GGUF:q8_0" -p "Free OCR." --chat-template deepseek-ocr


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the model with lowercase q8_0 exists via HuggingFace API curl -s "https://huggingface.co/api/models/sabafallah/DeepSeek-OCR-GGUF" | jq -r '.siblings[].rn' 2>/dev/null | head -20

Repository: ngxson/llama.cpp

Length of output: 99

🏁 Script executed:

# Search for deepseek-ocr chat template in the codebase fd -e sh -e py -e json | xargs grep -l "deepseek-ocr" 2>/dev/null | head -20

Repository: ngxson/llama.cpp

Length of output: 117

🏁 Script executed:

# Find and check test validation logic in tests.sh around the OCR tests sed -n '169,175p' tools/mtmd/tests.sh

Repository: ngxson/llama.cpp

Length of output: 315

🏁 Script executed:

# Check all quantization formats used in tests.sh to confirm the pattern grep -o ":[Qq][0-9]_[0-9]" tools/mtmd/tests.sh | sort | uniq -c

Repository: ngxson/llama.cpp

Length of output: 101

Use uppercase Q8_0 for consistency with other quantization formats.

The DeepSeek-OCR model uses lowercase q8_0 while all other models in the test suite use uppercase quantization formats (Q8_0, Q4_0). Repository indexing or model matching may be case-sensitive, causing consistency issues. Change to uppercase Q8_0 to match the pattern used throughout the test suite.

Since this is a mirror PR, please relay this formatting concern to the upstream contributor.

🤖 Prompt for AI Agents

In tools/mtmd/tests.sh around line 111, the model tag uses lowercase "q8_0" while the test suite standard is uppercase quantization labels; update the tag to "Q8_0" for consistency (replace "sabafallah/DeepSeek-OCR-GGUF:q8_0" with "sabafallah/DeepSeek-OCR-GGUF:Q8_0") and also add a short note in the PR description requesting the upstream contributor to use uppercase quantization labels in their model naming to avoid case-sensitive matching issues.

coderabbitai · 2025-12-23T00:09:23Z

tools/mtmd/tests/test-deepseek-ocr.py

+    print(f"Result: {'PASS' if free_ocr_similarity >= args.threshold else 'FAIL'}")
+    print("=" * 60)
+
+    # Markdown OCR results
+    print(f"\nReference Model Markdown output:\n{'-' * 40}")
+    print(expected_md_ocr)
+    print(f"\nDeepSeek-OCR Markdown output:\n{'-' * 40}")
+    print(llama_md_ocr)
+    print(f"\n{'=' * 60}")
+    print(f"Cosine Similarity (Markdown): {md_ocr_similarity:.4f}")
+    print(f"Threshold: {args.threshold}")
+    print(f"Result: {'PASS' if md_ocr_similarity >= args.threshold else 'FAIL'}")
+    print("=" * 60)


⚠️ Potential issue | 🟡 Minor

Test doesn't exit with non-zero code on failure.

For CI integration, the script should exit with a non-zero code when similarity falls below the threshold. Currently, it prints "FAIL" but exits successfully.

🔎 Proposed fix

print(f"Result: {'PASS' if md_ocr_similarity >= args.threshold else 'FAIL'}") print("=" * 60) + + # Exit with non-zero code if any test fails + if free_ocr_similarity < args.threshold or md_ocr_similarity < args.threshold: + sys.exit(1) if __name__ == "__main__": main()

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In tools/mtmd/tests/test-deepseek-ocr.py around lines 170 to 182, the script prints "PASS"/"FAIL" but always exits with code 0; update the end of this section to exit non‑zero on failure by evaluating both free_ocr_similarity and md_ocr_similarity against args.threshold and calling sys.exit(1) if either is below the threshold (sys.exit(0) or no explicit exit if both pass). Ensure sys is imported at the top of the file if not already, and place the exit call after the printed summary so CI receives a failing exit code when similarity checks fail.

sfallah and others added 30 commits November 14, 2025 12:40

mtmd: llama.cpp DeepSeekOCR support

43a130b

init commit

loading sam tensors

b6b9f02

mtmd: fix vision model processing

85c7cda

Merge pull request #1 from bluebread/sf/deepseek-ocr

578c8d7

mtmd: fix vision model processing

deepseek-ocr clip-vit model impl

2aab52e

mtmd: add DeepSeek-OCR LM support with standard attention

eab28ed

mtmd: successfully runs DeepSeek-OCR LM in llama-cli

7630587

mtmd: Fix RoPE type for DeepSeek-OCR LM.

2de3436

Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…

e8b2610

…f/deepseek-ocr

loading LM

97e0907

testing Vision model loading

Merge branch 'sf/deepseek-ocr' into sf/deepseek-ocr

13dc6fb

Merge pull request #2 from bluebread/sf/deepseek-ocr

b32bb5e

mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)

sam warmup working

790bbb9

sam erroneous return corrected

cec9a5c

clip-vit: corrected cls_embd concat

8b3d319

clip-vit: model convert qkv_proj split

1e08157

corrected combining of image encoders' results

331cea8

fix: update callback for ffn_moe_weighted and add callback for attn_o…

6c0715b

…ut in deepseek2 model

Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…

a65ddf5

…f/deepseek-ocr

concat image_newline and image_seperator tokens

63a042f

visual_model warmup (technically) works

89afda8

window partitioning using standard ggml ops

88032f4

Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…

1268dc3

…f/deepseek-ocr

sam implementation without using CPU only ops

68b206b

clip: fixed warnings

8bce66d

Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…

5e6cf3c

…f/deepseek-ocr

mtmd: fix get_rel_pos

7e9fbec

Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into s…

0f5587d

…f/deepseek-ocr

mtmd: fixed the wrong scaler for get_rel_pos

7b8d735

image encoding technically works but the output can't be checked sing…

86f111f

…e image decoding fails

sfallah and others added 23 commits December 13, 2025 10:59

Merge remote-tracking branch 'sfallah/master' into sf/deepseek-ocr-me…

e0e69fd

…rge_#17965 # Conflicts: # src/llama-kv-cache.cpp # tools/mtmd/clip.cpp

quick and (potential) dirty merge with ggml-org#17909

f95a6fe

refactoring, one single builder function and static helpers

f7736f2

added deepseek-ocr test to tests.sh

fb3bb6a

Merge pull request #11 from sfallah/sf/deepseek-ocr-merge_#17965

1b38ccf

Merged with PR ggml-org#17965

minor formatting fixes

6c36c03

check with fixed expected resutls

dc2066e

Merge pull request #10 from sfallah/sf/deepseek-ocr-test-script

3fc61d4

python test script for deepseek-ocr testing OCR on text-1.jpeg newspaper image checking against expected reference model output for Free-OCR and Markdown

minor formatting

7f8621c

Merge remote-tracking branch 'sfallah/master' into sf/deepseek-ocr

b3bf8cb

# Conflicts: # convert_hf_to_gguf.py

editorconfig-check fix

8ad98ee

Merge branch 'ggml-org:master' into sf/deepseek-ocr

4a4f829

Merge remote-tracking branch 'sfallah/master' into sf/deepseek-ocr

51c3de6

# Conflicts: # gguf-py/gguf/constants.py # gguf-py/gguf/tensor_mapping.py # tools/mtmd/clip-impl.h # tools/mtmd/clip.cpp # tools/mtmd/models/models.h

merge with changes from ggml-org#18042

512b2c8

Merge remote-tracking branch 'sfallah/master' into sf/deepseek-ocr

00d2357

# Conflicts: # src/llama-arch.cpp

minor

87e4a00

- added GLM-4.6V to big tests - added missing deps for python test

convert: minor fix

f629d02

mtmd: format code

5a741fd

convert: quick fix

616f009

convert: quick fix

e5d426b

minor python formatting

c739cf2

Merge branch 'master' into sf/deepseek-ocr

9a05e1d

fixed merge build issue

4d91711

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes

github-actions bot added examples ggml python Nvidia GPU model labels Dec 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Mirror] mtmd: Add DeepSeekOCR Support #66

[Mirror] mtmd: Add DeepSeekOCR Support #66

ngxson commented Dec 23, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 23, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

coderabbitai bot Dec 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[Mirror] mtmd: Add DeepSeekOCR Support #66

Are you sure you want to change the base?

[Mirror] mtmd: Add DeepSeekOCR Support #66

Conversation

ngxson commented Dec 23, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ngxson commented Dec 23, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 23, 2025 •

edited

Loading