
Conversation

@ngxson ngxson (Owner) commented Dec 22, 2025

Mirror from upstream PR: ggml-org#18246

Summary by CodeRabbit

  • New Features

    • Added Q3_HIFI quantization type with improved quality and outlier handling
    • Added benchmarking tools for model performance testing
  • Documentation

    • Added comprehensive IMatrix reference guide
    • Added Q3_HIFI quantization analysis with performance recommendations
  • Tests

    • Added Q3_HIFI validation testing framework
  • Chores

    • Updated .gitignore for model artifacts and datasets
    • Added quantization tooling options


Geoff Munn and others added 30 commits November 27, 2025 22:42
  • Refactor quantization logic to handle quant_weights for outlier selection and improve clarity in the quantization process.
  • This document provides a comprehensive comparison of three 3-bit quantization strategies: Q3_HIFI, Q3_K_S, and Q3_K_M. It includes technical specifications, performance benchmarks, and recommendations for production use.
  • This guide provides a comprehensive overview of importance matrix (imatrix) files, including their purpose, generation, usage during quantization, and best practices for effective implementation.
  • Added Q3_HIFI type with quantization function and placeholder for dot product implementation.
  • Updated the comparison of Q3 quantization formats, including detailed descriptions of Q3_HIFI (Pure and Hybrid), Q3_K_S, and Q3_K_M. Added performance benchmarks, recommendations, and updated conclusions based on file size, quality, speed, and memory usage.
  • Implemented NEON-optimized dequantization for Q3_HIFI format, processing values in blocks for efficiency.
  • Added AVX2-optimized dequantization function for Q3_HIFI.
@coderabbitai coderabbitai bot commented Dec 22, 2025

Walkthrough

Introduces Q3_HIFI, a new 3-bit quantization type with 6–8 FP16 outliers per block, implemented across CPU (x86/ARM), GPU (CUDA, Metal, Vulkan, SYCL), quantization infrastructure, and model integration layers. Includes quantization/dequantization kernels, backend dispatch logic, test utilities, and reference documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Core Type Definitions**<br>`ggml/include/ggml.h`, `ggml/src/ggml-common.h` | New enum value GGML_TYPE_Q3_HIFI and block structure block_q3_hifi with Q3_K-compatible layout plus 8 FP16 outlier fields; constants Q3_HIFI_BLOCK_SIZE and Q3_HIFI_OUTLIERS added. |
| **Quantization Reference & Generic**<br>`ggml/src/ggml-quants.c`, `ggml/src/ggml-quants.h`, `ggml/src/ggml.c` | Reference and generic quantization/dequantization paths for Q3_HIFI; type traits registration in main library. |
| **CPU Backend x86 & ARM**<br>`ggml/src/ggml-cpu/quants.c`, `ggml/src/ggml-cpu/quants.h`, `ggml/src/ggml-cpu/arch/x86/quants.c`, `ggml/src/ggml-cpu/arch/arm/quants.c`, `ggml/src/ggml-cpu/ggml-cpu.c`, `ggml/src/ggml-cpu/ops.cpp` | Optimized AVX2 and NEON kernels for Q3_HIFI dot product and dequantization; type trait dispatch updated. |
| **CUDA Backend**<br>`ggml/src/ggml-cuda/common.cuh`, `ggml/src/ggml-cuda/convert.cu`, `ggml/src/ggml-cuda/dequantize.cuh`, `ggml/src/ggml-cuda/ggml-cuda.cu`, `ggml/src/ggml-cuda/mmq.cu`, `ggml/src/ggml-cuda/mmvq.cu`, `ggml/src/ggml-cuda/vecdotq.cuh` | CUDA kernels for Q3_HIFI dequantization and vector dot products with outlier handling; pipeline dispatch updated across MMVQ and dequant paths. |
| **Metal Backend**<br>`ggml/src/ggml-metal/ggml-metal-device.cpp`, `ggml/src/ggml-metal/ggml-metal-impl.h`, `ggml/src/ggml-metal/ggml-metal.metal` | Metal kernels for Q3_HIFI dequantization, mat-vec, and mat-mat operations with FP32/FP16 variants; resource sizing constants and host name bindings. |
| **Vulkan Backend**<br>`ggml/src/ggml-vulkan/ggml-vulkan.cpp`, `ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp`, `ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp`, `ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp` | Vulkan compute shaders for Q3_HIFI dequantization and mat-vec with outlier restoration; dispatcher backend support extended. |
| **SYCL Backend**<br>`ggml/src/ggml-sycl/convert.cpp`, `ggml/src/ggml-sycl/dequantize.hpp`, `ggml/src/ggml-sycl/mmvq.cpp`, `ggml/src/ggml-sycl/vecdotq.hpp` | SYCL implementations for Q3_HIFI dequantization, vector dot product, and matrix operations; dispatch logic integrated. |
| **Library & Model Integration**<br>`include/llama.h`, `src/llama-model-loader.cpp`, `src/llama-quant.cpp`, `gguf-py/gguf/constants.py`, `convert_hf_to_gguf.py` | Enum value LLAMA_FTYPE_MOSTLY_Q3_HIFI, model loader mapping, adaptive quantization logic for sensitive layers, and GGUF constant/type registration. |
| **Tools & Utilities**<br>`tools/quantize/quantize.cpp`, `benchmark_speed_test.ps1` | Q3_HIFI quantization option added to CLI tool; PowerShell benchmark script for multi-model speed testing. |
| **Documentation & Tests**<br>`docs/quantization/Q3_HIFI.md`, `tests/test-q3-hifi.py`, `tests/test-q3-hifi.sh`, `tests/test-q3-hifi-text.txt`, `IMatrix_Guide.md` | Comprehensive Q3_HIFI analysis document, Python/Bash test scripts for perplexity validation, test corpus, and imatrix workflow guide. |
| **Configuration**<br>`.gitignore` | Exclude model artifacts, datasets, and large files from version control. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The changes introduce a new quantization type across 40+ files with backend-specific optimizations (AVX2, NEON, CUDA, Metal, Vulkan, SYCL). While the additions follow existing patterns and maintain modularity, the scope spans multiple architectures with custom kernel implementations, outlier handling logic, and intricate bit-manipulation operations. CPU and GPU kernels require careful validation of correctness, and the adaptive quantization routing in llama-quant.cpp adds conditional logic across multiple code paths.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description is minimal and provides only a reference link to the upstream PR without following the repository's PR template structure. | Expand the description to include details about changes, rationale, testing approach, and any breaking changes, following the repository template guidelines. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.74%, which is insufficient; the required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introduction of a new Q3_HIFI quantization type. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Nitpick comments (21)
.gitignore (1)

140-153: Simplify patterns and relocate to Models section.

The ignore patterns are overly specific and verbose. Additionally, they're placed outside the Models section where similar entries belong:

  1. Lines 140–142: List three specific .raw files. Instead, ignore the entire directory.
  2. Lines 143–153: List 11 individual files under Qwen3-1.7B/. Instead, use a single directory pattern.

These appear to be datasets and models used for benchmarking or testing the Q3_HIFI quantization format.

🔎 Proposed refactor to simplify patterns

Consolidate to the Models section (around line 68–74) and use directory-level patterns:

```diff
 !/models/.editorconfig
 !/models/ggml-vocab-*.gguf*
 !/models/templates
+
+# Q3_HIFI benchmark & test datasets
+/wikitext-2-raw/
+/Qwen3-1.7B/
```

Then remove lines 140–153 entirely.

Clarify whether these datasets are:

  • Generated dynamically at runtime (in which case they shouldn't be in .gitignore at all), or
  • Downloaded/populated by setup scripts (in which case they should be documented in the repository root or build instructions).
benchmark_speed_test.ps1 (3)

15-19: Consider parameterizing hardcoded paths.

The script uses hardcoded relative paths that assume execution from the repository root. This will fail if run from a different directory.

🔎 Suggested refactor to make paths configurable
```diff
 param(
     [int]$Iterations = 100,
     [int]$Threads = 4,
     [int]$Repeats = 3,
     [int]$PromptTokens = 0,
-    [int]$GenerateTokens = 20
+    [int]$GenerateTokens = 20,
+    [string]$LlamaBenchPath = ".\build\bin\Release\llama-bench.exe",
+    [string]$ModelsDir = "."
 )

 $ErrorActionPreference = "Stop"

 # Configuration
-$LlamaBench = ".\build\bin\Release\llama-bench.exe"
+$LlamaBench = $LlamaBenchPath
 $Models = @(
-    @{ Name = "Q3_K_S"; Path = ".\Qwen3-1.7B-f16-Q3_K_S.gguf" },
-    @{ Name = "Q3_K_M"; Path = ".\Qwen3-1.7B-f16-Q3_K_M.gguf" },
-    @{ Name = "Q3_HIFI"; Path = ".\Qwen3-1.7B-f16-Q3_HIFI.gguf" }
+    @{ Name = "Q3_K_S"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_S.gguf" },
+    @{ Name = "Q3_K_M"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_M.gguf" },
+    @{ Name = "Q3_HIFI"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_HIFI.gguf" }
 )
```

76-115: Consider validating parsed speed values.

The script parses speed values from llama-bench output but doesn't validate that the extracted values are positive and reasonable. Invalid or zero values could skew statistics.

🔎 Suggested validation
```diff
                 if ($lineStr -match "tg\d+\s*\|\s*([\d.]+)\s*±\s*([\d.]+)") {
                     $speed = [double]$Matches[1]
-                    [void]$Results[$model.Name].Speeds.Add($speed)
-                    $found = $true
-                    break
+                    if ($speed -gt 0) {
+                        [void]$Results[$model.Name].Speeds.Add($speed)
+                        $found = $true
+                        break
+                    }
                 }
                 # Alternative pattern: just numbers at end of line
                 elseif ($lineStr -match "\|\s*tg\d+\s*\|\s*([\d.]+)") {
                     $speed = [double]$Matches[1]
-                    [void]$Results[$model.Name].Speeds.Add($speed)
-                    $found = $true
-                    break
+                    if ($speed -gt 0) {
+                        [void]$Results[$model.Name].Speeds.Add($speed)
+                        $found = $true
+                        break
+                    }
                 }
```

180-243: Consider adding error summary to report.

Errors are counted and included in the CSV export, but there's no prominent summary in the console output. Users might miss that benchmarks had failures, leading to potentially misleading statistics.

🔎 Suggested addition

Add after line 188:

 Write-Host "Total iterations per model: $Iterations"
+
+# Error summary
+$TotalErrors = ($Results.Values | ForEach-Object { $_.Errors } | Measure-Object -Sum).Sum
+if ($TotalErrors -gt 0) {
+    Write-Host ""
+    Write-Host "WARNING: $TotalErrors total benchmark failures detected" -ForegroundColor Red
+    foreach ($model in $Models) {
+        $errors = $Results[$model.Name].Errors
+        if ($errors -gt 0) {
+            Write-Host "  - $($model.Name): $errors failed runs" -ForegroundColor Yellow
+        }
+    }
+}
 Write-Host ""
tests/test-q3-hifi.sh (1)

102-102: Dependency on bc may cause failures on minimal systems.

The bc utility may not be installed on all systems (e.g., minimal Docker containers, some CI environments). Consider adding a check or using an alternative.

🔎 Option 1: Add dependency check
```diff
+# Check for required commands
+if ! command -v bc &> /dev/null; then
+    echo "Error: 'bc' is required but not installed."
+    exit 1
+fi
+
 # Check if PPL is reasonable (less than threshold)
 if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
```
🔎 Option 2: Use awk instead (more portable)
```diff
 # Check if PPL is reasonable (less than threshold)
-if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
+if awk "BEGIN {exit !($PPL < $PPL_THRESHOLD)}"; then
     echo "✅ Test PASSED: PPL ($PPL) is below threshold ($PPL_THRESHOLD)"
```
ggml/src/ggml-sycl/dequantize.hpp (2)

383-388: Consider performance implications of per-element outlier scanning.

The outlier check loop runs for every element (4 elements per thread × 6 outlier checks = 24 comparisons). While functionally correct, this linear scan could be a bottleneck for GPU execution.

For a future optimization, consider:

  • Pre-sorting outlier indices to enable binary search
  • Using a bitmask or lookup table if Q3_HIFI_OUTLIERS is small and fixed

That said, with only 6 outliers per block and the break statement, this is likely acceptable.
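A minimal sketch of the bitmask idea for a 256-element block (the helper names here are illustrative, not taken from the kernel):

```cpp
#include <cstdint>

// Hypothetical helper, not the actual SYCL kernel code: build a 256-bit
// membership mask from the block's outlier indices once, then test each
// element in O(1) instead of scanning every outlier slot per element.
static void build_outlier_mask(uint64_t mask[4], const uint8_t * outlier_idx, int n_outliers) {
    mask[0] = mask[1] = mask[2] = mask[3] = 0;
    for (int k = 0; k < n_outliers; ++k) {
        const int idx = outlier_idx[k];        // position within the 256-wide block
        mask[idx >> 6] |= 1ull << (idx & 63);  // set bit idx
    }
}

static inline bool is_outlier(const uint64_t mask[4], int idx) {
    return (mask[idx >> 6] >> (idx & 63)) & 1;
}
```

The mask costs one pass over the outlier list per block, after which each of the 256 per-element checks is a shift and an AND.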


415-418: Missing early break in else branch outlier loop.

Unlike the QK_K==256 branch (line 387), this branch doesn't use break after finding a match. While functionally correct (it just overwrites the same value), adding early termination would improve consistency and performance.

🔎 Proposed fix
```diff
     // Check for outliers
     int idx0 = 16*is + il;
     int idx1 = 16*is + il + 32;
+    bool found0 = false, found1 = false;
     for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-        if (x[i].outlier_idx[k] == idx0) val0 = x[i].outlier_vals[k];
-        if (x[i].outlier_idx[k] == idx1) val1 = x[i].outlier_vals[k];
+        if (!found0 && x[i].outlier_idx[k] == idx0) { val0 = x[i].outlier_vals[k]; found0 = true; }
+        if (!found1 && x[i].outlier_idx[k] == idx1) { val1 = x[i].outlier_vals[k]; found1 = true; }
+        if (found0 && found1) break;
     }
```
IMatrix_Guide.md (1)

345-348: Consider qualifying the perplexity improvement claim.

The statement "Typically 5-15% better perplexity" is a helpful ballpark figure, but perplexity improvements can vary significantly based on model architecture, calibration data quality, and other factors.

Consider adding a qualifier such as "results may vary based on model and calibration data" to set appropriate expectations.

ggml/src/ggml-cuda/vecdotq.cuh (1)

775-847: Defensively bound-check outlier_idx[k] before using it.

The Q3_HIFI outlier loop assumes every outlier_idx[k] is a valid index into the 256‑weight block. If any unused slots are left uninitialized or use a sentinel outside [0, QK_K), this can lead to undefined behavior when indexing bq8_1[idx_bq8] / qs[idx_in_bq8].

Given the loop runs for all k < Q3_HIFI_OUTLIERS regardless of how many real outliers exist, it would be safer to explicitly skip out-of-range indices:

Proposed defensive guard for outlier indices
```diff
-#pragma unroll
-    for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-        const int idx = bq3_hifi->outlier_idx[k];
+#pragma unroll
+    for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
+        const int idx = bq3_hifi->outlier_idx[k];
+        // Skip unused/sentinel entries and out-of-range indices
+        if (idx < 0 || idx >= QK_K) {
+            continue;
+        }
```

If the quantization path already guarantees that unused slots have in‑range indices and zero values, this is purely defensive, but it makes the assumptions explicit and hardens the kernel against future layout changes.

include/llama.h (1)

155-158: Public enum change: Q3_HIFI_OLD/UNIFORM removal is a source-level API break

Removing LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD (39) and LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM (40) and only exposing LLAMA_FTYPE_MOSTLY_Q3_HIFI = 41 means any external code that referenced the old names will no longer compile, and older GGUF models using 39/40 will now be reported as “unknown” by llama_model_ftype_name().

Given include/llama.h is the public API surface, it’s worth double‑checking that:

  • Those legacy ftypes were never documented/stabilized, or
  • Downstream users are expected to migrate and this change is communicated in release notes.

If backward compatibility is a concern, one option is to keep the old enumerators (possibly marked as legacy in comments) and explicitly map them at load time to the new Q3_HIFI handling.
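If that route is taken, the shim could be as small as the following sketch (a hypothetical helper; only the enumerator names and values come from this review):

```cpp
// Hypothetical compatibility shim, not current llama.cpp code: keep the
// legacy enumerators and fold them into the new Q3_HIFI handling at load time.
enum llama_ftype_compat {
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD     = 39, // legacy, kept so old GGUFs still resolve
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM = 40, // legacy
    LLAMA_FTYPE_MOSTLY_Q3_HIFI         = 41, // current
};

static llama_ftype_compat normalize_q3_hifi_ftype(llama_ftype_compat ftype) {
    switch (ftype) {
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD:
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM:
            return LLAMA_FTYPE_MOSTLY_Q3_HIFI; // route legacy files to the new handling
        default:
            return ftype;
    }
}
```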

ggml/src/ggml.c (1)

735-742: Q3_HIFI type traits and quantize wiring look consistent with existing formats

The type_traits[GGML_TYPE_Q3_HIFI] entry and the new GGML_TYPE_Q3_HIFI branch in ggml_quantize_chunk are hooked up in the same way as other block formats (matching block type, block size macro, and *_ref quant/dequant helpers), so the core type system and CPU quantization path look correctly extended.

Minor, non‑blocking nit: most other type_name strings use lowercase (e.g. "q3_K"), while "Q3_HIFI" is uppercase; keep as‑is if it’s intentional branding, otherwise consider normalizing for consistency.

Based on learnings, treating this as upstream‑facing feedback rather than a strict change request for the mirror.

Also applies to: 7548-7549

ggml/src/ggml-cuda/dequantize.cuh (1)

79-130: Q3_HIFI CUDA dequantization matches Q3_K layout with sensible outlier handling

The Q3 bit unpacking (qs / hmask math, index ranges, and -4 offset) is consistent with the documented Q3_K layout, and the small unrolled loop over Q3_HIFI_OUTLIERS to restore FP16 outliers is straightforward and cheap on device. From a CUDA side this looks sound.

Two minor follow‑ups you may want to sanity‑check against the CPU reference:

  • Ensure block_q3_hifi’s fields (d, qs, hmask, outlier_idx, outlier_vals) and Q3_HIFI_OUTLIERS exactly match these assumptions (Q3_K‑compatible packing plus a fixed outlier tail).
  • The comments disagree on the outlier count (“6 FP16 outliers” vs “only 8 per 256 weights”); consider aligning them to whatever Q3_HIFI_OUTLIERS is actually set to, just to avoid confusion.
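For reference, the qs/hmask math endorsed above boils down to the following sketch (an illustrative helper rather than the kernel code; the real kernel applies the sub-block and super-block scales on top):

```cpp
#include <cstdint>

// Illustrative only: reconstruct one Q3_K-style 3-bit weight. The low two
// bits live in qs, the high bit in hmask; subtracting 4 when the high bit is
// clear maps the value into the signed range [-4, 3] (equivalently:
// (lo | hi << 2) - 4 with hi as the raw bit).
static inline int unpack_q3(uint8_t qs_byte, int shift, uint8_t hmask_byte, uint8_t m) {
    const int lo  = (qs_byte >> shift) & 3;    // 2-bit field
    const int sub = (hmask_byte & m) ? 0 : 4;  // the "-4 offset" when the high bit is clear
    return lo - sub;                           // signed 3-bit value in [-4, 3]
}
```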
ggml/src/ggml-cpu/quants.c (2)

582-582: Inconsistent FP16 conversion macro.

This line uses GGML_FP16_TO_FP32 while other functions in this file (e.g., ggml_vec_dot_q3_K_q8_K_generic at line 549) use GGML_CPU_FP16_TO_FP32. Use the CPU-specific macro for consistency.

🔎 Suggested fix
```diff
-        const float d = GGML_FP16_TO_FP32(xb->d) * yb->d;
+        const float d = GGML_CPU_FP16_TO_FP32(xb->d) * yb->d;
```

631-638: Same FP16 conversion inconsistency in outlier corrections.

These lines also use GGML_FP16_TO_FP32 instead of GGML_CPU_FP16_TO_FP32. Apply the same fix throughout the function.

🔎 Suggested fix
```diff
-        total_sum += GGML_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
```
ggml/src/ggml-sycl/convert.cpp (1)

117-147: SYCL Q3_HIFI wiring matches existing Q3_K path; align the outlier-count comment.

The new dequantize_row_q3_hifi_sycl wrapper and its use in both ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl mirror the existing q3_K implementation and look correct.

One minor nit: the comment at Line 117 says “Q3_HIFI: Q3_K-compatible layout with 6 FP16 outliers”, while the shared layout (via block_q3_hifi/Q3_HIFI_OUTLIERS) and Python constants describe Q3_HIFI in terms of the generic Q3_HIFI_OUTLIERS count (documented as 8 outliers in other places). To avoid confusion if the constant ever changes, consider rephrasing the comment to reference Q3_HIFI_OUTLIERS or to match the canonical outlier count used in the block definition.

Also applies to: 574-576, 640-642

ggml/src/ggml-cuda/convert.cu (1)

555-566: Consider distributing outlier restoration across threads.

Currently, only thread 0 handles all outliers while 63 threads idle after the barrier. For 6-8 outliers this overhead is minimal, but if Q3_HIFI_OUTLIERS increases, consider having multiple threads participate:

```diff
     // Synchronize before overwriting outliers
     __syncthreads();

-    // Thread 0 handles outlier restoration
-    if (threadIdx.x == 0) {
-        dst_t * yb = yy + i*QK_K;
-        #pragma unroll
-        for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-            const int idx = x[i].outlier_idx[k];
-            yb[idx] = __half2float(x[i].outlier_vals[k]);
-        }
-    }
+    // Distribute outlier restoration across threads
+    if (threadIdx.x < Q3_HIFI_OUTLIERS) {
+        dst_t * yb = yy + i*QK_K;
+        const int k = threadIdx.x;
+        const int idx = x[i].outlier_idx[k];
+        yb[idx] = __half2float(x[i].outlier_vals[k]);
+    }
```

This avoids thread divergence and utilizes available parallelism.

tests/test-q3-hifi.py (1)

140-145: Consider using logging.exception() in except blocks for better diagnostics.

Per static analysis hint, logging.exception() automatically includes the stack trace, which can aid debugging. However, for straightforward errors like these, the current approach is acceptable.

Optional improvement
```diff
     try:
         perplexity_exe = find_executable("llama-perplexity", build_dir)
     except FileNotFoundError as e:
-        logging.error("Error: %s", e)
+        logging.exception("Could not find perplexity executable")
         logging.info("Make sure you've built llama.cpp first.")
         return 1
```
ggml/src/ggml-cpu/arch/arm/quants.c (1)

4219-4219: Use symbolic constant instead of magic number 96.

The hardcoded constant 96 at line 4219 should be replaced with (Q3_HIFI_BLOCK_SIZE * 3 / 8) for clarity and maintainability. The value 96 represents the number of bytes needed to store 256 3-bit values.

🔎 Proposed refactor
```diff
-                if (bit_offset > 5 && byte_idx + 1 < 96) {
+                if (bit_offset > 5 && byte_idx + 1 < (Q3_HIFI_BLOCK_SIZE * 3 / 8)) {
                     bits |= (qs[byte_idx + 1] << (8 - bit_offset)) & 7;
                 }
```
ggml/src/ggml-quants.c (3)

1398-1401: Layout compatibility assumption lacks compile-time verification.

The cast from block_q3_hifi* to block_q3_K* relies on the comment at lines 1399-1400 stating "first 110 bytes are identical layout." Consider adding a compile-time assertion in the header to catch layout drift:

```c
static_assert(offsetof(block_q3_hifi, hmask) == offsetof(block_q3_K, hmask) &&
              offsetof(block_q3_hifi, qs) == offsetof(block_q3_K, qs) &&
              offsetof(block_q3_hifi, scales) == offsetof(block_q3_K, scales) &&
              offsetof(block_q3_hifi, d) == offsetof(block_q3_K, d),
              "block_q3_hifi must be layout-compatible with block_q3_K");
```

5148-5151: Empty section - appears to be vestigial.

These lines add a comment block but no implementation follows. If this was intended as a placeholder for future code, consider removing it or adding the intended content.


5463-5466: Validation only checks d field, not outlier FP16 values.

The VALIDATE_ROW_DATA_D_F16_IMPL macro validates the block's d field, but block_q3_hifi also contains outlier_vals[Q3_HIFI_OUTLIERS] which are FP16 values that could contain NaN/Inf. Consider extending validation:

```diff
 case GGML_TYPE_Q3_HIFI:
     {
-        VALIDATE_ROW_DATA_D_F16_IMPL(block_q3_hifi, data, nb);
+        const block_q3_hifi * q = (const block_q3_hifi *) data;
+        for (size_t i = 0; i < nb; ++i) {
+            if (!validate_fp16(q[i].d, i)) {
+                return false;
+            }
+            for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
+                if (!validate_fp16(q[i].outlier_vals[k], i)) {
+                    return false;
+                }
+            }
+        }
     } break;
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfc959b and e4fd98f.

⛔ Files ignored due to path filters (3)
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl is excluded by !**/*.glsl
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.glsl is excluded by !**/*.glsl
  • ggml/src/ggml-vulkan/vulkan-shaders/types.glsl is excluded by !**/*.glsl
📒 Files selected for processing (43)
  • .gitignore
  • IMatrix_Guide.md
  • benchmark_speed_test.ps1
  • convert_hf_to_gguf.py
  • docs/quantization/Q3_HIFI.md
  • ggml/include/ggml.h
  • ggml/src/ggml-common.h
  • ggml/src/ggml-cpu/arch/arm/quants.c
  • ggml/src/ggml-cpu/arch/x86/quants.c
  • ggml/src/ggml-cpu/ggml-cpu.c
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-cpu/quants.c
  • ggml/src/ggml-cpu/quants.h
  • ggml/src/ggml-cuda/common.cuh
  • ggml/src/ggml-cuda/convert.cu
  • ggml/src/ggml-cuda/dequantize.cuh
  • ggml/src/ggml-cuda/ggml-cuda.cu
  • ggml/src/ggml-cuda/mmq.cu
  • ggml/src/ggml-cuda/mmvq.cu
  • ggml/src/ggml-cuda/vecdotq.cuh
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • ggml/src/ggml-metal/ggml-metal.metal
  • ggml/src/ggml-quants.c
  • ggml/src/ggml-quants.h
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml.c
  • gguf-py/gguf/constants.py
  • include/llama.h
  • src/llama-model-loader.cpp
  • src/llama-quant.cpp
  • tests/test-q3-hifi-text.txt
  • tests/test-q3-hifi.py
  • tests/test-q3-hifi.sh
  • tools/quantize/quantize.cpp
🧰 Additional context used
📓 Path-based instructions (6)
.gitignore

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration

Files:

  • .gitignore
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • src/llama-model-loader.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • include/llama.h
  • ggml/include/ggml.h
  • tools/quantize/quantize.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-common.h
  • ggml/src/ggml-quants.h
  • src/llama-quant.cpp
  • ggml/src/ggml-cpu/quants.h
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • src/llama-model-loader.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • include/llama.h
  • ggml/include/ggml.h
  • gguf-py/gguf/constants.py
  • tools/quantize/quantize.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-common.h
  • ggml/src/ggml-quants.h
  • src/llama-quant.cpp
  • ggml/src/ggml-cpu/quants.h
  • tests/test-q3-hifi.py
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • convert_hf_to_gguf.py
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/llama-model-loader.cpp
  • src/llama-quant.cpp
include/llama.h

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Exercise careful consideration when making changes to the public API in include/llama.h, as API stability is critical

Files:

  • include/llama.h
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/constants.py
  • tests/test-q3-hifi.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to .gitignore : Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration

Applied to files:

  • .gitignore
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • src/llama-model-loader.cpp
  • include/llama.h
  • src/llama-quant.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical

Applied to files:

  • include/llama.h
🧬 Code graph analysis (9)
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (2)
common/common.cpp (2)
  • string_ends_with (459-461)
  • string_ends_with (459-459)
common/common.h (1)
  • string_starts_with (627-630)
ggml/src/ggml-sycl/convert.cpp (1)
ggml/src/ggml-sycl/dequantize.hpp (3)
  • vx (18-18)
  • dequantize_block_q3_hifi (350-423)
  • dequantize_block_q3_hifi (350-351)
ggml/src/ggml-quants.h (2)
ggml/src/ggml-quants.c (3)
  • quantize_row_q3_hifi_ref (1281-1333)
  • dequantize_row_q3_hifi (1390-1409)
  • quantize_q3_hifi (1411-1424)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
  • dequantize_row_q3_hifi (4197-4256)
ggml/src/ggml-cuda/mmvq.cu (1)
ggml/src/ggml-sycl/vecdotq.hpp (1)
  • vec_dot_q3_hifi_q8_1 (805-806)
ggml/src/ggml-cpu/quants.h (3)
ggml/src/ggml-cpu/quants.c (2)
  • quantize_row_q3_hifi (69-73)
  • ggml_vec_dot_q3_hifi_q8_K_generic (558-642)
ggml/src/ggml-cpu/arch/x86/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K (2336-2485)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K (2049-2187)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
ggml/src/ggml-cpu/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K_generic (558-642)
ggml/src/ggml-quants.c (1)
ggml/src/ggml.c (1)
  • ggml_row_size (1281-1284)
ggml/src/ggml-cpu/quants.c (1)
ggml/src/ggml-quants.c (1)
  • quantize_row_q3_hifi_ref (1281-1333)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
  • LlamaFileType (3233-3275)
🪛 LanguageTool
IMatrix_Guide.md

[style] ~361-~361: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...--no-ppl` ### Problem: IMatrix file is very large Solutions: - This is normal (can b...

(EN_WEAK_ADJECTIVE)

tests/test-q3-hifi-text.txt

[style] ~26-~26: Consider using a synonym to be more concise.
Context: ...startups have made significant progress in recent years. In 2019, Google announced quantum supr...

(IN_RECENT_STYLE)

🪛 markdownlint-cli2 (0.18.1)
docs/quantization/Q3_HIFI.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
tests/test-q3-hifi.py

96-96: Avoid specifying long messages outside the exception class

(TRY003)


102-102: subprocess call: check for execution of untrusted input

(S603)


122-122: Avoid specifying long messages outside the exception class

(TRY003)


143-143: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


181-181: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (56)
benchmark_speed_test.ps1 (2)

22-33: LGTM!

Good defensive validation to fail fast if required files are missing.


268-296: LGTM!

Export logic properly handles both summary statistics (CSV) and raw data (JSON) with timestamped filenames to prevent overwrites.

ggml/src/ggml-metal/ggml-metal-impl.h (1)

35-37: LGTM!

The new constants N_R0_Q3_HIFI and N_SG_Q3_HIFI are correctly positioned after the Q3_K definitions and follow the existing naming conventions. The values (2, 2) match Q3_K, which makes sense given Q3_HIFI is described as Q3_K-compatible.

ggml/include/ggml.h (2)

379-381: Clear documentation for the new type.

The comments clearly document that Q3_HIFI uses a Q3_K-compatible layout with 6 FP16 outliers, and that the block structure is defined in ggml-common.h for GPU backend compatibility. This is helpful for developers working with this format.


428-429: Correctly added at the end of the enum.

The new GGML_TYPE_Q3_HIFI = 40 is properly added at the end of the enum as recommended by the comment on line 386, maintaining backward compatibility. GGML_TYPE_COUNT is correctly incremented to 41.

Note: The ggml_ftype enum (lines 439-465) doesn't have a corresponding GGML_FTYPE_MOSTLY_Q3_HIFI entry. If Q3_HIFI models are intended to be saved as GGUF files with a file type indicator, you may need to add this entry. However, this might be handled elsewhere (e.g., in llama.cpp's file type mappings).

ggml/src/ggml-sycl/dequantize.hpp (1)

348-423: LGTM - Q3_HIFI dequantization implementation.

The implementation correctly follows the Q3_K dequantization pattern and adds outlier restoration logic. The two code paths for QK_K == 256 and other values are consistent with the existing Q3_K implementation.

IMatrix_Guide.md (1)

1-425: Excellent documentation for imatrix workflow.

This is a well-structured and comprehensive guide covering the complete imatrix workflow. The sections on Q3_HIFI integration (lines 324-349) are particularly valuable for users of the new quantization format.

ggml/src/ggml-cuda/mmq.cu (1)

254-258: Comment-only clarification about Q3_HIFI routing looks correct.

Annotating that Q3_HIFI is excluded from the MMQ path and handled via MMVQ/dequant keeps the control-flow intent clear without changing behavior.

ggml/src/ggml-cuda/ggml-cuda.cu (1)

4375-4401: Q3_HIFI enablement for CUDA matmul is consistent with other K-quant types.

Including GGML_TYPE_Q3_HIFI in the supported MUL_MAT / MUL_MAT_ID weight types matches how other Q3/Q4_K variants are handled and aligns with the dedicated MMVQ/dequant path for Q3_HIFI.

Please make sure end‑to‑end CUDA tests cover Q3_HIFI matmuls (both plain and _ID) so this wiring is exercised across shapes/backends.

tools/quantize/quantize.cpp (1)

46-47: New Q3_HIFI quant option wiring looks correct.

The QUANT_OPTIONS entry for "Q3_HIFI" cleanly maps to LLAMA_FTYPE_MOSTLY_Q3_HIFI and matches the descriptive intent of an adaptive Q3_HIFI/Q3_K/Q4_K mix. Just keep the bpw/behavior string aligned with the Q3_HIFI documentation as upstream numbers evolve.

ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (1)

45-70: Q3_HIFI wiring into Vulkan shader generator matches existing per-type patterns

Adding "q3_hifi" to type_names and routing it through the specialized mul_mat_vec_q3_hifi.comp path mirrors how _k and iq* types are handled. It should participate correctly in dequant, matmul, and arr_dmmv_* generation without affecting existing types.

Also applies to: 669-679

ggml/src/ggml-metal/ggml-metal-device.cpp (1)

637-646: Metal mul_mv/mul_mv_id correctly extend dispatch to Q3_HIFI

The new GGML_TYPE_Q3_HIFI cases mirror the existing GGML_TYPE_Q3_K mapping (setting nsg/nr0 via N_SG_Q3_HIFI/N_R0_Q3_HIFI) and leave smem at the default, which matches nearby patterns. As long as these constants and kernels are defined consistently in the Metal backend, this wiring looks correct.

Also applies to: 855-863

src/llama-model-loader.cpp (1)

29-64: Loader/display integration for LLAMA_FTYPE_MOSTLY_Q3_HIFI is coherent

The new LLAMA_FTYPE_MOSTLY_Q3_HIFI case in both llama_model_ftype_name() and the type_maxftype switch keeps the guessed file type and the human‑readable label in sync with GGML_TYPE_Q3_HIFI. Since GGUF’s general.file_type can still override the guess, this is a safe, backwards‑compatible extension.

Also applies to: 641-672

ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp (1)

10-51: IQ1_S Vulkan mat‑vec shader changes are cosmetic/optimization-only

The added [[unroll]] attributes and minor reshuffling of temporaries in calc_superblock() preserve the original math (same bit patterns, fma/dot usage, and accumulation into temp[j][n]). Functionally this should be a no-op, with potential small perf wins depending on the driver.

ggml/src/ggml-sycl/mmvq.cpp (1)

718-739: SYCL MMVQ: Q3_HIFI mat‑vec path is consistent with existing _K implementations

mul_mat_vec_q3_hifi_q8_1_sycl() reuses the generic mul_mat_vec_q with QK_K/QI3_K, a dedicated block_q3_hifi, and vec_dot_q3_hifi_q8_1, and the launcher mirrors the other *_K variants. The dispatcher’s new GGML_TYPE_Q3_HIFI case correctly routes Q3_HIFI tensors here, so the SYCL backend should now handle this quantization type end‑to‑end.

Also applies to: 1099-1101

ggml/src/ggml-cuda/common.cuh (1)

828-833: Q3_HIFI CUDA traits correctly mirror Q3_K geometry

ggml_cuda_type_traits<GGML_TYPE_Q3_HIFI> reuses QK_K/QR3_K/QI3_K, which keeps the CUDA block geometry aligned with existing Q3_K kernels and matches the “Q3_K‑compatible” layout intent. No issues from this specialization alone.

ggml/src/ggml-cpu/ggml-cpu.c (1)

282-287: LGTM! Q3_HIFI type trait correctly registered.

The new type trait entry for GGML_TYPE_Q3_HIFI follows the established pattern and correctly references the quantization and vector dot-product functions. The configuration (using GGML_TYPE_Q8_K as vec_dot_type and nrows = 1) is consistent with the similar Q3_K type.

ggml/src/ggml-cuda/mmvq.cu (3)

20-20: LGTM! Q3_HIFI vector dot dispatch added correctly.

The dispatch for vec_dot_q3_hifi_q8_1 is properly placed and follows the established pattern for other quantization types.


47-47: Verify that Q3_HIFI can safely reuse Q3_K's VDR constant.

Line 47 returns VDR_Q3_K_Q8_1_MMVQ for GGML_TYPE_Q3_HIFI with a comment indicating it's the same as Q3_K. While this may be intentional, please confirm that Q3_HIFI's block structure and quantization layout are compatible with Q3_K's vector dot reduction parameter. If they differ structurally (e.g., different numbers of outliers or quant layouts), a dedicated VDR_Q3_HIFI_Q8_1_MMVQ constant might be needed.

Based on learnings, please note: This is a mirrored PR, so any issues should be reported to ngxson to relay to the upstream contributor.


529-534: LGTM! Q3_HIFI switch case properly implemented.

The switch case for GGML_TYPE_Q3_HIFI correctly follows the established pattern and will properly dispatch to the Q3_HIFI-specific matrix-vector multiplication path.

ggml/src/ggml-cpu/ops.cpp (7)

667-692: Add: Q3_HIFI wired into quantized add path correctly

Including GGML_TYPE_Q3_HIFI in the quantized branch reuses ggml_compute_forward_add_q_f32, aligned with other Q* and IQ* formats; no issues spotted.


1116-1142: Add1: Q3_HIFI shares the existing quantized scalar-add implementation

GGML_TYPE_Q3_HIFI is dispatched to ggml_compute_forward_add1_q_f32 alongside other quantized types, which is consistent with the existing design.


1244-1273: Acc: Q3_HIFI explicitly marked unsupported like other quant/low-prec types

Adding GGML_TYPE_Q3_HIFI into the grouped cases that end in GGML_ABORT("fatal error") keeps behavior consistent with other non‑F32 types for ggml_compute_forward_acc.


4269-4295: Out_prod: Q3_HIFI routed through quantized out_prod_q_f32

Dispatching GGML_TYPE_Q3_HIFI to ggml_compute_forward_out_prod_q_f32 mirrors handling of Q3_K and other quantized types and relies on the new to_float trait; looks correct.


4544-4572: Set: Q3_HIFI added to unsupported quantized/set group

GGML_TYPE_Q3_HIFI is listed among the quantized types that currently cause a fatal abort for ggml_compute_forward_set, matching existing behavior for other q-types.


4767-4794: Get_rows: Q3_HIFI enabled for quantized get_rows_q dequantization

Including GGML_TYPE_Q3_HIFI in the quantized branch so it uses ggml_compute_forward_get_rows_q is consistent with the rest of the quantized formats.


5493-5527: Clamp: Q3_HIFI explicitly treated as unsupported in clamp switch

Adding GGML_TYPE_Q3_HIFI to the clamp switch’s quantized/unsupported group ensures it’s rejected the same way as other non‑F32/F16 types.

tests/test-q3-hifi-text.txt (1)

1-46: LGTM - Good test data variety.

The combination of simple narrative (children's story) and complex technical content (quantum computing exposition) provides a good range of text complexity for perplexity validation testing of the Q3_HIFI quantization format.

ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp (1)

41-56: LGTM - Outlier handling is GPU-appropriate.

The fully unrolled outlier check without early-exit is the correct approach for GPU shaders to avoid divergent control flow. The Q3_K dequantization logic followed by outlier substitution matches the CPU implementation pattern.
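In scalar form, the divergence-free pattern being described looks something like this sketch (illustrative C++, not the shader itself):

```cpp
// Illustrative scalar version of the shader's unrolled, early-exit-free
// outlier substitution: every lane evaluates all outlier slots and selects,
// so all invocations execute the same instruction sequence.
float substitute_outliers(float v, int my_idx,
                          const int   outlier_idx[],  // positions within the block
                          const float outlier_vals[], // FP16 values, widened here
                          int n_outliers) {
    for (int k = 0; k < n_outliers; ++k) {
        v = (outlier_idx[k] == my_idx) ? outlier_vals[k] : v; // select, don't branch
    }
    return v;
}
```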

ggml/src/ggml-cpu/quants.c (2)

69-73: LGTM!

The quantize_row_q3_hifi wrapper follows the established pattern of other quantization functions in this file, correctly delegating to the reference implementation after validating block alignment.


556-564: LGTM - Function structure follows established patterns.

The generic Q3_HIFI dot-product implementation correctly follows the Q3_K pattern with appropriate assertions, setup, and the addition of outlier corrections. The unrolled outlier handling is a reasonable performance choice for the generic fallback.

ggml/src/ggml-sycl/vecdotq.hpp (1)

801-855: Outlier-aware Q3_HIFI dot path looks consistent; please double-check the per-thread outlier gating.

The structure of vec_dot_q3_hifi_q8_1 is good: it reuses the Q3_K bulk kernel and then adds an explicit outlier correction term based on outlier_idx/outlier_vals, which is exactly what we want for Q3_HIFI.

The only part that’s hard to validate locally is the per-thread gating logic for outliers:

```cpp
const int idx_bq8 = idx / QK8_1;
const int idx_in_bq8 = idx % QK8_1;
...
if (idx_bq8 >= bq8_offset && idx_bq8 < bq8_offset + QR3_K) {
    const int thread_q8_offset = iqs % QI8_1;
    const int pos_in_q8_group = idx_in_bq8 / 4;
    if (pos_in_q8_group == thread_q8_offset) {
        ...
        sum += outlier_val * q8_val * d8_val;
    }
}
```

Given how iqs, QR3_K, QI3_K, QK8_1, and QI8_1 interact in the MMVQ kernels, it would be good to confirm that:

  • Every valid outlier index is visited by exactly one (block, iqs) thread.
  • No outlier is silently skipped because pos_in_q8_group / thread_q8_offset don’t line up with the actual packing in bq8_1[*].qs.

You can verify this by diffing this function against the upstream CUDA vec_dot_q3_hifi_q8_1 implementation and by running the Q3_HIFI tests/benchmarks from the upstream PR (including any SYCL-specific tests) to ensure dot products match the reference CPU path across a range of shapes and random seeds. If there is any discrepancy, the bug is likely in the pos_in_q8_group vs thread_q8_offset mapping.

gguf-py/gguf/constants.py (1)

3220-3220: Q3_HIFI enum and size wiring look consistent; keep them in sync with ggml headers.

The additions

  • GGMLQuantizationType.Q3_HIFI = 41
  • LlamaFileType.MOSTLY_Q3_HIFI = 41
  • GGML_QUANT_SIZES[Q3_HIFI] = (256, 134) # Q3_K (110 bytes) + outlier_idx[8] + outlier_vals[16]

are internally consistent and match the documented Q3_HIFI layout (Q3_K base + 8 outliers).
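As a quick arithmetic cross-check of the 134-byte figure (assuming the standard Q3_K base of hmask[32] + qs[64] + scales[12] + one fp16 d):

```cpp
// Sanity check of the (256, 134) entry: Q3_K base plus the 8-outlier tail.
constexpr int q3_k_base    = 32 + 64 + 12 + 2; // hmask + qs + scales + fp16 d = 110
constexpr int outlier_idx  = 8 * 1;            // 8 uint8 indices
constexpr int outlier_vals = 8 * 2;            // 8 fp16 values
static_assert(q3_k_base + outlier_idx + outlier_vals == 134, "Q3_HIFI block size");
```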

The only follow-up I’d suggest is to double-check that:

  • GGML_TYPE_Q3_HIFI in ggml.h also uses id 41, and
  • block_q3_hifi’s sizeof matches 134 bytes,

to avoid silent GGUF/C-side drift when upstream tweaks the layout.

You can confirm this quickly by checking the upstream ggml.h and ggml-common.h for the GGML_TYPE_Q3_HIFI enum value and the definition of block_q3_hifi, and ensuring the size expression evaluates to 134 for QK_K = 256. Based on learnings, please relay any mismatch upstream via ngxson.

Also applies to: 3272-3274, 3370-3371

ggml/src/ggml-quants.h (1)

33-34: Q3_HIFI quant/dequant API surface is consistent with existing K-quants.

The new declarations for

  • quantize_row_q3_hifi_ref(...)
  • dequantize_row_q3_hifi(...)
  • quantize_q3_hifi(...)

are shaped and placed consistently with the existing K-quant APIs (q3_K, q4_K, etc.), which keeps the CPU backend and quantization dispatcher uniform.

As long as the implementations in ggml-quants.c and the arch-specific backends (AVX2, NEON, etc.) match these signatures, this header change looks good.

If you haven’t already, it’s worth grepping for quantize_row_q3_hifi_ref, dequantize_row_q3_hifi, and quantize_q3_hifi across the repo to ensure:

  • All declarations and definitions agree on parameter types and order.
  • No backend (e.g. CUDA/Metal/SYCL callers) still assumes only the older Q3_K entry points.

Also applies to: 106-108
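If the Q3_HIFI entry points mirror the q3_K ones, the expected shapes are roughly as follows (a hedged reconstruction from the K-quant pattern, not copied from the header):

```cpp
#include <cstddef>
#include <cstdint>

struct block_q3_hifi; // opaque here; the real definition lives in ggml-common.h

// Expected API shapes by analogy with the q3_K declarations; qualifiers such
// as GGML_RESTRICT are omitted and the actual header is authoritative.
void   quantize_row_q3_hifi_ref(const float * x, block_q3_hifi * y, int64_t k);
void   dequantize_row_q3_hifi(const block_q3_hifi * x, float * y, int64_t k);
size_t quantize_q3_hifi(const float * src, void * dst,
                        int64_t nrows, int64_t n_per_row, const float * quant_weights);
```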

ggml/src/ggml-cuda/convert.cu (2)

569-573: LGTM!

The launcher follows the established Q3_K pattern correctly.


732-733: LGTM!

Dispatch integration follows the established pattern and is correctly placed after Q3_K.

Also applies to: 785-786

ggml/src/ggml-cpu/arch/x86/quants.c (4)

2334-2351: LGTM!

Function signature and setup follow the established Q3_K pattern correctly.


2353-2455: LGTM!

The AVX2 implementation correctly mirrors the Q3_K kernel while properly handling the ggml_fp16_t scale field specific to Q3_HIFI.


2481-2484: LGTM!

The fallback to the generic implementation follows the established pattern.


3974-3975: The comment references a non-existent function.

The function dequantize_row_q3_hifi does not exist. The correct function name for Q3_K dequantization is dequantize_row_q3_K, not dequantize_row_q3_hifi. Update the comment to reference the correct function name.

Likely an incorrect or invalid review comment.

ggml/src/ggml-common.h (1)

291-306: Well-designed struct with Q3_K compatibility.

The approach of keeping the first 110 bytes identical to block_q3_K and appending the outlier extension is sound for kernel reuse. The compile-time static_assert provides good validation of the expected layout. The "DO NOT REORDER" comment is helpful for future maintainers.
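Under those constraints, the layout presumably looks like the following sketch (field sizes inferred from the 110-byte Q3_K base plus the 8-outlier tail; the authoritative definition is in ggml-common.h):

```cpp
#include <cstdint>
using ggml_half = uint16_t; // stand-in for the real FP16 typedef

// Sketch of the described block_q3_hifi layout; do not treat as the real
// definition. The first 110 bytes mirror block_q3_K, the outlier tail follows.
struct block_q3_hifi_sketch {
    uint8_t   hmask[32];       // high bits of the 3-bit quants (256/8)
    uint8_t   qs[64];          // low 2 bits of the quants (256/4)
    uint8_t   scales[12];      // packed 6-bit sub-block scales
    ggml_half d;               // super-block scale
    uint8_t   outlier_idx[8];  // positions of the FP16 outliers
    ggml_half outlier_vals[8]; // the FP16 outlier values
};
static_assert(sizeof(block_q3_hifi_sketch) == 134, "expected 110 + 8 + 16 bytes");
```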

tests/test-q3-hifi.py (1)

161-175: Sensible test configuration for quick validation.

Using -c 256 and --chunks 2 keeps the test fast for CI while still validating that the Q3_HIFI format works correctly. Combining stdout/stderr ensures PPL extraction works regardless of where the output goes.

ggml/src/ggml-cpu/arch/arm/quants.c (2)

2158-2173: Update comment to reflect 8 outliers, not 6.

Line 2158 incorrectly states "fully unrolled for 6 outliers" while the code processes 8 (indices 0-7 at lines 2166-2173), matching the Q3_HIFI_OUTLIERS constant. Also verify the macro inconsistency: the main loop uses GGML_CPU_FP16_TO_FP32 (line 2086) while outlier correction uses GGML_FP16_TO_FP32 (lines 2166-2173)—both within the same ARM NEON function, suggesting a potential copy-paste error.


4195-4257: The ARM NEON implementation already has fallback coverage. The generic implementation in ggml/src/ggml-quants.c is compiled into the main library and automatically available on all architectures, including non-NEON ARM systems. The absence of an explicit #else clause in the ARM-specific file is not problematic for linking. However, adding a comment clarifying that the generic implementation provides the fallback would improve code clarity.

ggml/src/ggml-quants.c (3)

1281-1333: LGTM - Reference quantization implementation is sound.

The outlier selection using magnitude-based ranking and masking with -1.0f works correctly since fabsf() ensures non-negative values. The block-level processing and field copying are appropriate.


1371-1373: Bulk quantization uses _ref instead of _impl - verify intentional.

The weighted magnitude at line 1347 affects outlier selection, but quantize_row_q3_K_ref (not _impl) is called for bulk quantization. This means quant_weights don't influence the Q3_K portion. If weights should affect bulk quantization quality, consider:

```diff
-        quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
+        // Note: weights only affect outlier selection, not bulk Q3_K quantization
+        quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
```

Or use quantize_row_q3_K_impl if bulk should be weight-aware.


1411-1424: LGTM - Entry point follows established patterns.

The routing logic and row-by-row processing match other quantize_q*_K functions in the file.

ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp (3)

24-31: Barrier placement in non-uniform execution path.

The barrier at line 27 is inside a conditional block (if (!all_threads)). When all_threads is false, some invocations may skip the barrier while others execute it, which is undefined behavior in GLSL. However, looking at the logic:

  • When !all_threads, barrier is at line 27
  • When all_threads, barrier is at line 48

This appears correct since the condition is uniform across the workgroup (determined by loop bounds). The early continue at line 30 is after the barrier, which is safe.

Please verify that all_threads is indeed uniform across all invocations in the workgroup, as non-uniform barrier execution would cause undefined behavior.


81-122: LGTM - Standard mat-vec computation structure.

The work partitioning between full iterations (all_threads=true) and partial remainder (all_threads=false) correctly handles non-power-of-2 block counts.


124-135: LGTM - Boundary handling is correct.

The shader properly handles partial workgroups at the tensor boundary by computing the remaining rows when first_row + NUM_ROWS > p.stride_d.

ggml/src/ggml-vulkan/ggml-vulkan.cpp (3)

3583-3583: LGTM!

The pipeline creation for Q3_HIFI mul_mat_vec operations follows the established K-quant pattern correctly, using the appropriate reduc16 variants and workgroup configurations consistent with Q3_K.

Also applies to: 3608-3608


3705-3705: LGTM!

The dequantization and get_rows pipeline registrations for Q3_HIFI are correctly added with appropriate workgroup sizes matching the K-quant pattern. The code references the expected shader data arrays (dequant_q3_hifi_*, get_rows_q3_hifi_*) which should be generated from the corresponding shader files.

Also applies to: 3731-3731, 3757-3757


5407-5407: LGTM!

All switch statement additions correctly include GGML_TYPE_Q3_HIFI in the fallthrough blocks alongside other K-quant types. This ensures proper operator support validation and pipeline selection for the new quantization type. The ordering is consistent (always placed after Q3_K) across all locations.

Also applies to: 5479-5479, 5543-5543, 5634-5634, 5701-5701, 13850-13850, 13971-13971

ggml/src/ggml-metal/ggml-metal.metal (3)

9693-9693: Q3_HIFI get_rows wiring looks consistent but depends on dequantize_q3_hifi correctness

Registering kernel_get_rows_q3_hifi as kernel_get_rows_q<block_q3_hifi, QK_NL, dequantize_q3_hifi> mirrors the existing Q3_K path. Once dequantize_q3_hifi is fixed/confirmed, this instantiation should behave correctly and does not need further changes.


9756-9757: kernel_mul_mm_q3_hifi_{f32,f16} instantiations correctly mirror Q3_K path

The new mat‑mat kernels for Q3_HIFI (kernel_mul_mm_q3_hifi_f32 and _f16) hook into the generic kernel_mul_mm template using block_q3_hifi and dequantize_q3_hifi, consistent with the existing Q3_K registrations. Provided dequantize_q3_hifi matches the reference semantics, these registrations look good.

Also applies to: 9783-9784


9816-9817: Q3_HIFI indirect matmul/matvec registrations look fine but rely on core kernel correctness

  • kernel_mul_mm_id_q3_hifi_{f32,f16} reuse the same mul_mm_id template as other Q3_* types, with block_q3_hifi and dequantize_q3_hifi.
  • kernel_mul_mv_id_q3_hifi_f32 wires kernel_mul_mv_q3_hifi_f32_impl<N_R0_Q3_HIFI> through the existing mmv_fn indirection.

These are mechanically consistent with the other quant types. Once the base dequantize_q3_hifi and kernel_mul_mv_q3_hifi_f32_impl semantics are confirmed/fixed, these indirect paths should work as expected.

Also applies to: 9843-9844, 9999-10000

Comment on lines +196 to +221
```powershell
# Find the fastest model for comparison
$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum

# Detailed results table
Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
Write-Host "-" * 70

$TableHeader = "{0,-15} {1,10} {2,10} {3,10} {4,10} {5,10} {6,10}" -f "Model", "Mean", "StdDev", "Median", "Min", "Max", "vs Best"
Write-Host $TableHeader -ForegroundColor White
Write-Host "-" * 70

foreach ($model in $Models) {
    $stats = $AllStats[$model.Name]
    $vsBest = if ($stats.Mean -eq $FastestMean) { "FASTEST" } else {
        "-" + [math]::Round((1 - $stats.Mean / $FastestMean) * 100, 1) + "%"
    }

    $row = "{0,-15} {1,10:F2} {2,10:F2} {3,10:F2} {4,10:F2} {5,10:F2} {6,10}" -f `
        $model.Name, $stats.Mean, $stats.StdDev, $stats.Median, $stats.Min, $stats.Max, $vsBest

    if ($stats.Mean -eq $FastestMean) {
        Write-Host $row -ForegroundColor Green
    } else {
        Write-Host $row
    }
}
```

⚠️ Potential issue | 🔴 Critical

Fix potential division by zero.

If all benchmark runs fail (no speeds collected), $FastestMean will be 0, causing a division-by-zero error on Line 210. This will crash the script instead of showing that all benchmarks failed.

🔎 Proposed fix
```diff
 # Find the fastest model for comparison
-$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum
+$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum
+if ($FastestMean -eq 0) {
+    Write-Warning "All benchmarks failed - no valid speed data collected"
+    $FastestMean = 1  # Prevent division by zero
+}

 # Detailed results table
 Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
```

Comment on lines +10425 to 10427
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "q3_hifi", "auto"], default="f16",
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, q3_hifi for Q3_HIFI (3-bit with outliers), and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)

⚠️ Potential issue | 🔴 Critical

--outtype q3_hifi currently leads to a runtime error (no quant mapping for MOSTLY_Q3_HIFI).

You’ve exposed q3_hifi in the CLI and ftype_map, but prepare_tensors() never maps gguf.LlamaFileType.MOSTLY_Q3_HIFI to a gguf.GGMLQuantizationType. With the current code, running:

python convert_hf_to_gguf.py --outtype q3_hifi ...

will hit the else: raise ValueError(f"Unknown file type: {self.ftype.name}") branch in prepare_tensors() when data_qtype is still False.

You probably want to wire Q3_HIFI through the quantization mapping so the new outtype is usable.

Proposed fix: map MOSTLY_Q3_HIFI to GGMLQuantizationType.Q3_HIFI

Add a case alongside the other file types in ModelBase.prepare_tensors:

                 if isinstance(data_qtype, bool):
                     if self.ftype == gguf.LlamaFileType.ALL_F32:
                         data_qtype = gguf.GGMLQuantizationType.F32
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_F16:
                         data_qtype = gguf.GGMLQuantizationType.F16
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_BF16:
                         data_qtype = gguf.GGMLQuantizationType.BF16
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0:
                         data_qtype = gguf.GGMLQuantizationType.Q8_0
+                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q3_HIFI:
+                        data_qtype = gguf.GGMLQuantizationType.Q3_HIFI
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ1_0:
                         data_qtype = gguf.GGMLQuantizationType.TQ1_0
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ2_0:
                         data_qtype = gguf.GGMLQuantizationType.TQ2_0
                     else:
                         raise ValueError(f"Unknown file type: {self.ftype.name}")

(Adjust the enum name if the quantization type is spelled differently in gguf.GGMLQuantizationType.)

Since this repo mirrors upstream, you may want to pass this concern along to the upstream PR author as well. Based on learnings, this should be relayed via ngxson to the original contributor.

Also applies to: 10583-10592

🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 10425-10427 (and similarly 10583-10592),
the new CLI outtype "q3_hifi" is exposed but prepare_tensors() never maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to a GGML quantization type, causing the
"Unknown file type" error; update the file-type -> quantization mapping in
ModelBase.prepare_tensors to include a branch that maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to gguf.GGMLQuantizationType.Q3_HIFI (or the
exact enum name used by gguf), mirroring the pattern used for the other file
types so q3_hifi runs without raising the ValueError.

Comment on lines +64 to +73
```
Perplexity Improvement (Q3_HIFI vs Q3_K_M)
═══════════════════════════════════════════════════════
0.6B ████████████████████████████████████ -16.4%
1.7B ██████████████████████████████████████████ -21.4%
4B ██████████████████ -7.3%
8B ███████████ -4.4%
14B ████ -1.6%
32B █████ -2.0%
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block (MD040).

The fenced block starting at line 64 opens with ``` and no language, triggering markdownlint MD040. Consider using ```text (or ```none) to satisfy tooling and keep rendering consistent.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/quantization/Q3_HIFI.md around lines 64 to 73, the fenced code block
lacks a language tag which triggers markdownlint MD040; update the opening fence
to include a neutral language (e.g., "text" or "none") so the block is
explicitly marked as plain text, ensuring the linter passes and rendering stays
consistent.

Comment on lines +302 to +304
// === OUTLIER EXTENSION (18 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS]; // 6 bytes: outlier positions (0-255)
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values

⚠️ Potential issue | 🟡 Minor

Comments have incorrect byte sizes for outlier extension.

The comments don't match the actual sizes based on Q3_HIFI_OUTLIERS = 8:

  • Line 302: Comment says "18 bytes" but extension is actually 24 bytes (8 + 16)
  • Line 303: Comment says "6 bytes" but outlier_idx[8] is 8 bytes
  • Line 304: Comment says "12 bytes" but outlier_vals[8] × 2 bytes = 16 bytes

The code and static_assert are correct; only the comments need updating.

Suggested fix
-    // === OUTLIER EXTENSION (18 bytes) ===
-    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 6 bytes: outlier positions (0-255)
-    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values
+    // === OUTLIER EXTENSION (24 bytes) ===
+    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 8 bytes: outlier positions (0-255)
+    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 16 bytes: FP16 outlier values
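
For a quick sanity check, the size arithmetic can be reproduced in a standalone C sketch (the ggml_half typedef here is a stand-in assumption for ggml's 2-byte FP16 storage type, not the in-tree definition):

```c
#include <stdint.h>
#include <assert.h>

#define Q3_HIFI_OUTLIERS 8
typedef uint16_t ggml_half; // stand-in for ggml's 2-byte FP16 storage type

// Outlier extension: 8 one-byte indices + 8 two-byte values = 24 bytes,
// matching the static_assert rather than the stale "18 bytes" comment.
static_assert(sizeof(uint8_t)   * Q3_HIFI_OUTLIERS ==  8, "outlier_idx: 8 bytes");
static_assert(sizeof(ggml_half) * Q3_HIFI_OUTLIERS == 16, "outlier_vals: 16 bytes");
static_assert(8 + 16 == 24, "extension is 24 bytes, not 18");
```
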
🤖 Prompt for AI Agents
In ggml/src/ggml-common.h around lines 302 to 304, the inline comments for the
outlier extension show incorrect byte counts; update them to match
Q3_HIFI_OUTLIERS = 8 so the extension is 24 bytes, outlier_idx[Q3_HIFI_OUTLIERS]
is 8 bytes, and outlier_vals[Q3_HIFI_OUTLIERS] (ggml_half) is 16 bytes; leave
the code and static_assert untouched and only change the comment texts to the
correct sizes.

Comment on lines +2457 to +2477
// Q3_HIFI: Add outlier corrections
// Fully unrolled loop for 6 outliers - eliminates loop overhead
// Note: We tried branchless masking but the computation cost outweighs
// any branch misprediction savings for only 6 outliers per block.
for (int i = 0; i < nb; ++i) {
const float d_y = y[i].d;
const int8_t * GGML_RESTRICT q8 = y[i].qs;
const uint8_t * GGML_RESTRICT idx = x[i].outlier_idx;
const ggml_fp16_t * GGML_RESTRICT vals = x[i].outlier_vals;

// Unrolled: process all 8 outliers without loop overhead
// Using FMA-friendly pattern: accumulate (w * a) * d_y
sumf += GGML_FP16_TO_FP32(vals[0]) * (float)q8[idx[0]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[1]) * (float)q8[idx[1]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[2]) * (float)q8[idx[2]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[3]) * (float)q8[idx[3]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[4]) * (float)q8[idx[4]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[5]) * (float)q8[idx[5]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[6]) * (float)q8[idx[6]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[7]) * (float)q8[idx[7]] * d_y;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for Q3_HIFI_OUTLIERS constant definition
rg -n 'Q3_HIFI_OUTLIERS' --type c --type h -C2

Repository: ngxson/llama.cpp

Length of output: 5216


🏁 Script executed:

#!/bin/bash
# Search for outlier_idx and outlier_vals declarations to check array sizes
rg -n 'outlier_idx|outlier_vals' --type c --type h -B2 -A2

Repository: ngxson/llama.cpp

Length of output: 5851


🏁 Script executed:

#!/bin/bash
# Search for Q3_HIFI struct definition
rg -n 'struct.*Q3_HIFI|{.*outlier_idx.*outlier_vals' --type c --type h -C5

Repository: ngxson/llama.cpp

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Get the exact lines around 2457-2477 from the x86 quants.c file
sed -n '2457,2477p' ggml/src/ggml-cpu/arch/x86/quants.c | cat -n

Repository: ngxson/llama.cpp

Length of output: 1409


Comments reference "6 outliers" but code processes 8.

Lines 2458 and 2460 mention "6 outliers" while the loop unrolls exactly 8 iterations (lines 2469-2476) and Q3_HIFI_OUTLIERS is defined as 8. Update the comments:

-    // Fully unrolled loop for 6 outliers - eliminates loop overhead
-    // Note: We tried branchless masking but the computation cost outweighs
-    // any branch misprediction savings for only 6 outliers per block.
+    // Fully unrolled loop for 8 outliers - eliminates loop overhead
+    // Note: We tried branchless masking but the computation cost outweighs
+    // any branch misprediction savings for only 8 outliers per block.

The code is correct; only the comments need updating to match Q3_HIFI_OUTLIERS.

🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/arch/x86/quants.c around lines 2457 to 2477, the inline
comments incorrectly reference "6 outliers" while the loop unroll processes 8
outliers (and Q3_HIFI_OUTLIERS is defined as 8); update the comments to say "8
outliers" (or reference Q3_HIFI_OUTLIERS) and remove mention of 6 to match the
actual unrolled code and constant, e.g., change the comment lines that read "6
outliers" to "8 outliers" or "Q3_HIFI_OUTLIERS outliers" so the comments
accurately reflect the implementation.

Comment on lines +3 to +4
// Q3_HIFI dequantization shader
// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers

⚠️ Potential issue | 🟡 Minor

Comment states 6 outliers, but implementation uses 8.

The comment says "6 FP16 outliers" but line 49 loops over Q3_HIFI_OUTLIERS which is defined as 8 elsewhere in the codebase. Update the comment to match the actual implementation.

🔎 Suggested fix
 // Q3_HIFI dequantization shader
-// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers
+// Uses Q3_K-compatible layout (hmask + qs + scales) with 8 FP16 outliers
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp around lines 3 to 4,
the header comment states "6 FP16 outliers" but the implementation loops over
Q3_HIFI_OUTLIERS which is defined as 8; update the comment to reflect the actual
implementation (e.g., change "6 FP16 outliers" to "8 FP16 outliers" or remove
the hardcoded count) so the comment matches the code.

Comment on lines +4 to +6
// Q3_HIFI matrix-vector multiplication shader
// Uses Q3_K-compatible layout, outlier correction skipped on GPU for simplicity
// (outliers are still applied on CPU for full quality)

⚠️ Potential issue | 🟠 Major

GPU quality differs from CPU - outlier correction skipped.

The comment explicitly states outlier correction is skipped on GPU. This is a significant quality difference since outliers are the defining feature of Q3_HIFI that distinguishes it from Q3_K. Users may not expect different results between CPU and GPU backends.

Consider one of the following:

  1. Implementing outlier correction in the shader (adds ~8 FMA ops per block; a sketch follows this list)
  2. Documenting this limitation prominently in user-facing documentation
  3. Adding a warning when Q3_HIFI models are used on Vulkan
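
To make option 1 concrete, here is a minimal C sketch of the per-block adjustment the shader would mirror, modeled on the CPU reference path quoted earlier for the x86 dot product. The function name and the float-activation layout are illustrative assumptions, not the in-tree API; a GLSL port would express the same loop with unpackHalf2x16 or float16_t for the FP16 weights:

```c
#include <stdint.h>

// Per-block outlier correction the shader would add after the base Q3 dot
// product. w[k] is the k-th FP16 outlier weight already converted to float
// (GGML_FP16_TO_FP32 on CPU), idx[k] its position within the 256-wide block,
// and y points at this block's float activations.
static float q3_hifi_outlier_correction(const uint8_t idx[8],
                                        const float   w[8],
                                        const float * y) {
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += w[k] * y[idx[k]]; // ~8 FMA ops per block
    }
    return sum;
}
```
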
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp around lines 4-6
the comment notes outlier correction is skipped on GPU causing Q3_HIFI results
to differ from CPU; resolve by either (A) implement outlier correction in the
shader by adding the per-block outlier adjustment: load outlier values per
block, apply the ~8 FMA operations per block after the core dot-product
accumulation and before writing results, taking care to preserve precision and
memory alignment and testing performance impact; or (B) if shader implementation
is unacceptable, add a prominent user-facing change: update documentation to
clearly state GPU backend skips outlier correction for Q3_HIFI and add a runtime
warning emitted when a Q3_HIFI model is loaded on Vulkan to notify users of
potential quality differences.

Comment on lines +298 to +301
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_HIFI) {
// Adaptive Q3_HIFI: use Q3_HIFI for ALL attn_v layers (consistently sensitive)
new_type = GGML_TYPE_Q3_HIFI;
}

⚠️ Potential issue | 🟠 Major

Adaptive Q3_HIFI policy is coherent, but fallback handling for GGML_TYPE_Q3_HIFI is missing.

The new ftype wiring for LLAMA_FTYPE_MOSTLY_Q3_HIFI looks logically consistent:

  • attn_v.weight → always GGML_TYPE_Q3_HIFI (these are known to be sensitive).
  • ffn_down → first ~1/3 of layers Q3_HIFI, middle band Q4_K when use_more_bits is true, rest Q3_K.
  • attn_output.weight and attn_qkv.weight → treated as Q4_K.
  • Default type for this ftype is GGML_TYPE_Q3_K, matching the “Q3_K base, Q3_HIFI on sensitive layers” comment.

However, in the convert_incompatible_tensor block at the end of llama_tensor_get_type, there is no case for GGML_TYPE_Q3_HIFI. If a tensor assigned Q3_HIFI ends up with nx % ggml_blck_size(Q3_HIFI) != 0, the switch will hit the default branch and throw:

default: throw std::runtime_error("\nUnsupported tensor size encountered\n");

Existing K-quants (including GGML_TYPE_Q3_K) instead fall back to a more permissive format (e.g., IQ4_NL) in this situation.

To make Q3_HIFI robust to odd geometries and consistent with the rest of the K-quants, add an explicit fallback case for GGML_TYPE_Q3_HIFI, e.g.:

Proposed fallback mapping for GGML_TYPE_Q3_HIFI
-        switch (new_type) {
+        switch (new_type) {
             case GGML_TYPE_TQ1_0:
             case GGML_TYPE_TQ2_0:  new_type = GGML_TYPE_Q4_0; break;  // TODO: use a symmetric type instead
             case GGML_TYPE_IQ2_XXS:
@@
-            case GGML_TYPE_Q2_K:
-            case GGML_TYPE_Q3_K:
+            case GGML_TYPE_Q2_K:
+            case GGML_TYPE_Q3_K:
+            case GGML_TYPE_Q3_HIFI:
             case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;

This preserves the “Q3_HIFI as a K-quant” behavior for valid shapes, while giving you a graceful degradation path instead of a hard failure for incompatible tensor sizes. Based on learnings, this is exactly the sort of issue that should be relayed upstream.

Also applies to: 355-360, 404-405, 413-415, 586-587

🤖 Prompt for AI Agents
In src/llama-quant.cpp around lines 298-301 (and similarly at the other affected
ranges 355-360, 404-405, 413-415, 586-587), the convert_incompatible_tensor
switch lacks a case for GGML_TYPE_Q3_HIFI which causes a runtime throw when nx %
ggml_blck_size(GGML_TYPE_Q3_HIFI) != 0; add an explicit case for
GGML_TYPE_Q3_HIFI that mirrors the existing fallback used for K-quant types
(e.g., map it to the same permissive fallback format used for GGML_TYPE_Q3_K /
other K-quant fallbacks such as the IQ4_NL-equivalent branch), so tensors that
cannot use packed Q3_HIFI gracefully degrade instead of hitting the default
error path.

Comment on lines +23 to +26
import logging

# Configuration
PPL_THRESHOLD = 25.0 # Reasonable threshold for 3-bit quantization

⚠️ Potential issue | 🟠 Major

Missing logging.basicConfig() - info/debug messages won't appear.

"The handler's level is set to WARNING, so all events at this and greater severities will be output." Since there's no logging.basicConfig() call, all logging.info() and logging.debug() calls (lines 101, 147, 148, 154, 162, 183, 184, 187) will be silently dropped. The script will appear to run silently on success, only showing output on errors.

Suggested fix

Add configuration after the imports:

 import logging
 
+# Configure logging to show info-level messages
+logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
+
 # Configuration
 PPL_THRESHOLD = 25.0  # Reasonable threshold for 3-bit quantization
🤖 Prompt for AI Agents
In tests/test-q3-hifi.py around lines 23 to 26, logging is imported but no basic
configuration is set so info/debug messages are dropped; add a
logging.basicConfig() call immediately after the imports to set an appropriate
level (INFO or DEBUG) and a simple format so logging.info()/debug() calls (lines
101, 147, 148, 154, 162, 183, 184, 187) are emitted during successful runs.

Comment on lines +84 to +89
PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")

if [ -z "$PPL" ]; then
# Try alternate format: just look for the last PPL value
PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
fi

⚠️ Potential issue | 🟠 Major

grep -oP is not portable to macOS.

The -P (Perl regex) flag is a GNU grep extension and is not available on macOS's BSD grep by default. This will cause the test to fail on macOS.

Consider using a more portable approach:

🔎 Proposed portable fix
 # Extract final perplexity value
 # Format: "Final estimate: PPL = X.XXXX +/- Y.YYYY"
-PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")
+PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*Final estimate: PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)

 if [ -z "$PPL" ]; then
     # Try alternate format: just look for the last PPL value
-    PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
+    PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)
 fi

As per coding guidelines, cross-platform compatibility should be tested on Linux, macOS, and Windows.

🤖 Prompt for AI Agents
In tests/test-q3-hifi.sh around lines 84–89, the script uses grep -oP (Perl
regex) which is not available on macOS; replace the -P usage with a
POSIX-portable extraction using grep -E (or plain grep) piped to sed or awk to
capture the numeric PPL value (e.g., match "PPL =" then extract the following
float), and keep the fallback that picks the last match if empty; ensure quoting
and exit-on-failure behavior remain unchanged so the script works on Linux and
macOS.

@ngxson
Owner Author

ngxson commented Dec 22, 2025

@coderabbitai do a review with my "mirror PR" preset. Then answer: are there any signs this PR is AI-generated?

@coderabbitai

coderabbitai bot commented Dec 22, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for suspicious AI-generated comment patterns across the codebase
echo "=== Searching for suspicious AI-generated patterns ==="
echo ""
echo "--- Looking for 'CRITICAL:' comments ---"
rg -i "CRITICAL:" --type c --type cpp --type h --type py --type sh -C 2 || echo "None found"
echo ""
echo "--- Looking for 'You are correct' patterns ---"
rg -i "you are correct" --type c --type cpp --type h --type py --type sh -C 2 || echo "None found"
echo ""
echo "--- Looking for 'NOTE:' or 'IMPORTANT:' in unusual places ---"
rg -i "^\s*(//|#)\s*(NOTE|IMPORTANT|WARNING):" --type c --type cpp --type h --type py -C 1 | head -50 || echo "None found"
echo ""
echo "--- Looking for overly verbose or explanatory comments ---"
rg -i "(obviously|clearly|as we can see|it is important to note|it should be noted)" --type c --type cpp --type h -C 2 || echo "None found"
echo ""
echo "--- Checking for TODO comments with unusual formatting ---"
rg "TODO.*:" --type c --type cpp --type h -C 1 | head -30 || echo "None found"

Length of output: 8352


🏁 Script executed:

#!/bin/bash
# Check for duplicate function declarations or suspicious code patterns
echo "=== Checking for duplicate declarations ==="
echo ""
echo "--- Looking for duplicate quantize_row_q3_hifi declarations ---"
rg "quantize_row_q3_hifi" ggml/src/ggml-cpu/quants.h -n
echo ""
echo "--- Looking for duplicate ggml_vec_dot_q3_hifi_q8_K declarations ---"
rg "ggml_vec_dot_q3_hifi_q8_K" ggml/src/ggml-cpu/quants.h -n

Length of output: 1070


🏁 Script executed:

#!/bin/bash
# Check the Q3_HIFI implementation for obvious issues
echo "=== Checking Q3_HIFI block structure definition ==="
cat ggml/src/ggml-common.h | grep -A 20 "typedef struct.*block_q3_hifi"
echo ""
echo "=== Checking static_assert for Q3_HIFI ==="
rg "static_assert.*block_q3_hifi" ggml/src/ggml-common.h -A 1

Length of output: 248


🏁 Script executed:

#!/bin/bash
# Look for inconsistent outlier counts
echo "=== Checking Q3_HIFI_OUTLIERS usage ==="
rg "Q3_HIFI_OUTLIERS" -n --type c --type cpp --type h -C 1

Length of output: 3471


🌐 Web query:

llama.cpp pull request 18246 Q3_HIFI

💡 Result:

I couldn't find a public GitHub pull request numbered 18246 for llama.cpp nor any PR titled "Q3_HIFI". Do you want me to:

  • search a specific repo or fork (ggerganov/llama.cpp, ggml-org/llama.cpp, or a fork), or
  • look for a PR with a different number or name, or
  • fetch the PR URL if you paste it?

Tell me which repo or give the link/any extra detail and I'll look it up.


@ngxson I've completed the review with your mirror PR preset. Here are my findings:

🚨 AI-Generated Code Detected

Yes, there are clear signs this PR contains AI-generated code. I found suspicious "CRITICAL:" comments that match the pattern you asked me to watch for:

  1. ggml/src/ggml-cpu/arch/x86/quants.c (line ~1139):

    // CRITICAL: Use block_q3_hifi instead of block_q3_K for correct stride (128 bytes vs 110 bytes)
  2. ggml/src/ggml-cpu/arch/arm/quants.c (line ~95):

    // CRITICAL: Use block_q3_hifi for correct 128-byte stride

These "CRITICAL:" prefixed comments are a telltale sign of AI-generated code, especially when they state obvious implementation details.

❌ Obvious Mistakes Found

1. Duplicate Function Declarations (ggml/src/ggml-cpu/quants.h)

Every Q3_HIFI function is declared twice:

  • Lines 26-27: quantize_row_q3_hifi declared twice
  • Lines 50-51: ggml_vec_dot_q3_hifi_q8_K declared twice
  • Lines 86-87: ggml_vec_dot_q3_hifi_q8_K_generic declared twice

2. Critical Outlier Count Inconsistency ⚠️

There's a serious mismatch in the outlier count:

Header definition (ggml/src/ggml-common.h):

#define Q3_HIFI_OUTLIERS   8  // Defines 8 outliers

Comments claim (multiple files):

// === OUTLIER EXTENSION (18 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 6 bytes: outlier positions
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values

Actual hardcoded loops (ggml/src/ggml-cpu/arch/x86/quants.c):

// Outlier correction (unrolled for 6 outliers)
for (int k = 0; k < 6; ++k) {  // Hardcoded to 6!

The math doesn't work:

  • If Q3_HIFI_OUTLIERS = 8: extension would be 8 + (8×2) = 24 bytes, not 18
  • Comments say "6 outliers" and "18 bytes": 6 + (6×2) = 18 bytes
  • Static assert uses 8, but code loops use 6

This inconsistency could cause memory corruption or incorrect results depending on which value is actually used at runtime.

3. Missing Upstream PR Link Verification

The upstream PR link (ggml-org#18246) couldn't be verified - the repository or PR doesn't appear to exist publicly.

Recommendations

You should relay to the upstream contributor:

  1. Remove duplicate declarations in quants.h
  2. Remove AI-generated "CRITICAL:" comments
  3. Fix the outlier count inconsistency - decide whether it should be 6 or 8 and update ALL locations consistently (header, comments, loops, static asserts)
  4. Verify the correct upstream PR URL

The outlier count issue is the most serious as it could cause runtime bugs.


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
