
Conversation

@ngxson ngxson (Owner) commented Dec 22, 2025

Mirror from upstream PR: ggml-org#18246

Summary by CodeRabbit

  • New Features

    • Added Q3_HIFI quantization type with improved quality and outlier handling
    • Added benchmarking tools for model performance testing
  • Documentation

    • Added comprehensive IMatrix reference guide
    • Added Q3_HIFI quantization analysis with performance recommendations
  • Tests

    • Added Q3_HIFI validation testing framework
  • Chores

    • Updated .gitignore for model artifacts and datasets
    • Added quantization tooling options


Geoff Munn and others added 30 commits November 27, 2025 22:42
  • Refactor quantization logic to handle quant_weights for outlier selection and improve clarity in the quantization process.
  • This document provides a comprehensive comparison of three 3-bit quantization strategies: Q3_HIFI, Q3_K_S, and Q3_K_M. It includes technical specifications, performance benchmarks, and recommendations for production use.
  • This guide provides a comprehensive overview of importance matrix (imatrix) files, including their purpose, generation, usage during quantization, and best practices for effective implementation.
  • Added Q3_HIFI type with quantization function and placeholder for dot product implementation.
  • Updated the comparison of Q3 quantization formats, including detailed descriptions of Q3_HIFI (Pure and Hybrid), Q3_K_S, and Q3_K_M. Added performance benchmarks, recommendations, and updated conclusions based on file size, quality, speed, and memory usage.
  • Implemented NEON-optimized dequantization for Q3_HIFI format, processing values in blocks for efficiency.
  • Added AVX2-optimized dequantization function for Q3_HIFI.
@coderabbitai coderabbitai bot commented Dec 22, 2025

Walkthrough

Introduces Q3_HIFI, a new 3-bit quantization type with 6–8 FP16 outliers per block, implemented across CPU (x86/ARM), GPU (CUDA, Metal, Vulkan, SYCL), quantization infrastructure, and model integration layers. Includes quantization/dequantization kernels, backend dispatch logic, test utilities, and reference documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Core Type Definitions**<br>`ggml/include/ggml.h`, `ggml/src/ggml-common.h` | New enum value GGML_TYPE_Q3_HIFI and block structure block_q3_hifi with Q3_K-compatible layout plus 8 FP16 outlier fields; constants Q3_HIFI_BLOCK_SIZE and Q3_HIFI_OUTLIERS added. |
| **Quantization Reference & Generic**<br>`ggml/src/ggml-quants.c`, `ggml/src/ggml-quants.h`, `ggml/src/ggml.c` | Reference and generic quantization/dequantization paths for Q3_HIFI; type traits registration in main library. |
| **CPU Backend x86 & ARM**<br>`ggml/src/ggml-cpu/quants.c`, `ggml/src/ggml-cpu/quants.h`, `ggml/src/ggml-cpu/arch/x86/quants.c`, `ggml/src/ggml-cpu/arch/arm/quants.c`, `ggml/src/ggml-cpu/ggml-cpu.c`, `ggml/src/ggml-cpu/ops.cpp` | Optimized AVX2 and NEON kernels for Q3_HIFI dot product and dequantization; type trait dispatch updated. |
| **CUDA Backend**<br>`ggml/src/ggml-cuda/common.cuh`, `ggml/src/ggml-cuda/convert.cu`, `ggml/src/ggml-cuda/dequantize.cuh`, `ggml/src/ggml-cuda/ggml-cuda.cu`, `ggml/src/ggml-cuda/mmq.cu`, `ggml/src/ggml-cuda/mmvq.cu`, `ggml/src/ggml-cuda/vecdotq.cuh` | CUDA kernels for Q3_HIFI dequantization and vector dot products with outlier handling; pipeline dispatch updated across MMVQ and dequant paths. |
| **Metal Backend**<br>`ggml/src/ggml-metal/ggml-metal-device.cpp`, `ggml/src/ggml-metal/ggml-metal-impl.h`, `ggml/src/ggml-metal/ggml-metal.metal` | Metal kernels for Q3_HIFI dequantization, mat-vec, and mat-mat operations with FP32/FP16 variants; resource sizing constants and host name bindings. |
| **Vulkan Backend**<br>`ggml/src/ggml-vulkan/ggml-vulkan.cpp`, `ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp`, `ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp`, `ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp` | Vulkan compute shaders for Q3_HIFI dequantization and mat-vec with outlier restoration; dispatcher backend support extended. |
| **SYCL Backend**<br>`ggml/src/ggml-sycl/convert.cpp`, `ggml/src/ggml-sycl/dequantize.hpp`, `ggml/src/ggml-sycl/mmvq.cpp`, `ggml/src/ggml-sycl/vecdotq.hpp` | SYCL implementations for Q3_HIFI dequantization, vector dot product, and matrix operations; dispatch logic integrated. |
| **Library & Model Integration**<br>`include/llama.h`, `src/llama-model-loader.cpp`, `src/llama-quant.cpp`, `gguf-py/gguf/constants.py`, `convert_hf_to_gguf.py` | Enum value LLAMA_FTYPE_MOSTLY_Q3_HIFI, model loader mapping, adaptive quantization logic for sensitive layers, and GGUF constant/type registration. |
| **Tools & Utilities**<br>`tools/quantize/quantize.cpp`, `benchmark_speed_test.ps1` | Q3_HIFI quantization option added to CLI tool; PowerShell benchmark script for multi-model speed testing. |
| **Documentation & Tests**<br>`docs/quantization/Q3_HIFI.md`, `tests/test-q3-hifi.py`, `tests/test-q3-hifi.sh`, `tests/test-q3-hifi-text.txt`, `IMatrix_Guide.md` | Comprehensive Q3_HIFI analysis document, Python/Bash test scripts for perplexity validation, test corpus, and imatrix workflow guide. |
| **Configuration**<br>`.gitignore` | Exclude model artifacts, datasets, and large files from version control. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The changes introduce a new quantization type across 40+ files with backend-specific optimizations (AVX2, NEON, CUDA, Metal, Vulkan, SYCL). While the additions follow existing patterns and maintain modularity, the scope spans multiple architectures with custom kernel implementations, outlier handling logic, and intricate bit-manipulation operations. CPU and GPU kernels require careful validation of correctness, and the adaptive quantization routing in llama-quant.cpp adds conditional logic across multiple code paths.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description is minimal and provides only a reference link to the upstream PR without following the repository's PR template structure. | Expand the description to include details about changes, rationale, testing approach, and any breaking changes, following the repository template guidelines. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 19.74%, which is insufficient; the required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introduction of a new Q3_HIFI quantization type. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Nitpick comments (21)
.gitignore (1)

140-153: Simplify patterns and relocate to Models section.

The ignore patterns are overly specific and verbose. Additionally, they're placed outside the Models section where similar entries belong:

  1. Lines 140–142: List three specific .raw files. Instead, ignore the entire directory.
  2. Lines 143–153: List 11 individual files under Qwen3-1.7B/. Instead, use a single directory pattern.

These appear to be datasets and models used for benchmarking or testing the Q3_HIFI quantization format.

🔎 Proposed refactor to simplify patterns

Consolidate to the Models section (around line 68–74) and use directory-level patterns:

```diff
 !/models/.editorconfig
 !/models/ggml-vocab-*.gguf*
 !/models/templates
+
+# Q3_HIFI benchmark & test datasets
+/wikitext-2-raw/
+/Qwen3-1.7B/
```

Then remove lines 140–153 entirely.

Clarify whether these datasets are:

  • Generated dynamically at runtime (in which case they shouldn't be in .gitignore at all), or
  • Downloaded/populated by setup scripts (in which case they should be documented in the repository root or build instructions).
benchmark_speed_test.ps1 (3)

15-19: Consider parameterizing hardcoded paths.

The script uses hardcoded relative paths that assume execution from the repository root. This will fail if run from a different directory.

🔎 Suggested refactor to make paths configurable
```diff
 param(
     [int]$Iterations = 100,
     [int]$Threads = 4,
     [int]$Repeats = 3,
     [int]$PromptTokens = 0,
-    [int]$GenerateTokens = 20
+    [int]$GenerateTokens = 20,
+    [string]$LlamaBenchPath = ".\build\bin\Release\llama-bench.exe",
+    [string]$ModelsDir = "."
 )

 $ErrorActionPreference = "Stop"

 # Configuration
-$LlamaBench = ".\build\bin\Release\llama-bench.exe"
+$LlamaBench = $LlamaBenchPath
 $Models = @(
-    @{ Name = "Q3_K_S"; Path = ".\Qwen3-1.7B-f16-Q3_K_S.gguf" },
-    @{ Name = "Q3_K_M"; Path = ".\Qwen3-1.7B-f16-Q3_K_M.gguf" },
-    @{ Name = "Q3_HIFI"; Path = ".\Qwen3-1.7B-f16-Q3_HIFI.gguf" }
+    @{ Name = "Q3_K_S"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_S.gguf" },
+    @{ Name = "Q3_K_M"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_M.gguf" },
+    @{ Name = "Q3_HIFI"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_HIFI.gguf" }
 )
```

76-115: Consider validating parsed speed values.

The script parses speed values from llama-bench output but doesn't validate that the extracted values are positive and reasonable. Invalid or zero values could skew statistics.

🔎 Suggested validation
```diff
                 if ($lineStr -match "tg\d+\s*\|\s*([\d.]+)\s*±\s*([\d.]+)") {
                     $speed = [double]$Matches[1]
-                    [void]$Results[$model.Name].Speeds.Add($speed)
-                    $found = $true
-                    break
+                    if ($speed -gt 0) {
+                        [void]$Results[$model.Name].Speeds.Add($speed)
+                        $found = $true
+                        break
+                    }
                 }
                 # Alternative pattern: just numbers at end of line
                 elseif ($lineStr -match "\|\s*tg\d+\s*\|\s*([\d.]+)") {
                     $speed = [double]$Matches[1]
-                    [void]$Results[$model.Name].Speeds.Add($speed)
-                    $found = $true
-                    break
+                    if ($speed -gt 0) {
+                        [void]$Results[$model.Name].Speeds.Add($speed)
+                        $found = $true
+                        break
+                    }
                 }
```

180-243: Consider adding error summary to report.

Errors are counted and included in the CSV export, but there's no prominent summary in the console output. Users might miss that benchmarks had failures, leading to potentially misleading statistics.

🔎 Suggested addition

Add after line 188:

 Write-Host "Total iterations per model: $Iterations"
+
+# Error summary
+$TotalErrors = ($Results.Values | ForEach-Object { $_.Errors } | Measure-Object -Sum).Sum
+if ($TotalErrors -gt 0) {
+    Write-Host ""
+    Write-Host "WARNING: $TotalErrors total benchmark failures detected" -ForegroundColor Red
+    foreach ($model in $Models) {
+        $errors = $Results[$model.Name].Errors
+        if ($errors -gt 0) {
+            Write-Host "  - $($model.Name): $errors failed runs" -ForegroundColor Yellow
+        }
+    }
+}
 Write-Host ""
tests/test-q3-hifi.sh (1)

102-102: Dependency on bc may cause failures on minimal systems.

The bc utility may not be installed on all systems (e.g., minimal Docker containers, some CI environments). Consider adding a check or using an alternative.

🔎 Option 1: Add dependency check
```diff
+# Check for required commands
+if ! command -v bc &> /dev/null; then
+    echo "Error: 'bc' is required but not installed."
+    exit 1
+fi
+
 # Check if PPL is reasonable (less than threshold)
 if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
```
🔎 Option 2: Use awk instead (more portable)
```diff
 # Check if PPL is reasonable (less than threshold)
-if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
+if awk "BEGIN {exit !($PPL < $PPL_THRESHOLD)}"; then
     echo "✅ Test PASSED: PPL ($PPL) is below threshold ($PPL_THRESHOLD)"
```
ggml/src/ggml-sycl/dequantize.hpp (2)

383-388: Consider performance implications of per-element outlier scanning.

The outlier check loop runs for every element (4 elements per thread × 6 outlier checks = 24 comparisons). While functionally correct, this linear scan could be a bottleneck for GPU execution.

For a future optimization, consider:

  • Pre-sorting outlier indices to enable binary search
  • Using a bitmask or lookup table if Q3_HIFI_OUTLIERS is small and fixed

That said, with only 6 outliers per block and the break statement, this is likely acceptable.
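A minimal sketch of the bitmask idea for a 256-element block (the helper names here are illustrative, not taken from the kernel):

```cpp
#include <cstdint>

// Hypothetical helper, not the actual SYCL kernel code: build a 256-bit
// membership mask from the block's outlier indices once, then test each
// element in O(1) instead of scanning every outlier slot per element.
static void build_outlier_mask(uint64_t mask[4], const uint8_t * outlier_idx, int n_outliers) {
    mask[0] = mask[1] = mask[2] = mask[3] = 0;
    for (int k = 0; k < n_outliers; ++k) {
        const int idx = outlier_idx[k];        // position within the 256-wide block
        mask[idx >> 6] |= 1ull << (idx & 63);  // set bit idx
    }
}

static inline bool is_outlier(const uint64_t mask[4], int idx) {
    return (mask[idx >> 6] >> (idx & 63)) & 1;
}
```

The mask costs one pass over the outlier list per block, after which each of the 256 per-element checks is a shift and an AND.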


415-418: Missing early break in else branch outlier loop.

Unlike the QK_K==256 branch (line 387), this branch doesn't use break after finding a match. While functionally correct (it just overwrites the same value), adding early termination would improve consistency and performance.

🔎 Proposed fix
```diff
     // Check for outliers
     int idx0 = 16*is + il;
     int idx1 = 16*is + il + 32;
+    bool found0 = false, found1 = false;
     for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-        if (x[i].outlier_idx[k] == idx0) val0 = x[i].outlier_vals[k];
-        if (x[i].outlier_idx[k] == idx1) val1 = x[i].outlier_vals[k];
+        if (!found0 && x[i].outlier_idx[k] == idx0) { val0 = x[i].outlier_vals[k]; found0 = true; }
+        if (!found1 && x[i].outlier_idx[k] == idx1) { val1 = x[i].outlier_vals[k]; found1 = true; }
+        if (found0 && found1) break;
     }
```
IMatrix_Guide.md (1)

345-348: Consider qualifying the perplexity improvement claim.

The statement "Typically 5-15% better perplexity" is a helpful ballpark figure, but perplexity improvements can vary significantly based on model architecture, calibration data quality, and other factors.

Consider adding a qualifier such as "results may vary based on model and calibration data" to set appropriate expectations.

ggml/src/ggml-cuda/vecdotq.cuh (1)

775-847: Defensively bound-check outlier_idx[k] before using it.

The Q3_HIFI outlier loop assumes every outlier_idx[k] is a valid index into the 256‑weight block. If any unused slots are left uninitialized or use a sentinel outside [0, QK_K), this can lead to undefined behavior when indexing bq8_1[idx_bq8] / qs[idx_in_bq8].

Given the loop runs for all k < Q3_HIFI_OUTLIERS regardless of how many real outliers exist, it would be safer to explicitly skip out-of-range indices:

Proposed defensive guard for outlier indices
```diff
-#pragma unroll
-    for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-        const int idx = bq3_hifi->outlier_idx[k];
+#pragma unroll
+    for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
+        const int idx = bq3_hifi->outlier_idx[k];
+        // Skip unused/sentinel entries and out-of-range indices
+        if (idx < 0 || idx >= QK_K) {
+            continue;
+        }
```

If the quantization path already guarantees that unused slots have in‑range indices and zero values, this is purely defensive, but it makes the assumptions explicit and hardens the kernel against future layout changes.

include/llama.h (1)

155-158: Public enum change: Q3_HIFI_OLD/UNIFORM removal is a source-level API break

Removing LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD (39) and LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM (40) and only exposing LLAMA_FTYPE_MOSTLY_Q3_HIFI = 41 means any external code that referenced the old names will no longer compile, and older GGUF models using 39/40 will now be reported as “unknown” by llama_model_ftype_name().

Given include/llama.h is the public API surface, it’s worth double‑checking that:

  • Those legacy ftypes were never documented/stabilized, or
  • Downstream users are expected to migrate and this change is communicated in release notes.

If backward compatibility is a concern, one option is to keep the old enumerators (possibly marked as legacy in comments) and explicitly map them at load time to the new Q3_HIFI handling.
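If that route is taken, the shim could be as small as the following sketch (a hypothetical helper; only the enumerator names and values come from this review):

```cpp
// Hypothetical compatibility shim, not current llama.cpp code: keep the
// legacy enumerators and fold them into the new Q3_HIFI handling at load time.
enum llama_ftype_compat {
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD     = 39, // legacy, kept so old GGUFs still resolve
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM = 40, // legacy
    LLAMA_FTYPE_MOSTLY_Q3_HIFI         = 41, // current
};

static llama_ftype_compat normalize_q3_hifi_ftype(llama_ftype_compat ftype) {
    switch (ftype) {
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD:
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM:
            return LLAMA_FTYPE_MOSTLY_Q3_HIFI; // route legacy files to the new handling
        default:
            return ftype;
    }
}
```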

ggml/src/ggml.c (1)

735-742: Q3_HIFI type traits and quantize wiring look consistent with existing formats

The type_traits[GGML_TYPE_Q3_HIFI] entry and the new GGML_TYPE_Q3_HIFI branch in ggml_quantize_chunk are hooked up in the same way as other block formats (matching block type, block size macro, and *_ref quant/dequant helpers), so the core type system and CPU quantization path look correctly extended.

Minor, non‑blocking nit: most other type_name strings use lowercase (e.g. "q3_K"), while "Q3_HIFI" is uppercase; keep as‑is if it’s intentional branding, otherwise consider normalizing for consistency.

Based on learnings, treating this as upstream‑facing feedback rather than a strict change request for the mirror.

Also applies to: 7548-7549

ggml/src/ggml-cuda/dequantize.cuh (1)

79-130: Q3_HIFI CUDA dequantization matches Q3_K layout with sensible outlier handling

The Q3 bit unpacking (qs / hmask math, index ranges, and -4 offset) is consistent with the documented Q3_K layout, and the small unrolled loop over Q3_HIFI_OUTLIERS to restore FP16 outliers is straightforward and cheap on device. From a CUDA side this looks sound.

Two minor follow‑ups you may want to sanity‑check against the CPU reference:

  • Ensure block_q3_hifi’s fields (d, qs, hmask, outlier_idx, outlier_vals) and Q3_HIFI_OUTLIERS exactly match these assumptions (Q3_K‑compatible packing plus a fixed outlier tail).
  • The comments disagree on the outlier count (“6 FP16 outliers” vs “only 8 per 256 weights”); consider aligning them to whatever Q3_HIFI_OUTLIERS is actually set to, just to avoid confusion.
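For reference, the qs/hmask math endorsed above boils down to the following sketch (an illustrative helper rather than the kernel code; the real kernel applies the sub-block and super-block scales on top):

```cpp
#include <cstdint>

// Illustrative only: reconstruct one Q3_K-style 3-bit weight. The low two
// bits live in qs, the high bit in hmask; subtracting 4 when the high bit is
// clear maps the value into the signed range [-4, 3] (equivalently:
// (lo | hi << 2) - 4 with hi as the raw bit).
static inline int unpack_q3(uint8_t qs_byte, int shift, uint8_t hmask_byte, uint8_t m) {
    const int lo  = (qs_byte >> shift) & 3;    // 2-bit field
    const int sub = (hmask_byte & m) ? 0 : 4;  // the "-4 offset" when the high bit is clear
    return lo - sub;                           // signed 3-bit value in [-4, 3]
}
```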
ggml/src/ggml-cpu/quants.c (2)

582-582: Inconsistent FP16 conversion macro.

This line uses GGML_FP16_TO_FP32 while other functions in this file (e.g., ggml_vec_dot_q3_K_q8_K_generic at line 549) use GGML_CPU_FP16_TO_FP32. Use the CPU-specific macro for consistency.

🔎 Suggested fix
```diff
-        const float d = GGML_FP16_TO_FP32(xb->d) * yb->d;
+        const float d = GGML_CPU_FP16_TO_FP32(xb->d) * yb->d;
```

631-638: Same FP16 conversion inconsistency in outlier corrections.

These lines also use GGML_FP16_TO_FP32 instead of GGML_CPU_FP16_TO_FP32. Apply the same fix throughout the function.

🔎 Suggested fix
```diff
-        total_sum += GGML_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
-        total_sum += GGML_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
+        total_sum += GGML_CPU_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
```
ggml/src/ggml-sycl/convert.cpp (1)

117-147: SYCL Q3_HIFI wiring matches existing Q3_K path; align the outlier-count comment.

The new dequantize_row_q3_hifi_sycl wrapper and its use in both ggml_get_to_fp16_sycl and ggml_get_to_fp32_sycl mirror the existing q3_K implementation and look correct.

One minor nit: the comment at Line 117 says “Q3_HIFI: Q3_K-compatible layout with 6 FP16 outliers”, while the shared layout (via block_q3_hifi/Q3_HIFI_OUTLIERS) and Python constants describe Q3_HIFI in terms of the generic Q3_HIFI_OUTLIERS count (documented as 8 outliers in other places). To avoid confusion if the constant ever changes, consider rephrasing the comment to reference Q3_HIFI_OUTLIERS or to match the canonical outlier count used in the block definition.

Also applies to: 574-576, 640-642

ggml/src/ggml-cuda/convert.cu (1)

555-566: Consider distributing outlier restoration across threads.

Currently, only thread 0 handles all outliers while 63 threads idle after the barrier. For 6-8 outliers this overhead is minimal, but if Q3_HIFI_OUTLIERS increases, consider having multiple threads participate:

```diff
     // Synchronize before overwriting outliers
     __syncthreads();

-    // Thread 0 handles outlier restoration
-    if (threadIdx.x == 0) {
-        dst_t * yb = yy + i*QK_K;
-        #pragma unroll
-        for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-            const int idx = x[i].outlier_idx[k];
-            yb[idx] = __half2float(x[i].outlier_vals[k]);
-        }
-    }
+    // Distribute outlier restoration across threads
+    if (threadIdx.x < Q3_HIFI_OUTLIERS) {
+        dst_t * yb = yy + i*QK_K;
+        const int k = threadIdx.x;
+        const int idx = x[i].outlier_idx[k];
+        yb[idx] = __half2float(x[i].outlier_vals[k]);
+    }
```

This avoids thread divergence and utilizes available parallelism.

tests/test-q3-hifi.py (1)

140-145: Consider using logging.exception() in except blocks for better diagnostics.

Per static analysis hint, logging.exception() automatically includes the stack trace, which can aid debugging. However, for straightforward errors like these, the current approach is acceptable.

Optional improvement
```diff
     try:
         perplexity_exe = find_executable("llama-perplexity", build_dir)
     except FileNotFoundError as e:
-        logging.error("Error: %s", e)
+        logging.exception("Could not find perplexity executable")
         logging.info("Make sure you've built llama.cpp first.")
         return 1
```
ggml/src/ggml-cpu/arch/arm/quants.c (1)

4219-4219: Use symbolic constant instead of magic number 96.

The hardcoded constant 96 at line 4219 should be replaced with (Q3_HIFI_BLOCK_SIZE * 3 / 8) for clarity and maintainability. The value 96 represents the number of bytes needed to store 256 3-bit values.

🔎 Proposed refactor
```diff
-                if (bit_offset > 5 && byte_idx + 1 < 96) {
+                if (bit_offset > 5 && byte_idx + 1 < (Q3_HIFI_BLOCK_SIZE * 3 / 8)) {
                     bits |= (qs[byte_idx + 1] << (8 - bit_offset)) & 7;
                 }
```
ggml/src/ggml-quants.c (3)

1398-1401: Layout compatibility assumption lacks compile-time verification.

The cast from block_q3_hifi* to block_q3_K* relies on the comment at lines 1399-1400 stating "first 110 bytes are identical layout." Consider adding a compile-time assertion in the header to catch layout drift:

```c
static_assert(offsetof(block_q3_hifi, hmask) == offsetof(block_q3_K, hmask) &&
              offsetof(block_q3_hifi, qs) == offsetof(block_q3_K, qs) &&
              offsetof(block_q3_hifi, scales) == offsetof(block_q3_K, scales) &&
              offsetof(block_q3_hifi, d) == offsetof(block_q3_K, d),
              "block_q3_hifi must be layout-compatible with block_q3_K");
```

5148-5151: Empty section - appears to be vestigial.

These lines add a comment block but no implementation follows. If this was intended as a placeholder for future code, consider removing it or adding the intended content.


5463-5466: Validation only checks d field, not outlier FP16 values.

The VALIDATE_ROW_DATA_D_F16_IMPL macro validates the block's d field, but block_q3_hifi also contains outlier_vals[Q3_HIFI_OUTLIERS] which are FP16 values that could contain NaN/Inf. Consider extending validation:

```diff
 case GGML_TYPE_Q3_HIFI:
     {
-        VALIDATE_ROW_DATA_D_F16_IMPL(block_q3_hifi, data, nb);
+        const block_q3_hifi * q = (const block_q3_hifi *) data;
+        for (size_t i = 0; i < nb; ++i) {
+            if (!validate_fp16(q[i].d, i)) {
+                return false;
+            }
+            for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
+                if (!validate_fp16(q[i].outlier_vals[k], i)) {
+                    return false;
+                }
+            }
+        }
     } break;
```
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfc959b and e4fd98f.

⛔ Files ignored due to path filters (3)
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl is excluded by !**/*.glsl
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.glsl is excluded by !**/*.glsl
  • ggml/src/ggml-vulkan/vulkan-shaders/types.glsl is excluded by !**/*.glsl
📒 Files selected for processing (43)
  • .gitignore
  • IMatrix_Guide.md
  • benchmark_speed_test.ps1
  • convert_hf_to_gguf.py
  • docs/quantization/Q3_HIFI.md
  • ggml/include/ggml.h
  • ggml/src/ggml-common.h
  • ggml/src/ggml-cpu/arch/arm/quants.c
  • ggml/src/ggml-cpu/arch/x86/quants.c
  • ggml/src/ggml-cpu/ggml-cpu.c
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-cpu/quants.c
  • ggml/src/ggml-cpu/quants.h
  • ggml/src/ggml-cuda/common.cuh
  • ggml/src/ggml-cuda/convert.cu
  • ggml/src/ggml-cuda/dequantize.cuh
  • ggml/src/ggml-cuda/ggml-cuda.cu
  • ggml/src/ggml-cuda/mmq.cu
  • ggml/src/ggml-cuda/mmvq.cu
  • ggml/src/ggml-cuda/vecdotq.cuh
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • ggml/src/ggml-metal/ggml-metal.metal
  • ggml/src/ggml-quants.c
  • ggml/src/ggml-quants.h
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml.c
  • gguf-py/gguf/constants.py
  • include/llama.h
  • src/llama-model-loader.cpp
  • src/llama-quant.cpp
  • tests/test-q3-hifi-text.txt
  • tests/test-q3-hifi.py
  • tests/test-q3-hifi.sh
  • tools/quantize/quantize.cpp
🧰 Additional context used
📓 Path-based instructions (6)
.gitignore

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration

Files:

  • .gitignore
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • src/llama-model-loader.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • include/llama.h
  • ggml/include/ggml.h
  • tools/quantize/quantize.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-common.h
  • ggml/src/ggml-quants.h
  • src/llama-quant.cpp
  • ggml/src/ggml-cpu/quants.h
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • ggml/src/ggml-vulkan/ggml-vulkan.cpp
  • src/llama-model-loader.cpp
  • ggml/src/ggml-metal/ggml-metal-impl.h
  • include/llama.h
  • ggml/include/ggml.h
  • gguf-py/gguf/constants.py
  • tools/quantize/quantize.cpp
  • ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp
  • ggml/src/ggml-sycl/convert.cpp
  • ggml/src/ggml-cpu/ops.cpp
  • ggml/src/ggml-common.h
  • ggml/src/ggml-quants.h
  • src/llama-quant.cpp
  • ggml/src/ggml-cpu/quants.h
  • tests/test-q3-hifi.py
  • ggml/src/ggml-metal/ggml-metal-device.cpp
  • ggml/src/ggml-sycl/dequantize.hpp
  • convert_hf_to_gguf.py
  • ggml/src/ggml-sycl/mmvq.cpp
  • ggml/src/ggml-sycl/vecdotq.hpp
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/llama-model-loader.cpp
  • src/llama-quant.cpp
include/llama.h

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Exercise careful consideration when making changes to the public API in include/llama.h, as API stability is critical

Files:

  • include/llama.h
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/constants.py
  • tests/test-q3-hifi.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to .gitignore : Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration

Applied to files:

  • .gitignore
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • src/llama-model-loader.cpp
  • include/llama.h
  • src/llama-quant.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical

Applied to files:

  • include/llama.h
🧬 Code graph analysis (9)
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (2)
common/common.cpp (2)
  • string_ends_with (459-461)
  • string_ends_with (459-459)
common/common.h (1)
  • string_starts_with (627-630)
ggml/src/ggml-sycl/convert.cpp (1)
ggml/src/ggml-sycl/dequantize.hpp (3)
  • vx (18-18)
  • dequantize_block_q3_hifi (350-423)
  • dequantize_block_q3_hifi (350-351)
ggml/src/ggml-quants.h (2)
ggml/src/ggml-quants.c (3)
  • quantize_row_q3_hifi_ref (1281-1333)
  • dequantize_row_q3_hifi (1390-1409)
  • quantize_q3_hifi (1411-1424)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
  • dequantize_row_q3_hifi (4197-4256)
ggml/src/ggml-cuda/mmvq.cu (1)
ggml/src/ggml-sycl/vecdotq.hpp (1)
  • vec_dot_q3_hifi_q8_1 (805-806)
ggml/src/ggml-cpu/quants.h (3)
ggml/src/ggml-cpu/quants.c (2)
  • quantize_row_q3_hifi (69-73)
  • ggml_vec_dot_q3_hifi_q8_K_generic (558-642)
ggml/src/ggml-cpu/arch/x86/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K (2336-2485)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K (2049-2187)
ggml/src/ggml-cpu/arch/arm/quants.c (1)
ggml/src/ggml-cpu/quants.c (1)
  • ggml_vec_dot_q3_hifi_q8_K_generic (558-642)
ggml/src/ggml-quants.c (1)
ggml/src/ggml.c (1)
  • ggml_row_size (1281-1284)
ggml/src/ggml-cpu/quants.c (1)
ggml/src/ggml-quants.c (1)
  • quantize_row_q3_hifi_ref (1281-1333)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
  • LlamaFileType (3233-3275)
🪛 LanguageTool
IMatrix_Guide.md

[style] ~361-~361: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...--no-ppl` ### Problem: IMatrix file is very large Solutions: - This is normal (can b...

(EN_WEAK_ADJECTIVE)

tests/test-q3-hifi-text.txt

[style] ~26-~26: Consider using a synonym to be more concise.
Context: ...startups have made significant progress in recent years. In 2019, Google announced quantum supr...

(IN_RECENT_STYLE)

🪛 markdownlint-cli2 (0.18.1)
docs/quantization/Q3_HIFI.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
tests/test-q3-hifi.py

96-96: Avoid specifying long messages outside the exception class

(TRY003)


102-102: subprocess call: check for execution of untrusted input

(S603)


122-122: Avoid specifying long messages outside the exception class

(TRY003)


143-143: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


181-181: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🔇 Additional comments (56)
benchmark_speed_test.ps1 (2)

22-33: LGTM!

Good defensive validation to fail fast if required files are missing.


268-296: LGTM!

Export logic properly handles both summary statistics (CSV) and raw data (JSON) with timestamped filenames to prevent overwrites.

ggml/src/ggml-metal/ggml-metal-impl.h (1)

35-37: LGTM!

The new constants N_R0_Q3_HIFI and N_SG_Q3_HIFI are correctly positioned after the Q3_K definitions and follow the existing naming conventions. The values (2, 2) match Q3_K, which makes sense given Q3_HIFI is described as Q3_K-compatible.

ggml/include/ggml.h (2)

379-381: Clear documentation for the new type.

The comments clearly document that Q3_HIFI uses a Q3_K-compatible layout with 6 FP16 outliers, and that the block structure is defined in ggml-common.h for GPU backend compatibility. This is helpful for developers working with this format.


428-429: Correctly added at the end of the enum.

The new GGML_TYPE_Q3_HIFI = 40 is properly added at the end of the enum as recommended by the comment on line 386, maintaining backward compatibility. GGML_TYPE_COUNT is correctly incremented to 41.

Note: The ggml_ftype enum (lines 439-465) doesn't have a corresponding GGML_FTYPE_MOSTLY_Q3_HIFI entry. If Q3_HIFI models are intended to be saved as GGUF files with a file type indicator, you may need to add this entry. However, this might be handled elsewhere (e.g., in llama.cpp's file type mappings).

ggml/src/ggml-sycl/dequantize.hpp (1)

348-423: LGTM - Q3_HIFI dequantization implementation.

The implementation correctly follows the Q3_K dequantization pattern and adds outlier restoration logic. The two code paths for QK_K == 256 and other values are consistent with the existing Q3_K implementation.

IMatrix_Guide.md (1)

1-425: Excellent documentation for imatrix workflow.

This is a well-structured and comprehensive guide covering the complete imatrix workflow. The sections on Q3_HIFI integration (lines 324-349) are particularly valuable for users of the new quantization format.

ggml/src/ggml-cuda/mmq.cu (1)

254-258: Comment-only clarification about Q3_HIFI routing looks correct.

Annotating that Q3_HIFI is excluded from the MMQ path and handled via MMVQ/dequant keeps the control-flow intent clear without changing behavior.

ggml/src/ggml-cuda/ggml-cuda.cu (1)

4375-4401: Q3_HIFI enablement for CUDA matmul is consistent with other K-quant types.

Including GGML_TYPE_Q3_HIFI in the supported MUL_MAT / MUL_MAT_ID weight types matches how other Q3/Q4_K variants are handled and aligns with the dedicated MMVQ/dequant path for Q3_HIFI.

Please make sure end‑to‑end CUDA tests cover Q3_HIFI matmuls (both plain and _ID) so this wiring is exercised across shapes/backends.

tools/quantize/quantize.cpp (1)

46-47: New Q3_HIFI quant option wiring looks correct.

The QUANT_OPTIONS entry for "Q3_HIFI" cleanly maps to LLAMA_FTYPE_MOSTLY_Q3_HIFI and matches the descriptive intent of an adaptive Q3_HIFI/Q3_K/Q4_K mix. Just keep the bpw/behavior string aligned with the Q3_HIFI documentation as upstream numbers evolve.

ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (1)

45-70: Q3_HIFI wiring into Vulkan shader generator matches existing per-type patterns

Adding "q3_hifi" to type_names and routing it through the specialized mul_mat_vec_q3_hifi.comp path mirrors how _k and iq* types are handled. It should participate correctly in dequant, matmul, and arr_dmmv_* generation without affecting existing types.

Also applies to: 669-679

ggml/src/ggml-metal/ggml-metal-device.cpp (1)

637-646: Metal mul_mv/mul_mv_id correctly extend dispatch to Q3_HIFI

The new GGML_TYPE_Q3_HIFI cases mirror the existing GGML_TYPE_Q3_K mapping (setting nsg/nr0 via N_SG_Q3_HIFI/N_R0_Q3_HIFI) and leave smem at the default, which matches nearby patterns. As long as these constants and kernels are defined consistently in the Metal backend, this wiring looks correct.

Also applies to: 855-863

src/llama-model-loader.cpp (1)

29-64: Loader/display integration for LLAMA_FTYPE_MOSTLY_Q3_HIFI is coherent

The new LLAMA_FTYPE_MOSTLY_Q3_HIFI case in both llama_model_ftype_name() and the type_maxftype switch keeps the guessed file type and the human‑readable label in sync with GGML_TYPE_Q3_HIFI. Since GGUF’s general.file_type can still override the guess, this is a safe, backwards‑compatible extension.

Also applies to: 641-672

ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp (1)

10-51: IQ1_S Vulkan mat‑vec shader changes are cosmetic/optimization-only

The added [[unroll]] attributes and minor reshuffling of temporaries in calc_superblock() preserve the original math (same bit patterns, fma/dot usage, and accumulation into temp[j][n]). Functionally this should be a no-op, with potential small perf wins depending on the driver.

ggml/src/ggml-sycl/mmvq.cpp (1)

718-739: SYCL MMVQ: Q3_HIFI mat‑vec path is consistent with existing _K implementations

mul_mat_vec_q3_hifi_q8_1_sycl() reuses the generic mul_mat_vec_q with QK_K/QI3_K, a dedicated block_q3_hifi, and vec_dot_q3_hifi_q8_1, and the launcher mirrors the other *_K variants. The dispatcher’s new GGML_TYPE_Q3_HIFI case correctly routes Q3_HIFI tensors here, so the SYCL backend should now handle this quantization type end‑to‑end.

Also applies to: 1099-1101

ggml/src/ggml-cuda/common.cuh (1)

828-833: Q3_HIFI CUDA traits correctly mirror Q3_K geometry

ggml_cuda_type_traits<GGML_TYPE_Q3_HIFI> reuses QK_K/QR3_K/QI3_K, which keeps the CUDA block geometry aligned with existing Q3_K kernels and matches the “Q3_K‑compatible” layout intent. No issues from this specialization alone.

ggml/src/ggml-cpu/ggml-cpu.c (1)

282-287: LGTM! Q3_HIFI type trait correctly registered.

The new type trait entry for GGML_TYPE_Q3_HIFI follows the established pattern and correctly references the quantization and vector dot-product functions. The configuration (using GGML_TYPE_Q8_K as vec_dot_type and nrows = 1) is consistent with the similar Q3_K type.

ggml/src/ggml-cuda/mmvq.cu (3)

20-20: LGTM! Q3_HIFI vector dot dispatch added correctly.

The dispatch for vec_dot_q3_hifi_q8_1 is properly placed and follows the established pattern for other quantization types.


47-47: Verify that Q3_HIFI can safely reuse Q3_K's VDR constant.

Line 47 returns VDR_Q3_K_Q8_1_MMVQ for GGML_TYPE_Q3_HIFI with a comment indicating it's the same as Q3_K. While this may be intentional, please confirm that Q3_HIFI's block structure and quantization layout are compatible with Q3_K's vector dot reduction parameter. If they differ structurally (e.g., different numbers of outliers or quant layouts), a dedicated VDR_Q3_HIFI_Q8_1_MMVQ constant might be needed.

Based on learnings, please note: This is a mirrored PR, so any issues should be reported to ngxson to relay to the upstream contributor.


529-534: LGTM! Q3_HIFI switch case properly implemented.

The switch case for GGML_TYPE_Q3_HIFI correctly follows the established pattern and will properly dispatch to the Q3_HIFI-specific matrix-vector multiplication path.

ggml/src/ggml-cpu/ops.cpp (7)

667-692: Add: Q3_HIFI wired into quantized add path correctly

Including GGML_TYPE_Q3_HIFI in the quantized branch reuses ggml_compute_forward_add_q_f32, aligned with other Q* and IQ* formats; no issues spotted.


1116-1142: Add1: Q3_HIFI shares the existing quantized scalar-add implementation

GGML_TYPE_Q3_HIFI is dispatched to ggml_compute_forward_add1_q_f32 alongside other quantized types, which is consistent with the existing design.


1244-1273: Acc: Q3_HIFI explicitly marked unsupported like other quant/low-prec types

Adding GGML_TYPE_Q3_HIFI into the grouped cases that end in GGML_ABORT("fatal error") keeps behavior consistent with other non‑F32 types for ggml_compute_forward_acc.


4269-4295: Out_prod: Q3_HIFI routed through quantized out_prod_q_f32

Dispatching GGML_TYPE_Q3_HIFI to ggml_compute_forward_out_prod_q_f32 mirrors handling of Q3_K and other quantized types and relies on the new to_float trait; looks correct.


4544-4572: Set: Q3_HIFI added to unsupported quantized/set group

GGML_TYPE_Q3_HIFI is listed among the quantized types that currently cause a fatal abort for ggml_compute_forward_set, matching existing behavior for other q-types.


4767-4794: Get_rows: Q3_HIFI enabled for quantized get_rows_q dequantization

Including GGML_TYPE_Q3_HIFI in the quantized branch so it uses ggml_compute_forward_get_rows_q is consistent with the rest of the quantized formats.


5493-5527: Clamp: Q3_HIFI explicitly treated as unsupported in clamp switch

Adding GGML_TYPE_Q3_HIFI to the clamp switch’s quantized/unsupported group ensures it’s rejected the same way as other non‑F32/F16 types.

tests/test-q3-hifi-text.txt (1)

1-46: LGTM - Good test data variety.

The combination of simple narrative (children's story) and complex technical content (quantum computing exposition) provides a good range of text complexity for perplexity validation testing of the Q3_HIFI quantization format.

ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp (1)

41-56: LGTM - Outlier handling is GPU-appropriate.

The fully unrolled outlier check without early-exit is the correct approach for GPU shaders to avoid divergent control flow. The Q3_K dequantization logic followed by outlier substitution matches the CPU implementation pattern.
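In scalar form, the divergence-free pattern being described looks something like this sketch (illustrative C++, not the shader itself):

```cpp
// Illustrative scalar version of the shader's unrolled, early-exit-free
// outlier substitution: every lane evaluates all outlier slots and selects,
// so all invocations execute the same instruction sequence.
float substitute_outliers(float v, int my_idx,
                          const int   outlier_idx[],  // positions within the block
                          const float outlier_vals[], // FP16 values, widened here
                          int n_outliers) {
    for (int k = 0; k < n_outliers; ++k) {
        v = (outlier_idx[k] == my_idx) ? outlier_vals[k] : v; // select, don't branch
    }
    return v;
}
```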

ggml/src/ggml-cpu/quants.c (2)

69-73: LGTM!

The quantize_row_q3_hifi wrapper follows the established pattern of other quantization functions in this file, correctly delegating to the reference implementation after validating block alignment.


556-564: LGTM - Function structure follows established patterns.

The generic Q3_HIFI dot-product implementation correctly follows the Q3_K pattern with appropriate assertions, setup, and the addition of outlier corrections. The unrolled outlier handling is a reasonable performance choice for the generic fallback.

ggml/src/ggml-sycl/vecdotq.hpp (1)

801-855: Outlier-aware Q3_HIFI dot path looks consistent; please double-check the per-thread outlier gating.

The structure of vec_dot_q3_hifi_q8_1 is good: it reuses the Q3_K bulk kernel and then adds an explicit outlier correction term based on outlier_idx/outlier_vals, which is exactly what we want for Q3_HIFI.

The only part that’s hard to validate locally is the per-thread gating logic for outliers:

```cpp
const int idx_bq8 = idx / QK8_1;
const int idx_in_bq8 = idx % QK8_1;
...
if (idx_bq8 >= bq8_offset && idx_bq8 < bq8_offset + QR3_K) {
    const int thread_q8_offset = iqs % QI8_1;
    const int pos_in_q8_group = idx_in_bq8 / 4;
    if (pos_in_q8_group == thread_q8_offset) {
        ...
        sum += outlier_val * q8_val * d8_val;
    }
}
```

Given how iqs, QR3_K, QI3_K, QK8_1, and QI8_1 interact in the MMVQ kernels, it would be good to confirm that:

  • Every valid outlier index is visited by exactly one (block, iqs) thread.
  • No outlier is silently skipped because pos_in_q8_group / thread_q8_offset don’t line up with the actual packing in bq8_1[*].qs.

You can verify this by diffing this function against the upstream CUDA vec_dot_q3_hifi_q8_1 implementation and by running the Q3_HIFI tests/benchmarks from the upstream PR (including any SYCL-specific tests) to ensure dot products match the reference CPU path across a range of shapes and random seeds. If there is any discrepancy, the bug is likely in the pos_in_q8_group vs thread_q8_offset mapping.

gguf-py/gguf/constants.py (1)

3220-3220: Q3_HIFI enum and size wiring look consistent; keep them in sync with ggml headers.

The additions

  • GGMLQuantizationType.Q3_HIFI = 41
  • LlamaFileType.MOSTLY_Q3_HIFI = 41
  • GGML_QUANT_SIZES[Q3_HIFI] = (256, 134) # Q3_K (110 bytes) + outlier_idx[8] + outlier_vals[16]

are internally consistent and match the documented Q3_HIFI layout (Q3_K base + 8 outliers).
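As a quick arithmetic cross-check of the 134-byte figure (assuming the standard Q3_K base of hmask[32] + qs[64] + scales[12] + one fp16 d):

```cpp
// Sanity check of the (256, 134) entry: Q3_K base plus the 8-outlier tail.
constexpr int q3_k_base    = 32 + 64 + 12 + 2; // hmask + qs + scales + fp16 d = 110
constexpr int outlier_idx  = 8 * 1;            // 8 uint8 indices
constexpr int outlier_vals = 8 * 2;            // 8 fp16 values
static_assert(q3_k_base + outlier_idx + outlier_vals == 134, "Q3_HIFI block size");
```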

The only follow-up I’d suggest is to double-check that:

  • GGML_TYPE_Q3_HIFI in ggml.h also uses id 41, and
  • block_q3_hifi’s sizeof matches 134 bytes,

to avoid silent GGUF/C-side drift when upstream tweaks the layout.

You can confirm this quickly by checking the upstream ggml.h and ggml-common.h for the GGML_TYPE_Q3_HIFI enum value and the definition of block_q3_hifi, and ensuring the size expression evaluates to 134 for QK_K = 256. Based on learnings, please relay any mismatch upstream via ngxson.

Also applies to: 3272-3274, 3370-3371

ggml/src/ggml-quants.h (1)

33-34: Q3_HIFI quant/dequant API surface is consistent with existing K-quants.

The new declarations for

  • quantize_row_q3_hifi_ref(...)
  • dequantize_row_q3_hifi(...)
  • quantize_q3_hifi(...)

are shaped and placed consistently with the existing K-quant APIs (q3_K, q4_K, etc.), which keeps the CPU backend and quantization dispatcher uniform.

As long as the implementations in ggml-quants.c and the arch-specific backends (AVX2, NEON, etc.) match these signatures, this header change looks good.

If you haven’t already, it’s worth grepping for quantize_row_q3_hifi_ref, dequantize_row_q3_hifi, and quantize_q3_hifi across the repo to ensure:

  • All declarations and definitions agree on parameter types and order.
  • No backend (e.g. CUDA/Metal/SYCL callers) still assumes only the older Q3_K entry points.

Also applies to: 106-108
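If the Q3_HIFI entry points mirror the q3_K ones, the expected shapes are roughly as follows (a hedged reconstruction from the K-quant pattern, not copied from the header):

```cpp
#include <cstddef>
#include <cstdint>

struct block_q3_hifi; // opaque here; the real definition lives in ggml-common.h

// Expected API shapes by analogy with the q3_K declarations; qualifiers such
// as GGML_RESTRICT are omitted and the actual header is authoritative.
void   quantize_row_q3_hifi_ref(const float * x, block_q3_hifi * y, int64_t k);
void   dequantize_row_q3_hifi(const block_q3_hifi * x, float * y, int64_t k);
size_t quantize_q3_hifi(const float * src, void * dst,
                        int64_t nrows, int64_t n_per_row, const float * quant_weights);
```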

ggml/src/ggml-cuda/convert.cu (2)

569-573: LGTM!

The launcher follows the established Q3_K pattern correctly.


732-733: LGTM!

Dispatch integration follows the established pattern and is correctly placed after Q3_K.

Also applies to: 785-786

ggml/src/ggml-cpu/arch/x86/quants.c (4)

2334-2351: LGTM!

Function signature and setup follow the established Q3_K pattern correctly.


2353-2455: LGTM!

The AVX2 implementation correctly mirrors the Q3_K kernel while properly handling the ggml_fp16_t scale field specific to Q3_HIFI.


2481-2484: LGTM!

The fallback to the generic implementation follows the established pattern.


3974-3975: The comment references a non-existent function.

The function dequantize_row_q3_hifi does not exist. The correct function name for Q3_K dequantization is dequantize_row_q3_K, not dequantize_row_q3_hifi. Update the comment to reference the correct function name.

Likely an incorrect or invalid review comment.

ggml/src/ggml-common.h (1)

291-306: Well-designed struct with Q3_K compatibility.

The approach of keeping the first 110 bytes identical to block_q3_K and appending the outlier extension is sound for kernel reuse. The compile-time static_assert provides good validation of the expected layout. The "DO NOT REORDER" comment is helpful for future maintainers.
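Under those constraints, the layout presumably looks like the following sketch (field sizes inferred from the 110-byte Q3_K base plus the 8-outlier tail; the authoritative definition is in ggml-common.h):

```cpp
#include <cstdint>
using ggml_half = uint16_t; // stand-in for the real FP16 typedef

// Sketch of the described block_q3_hifi layout; do not treat as the real
// definition. The first 110 bytes mirror block_q3_K, the outlier tail follows.
struct block_q3_hifi_sketch {
    uint8_t   hmask[32];       // high bits of the 3-bit quants (256/8)
    uint8_t   qs[64];          // low 2 bits of the quants (256/4)
    uint8_t   scales[12];      // packed 6-bit sub-block scales
    ggml_half d;               // super-block scale
    uint8_t   outlier_idx[8];  // positions of the FP16 outliers
    ggml_half outlier_vals[8]; // the FP16 outlier values
};
static_assert(sizeof(block_q3_hifi_sketch) == 134, "expected 110 + 8 + 16 bytes");
```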

tests/test-q3-hifi.py (1)

161-175: Sensible test configuration for quick validation.

Using -c 256 and --chunks 2 keeps the test fast for CI while still validating that the Q3_HIFI format works correctly. Combining stdout/stderr ensures PPL extraction works regardless of where the output goes.

ggml/src/ggml-cpu/arch/arm/quants.c (2)

2158-2173: Update comment to reflect 8 outliers, not 6.

Line 2158 incorrectly states "fully unrolled for 6 outliers" while the code processes 8 (indices 0-7 at lines 2166-2173), matching the Q3_HIFI_OUTLIERS constant. Also verify the macro inconsistency: the main loop uses GGML_CPU_FP16_TO_FP32 (line 2086) while outlier correction uses GGML_FP16_TO_FP32 (lines 2166-2173)—both within the same ARM NEON function, suggesting a potential copy-paste error.


4195-4257: The ARM NEON implementation already has fallback coverage. The generic implementation in ggml/src/ggml-quants.c is compiled into the main library and automatically available on all architectures, including non-NEON ARM systems. The absence of an explicit #else clause in the ARM-specific file is not problematic for linking. However, adding a comment clarifying that the generic implementation provides the fallback would improve code clarity.

ggml/src/ggml-quants.c (3)

1281-1333: LGTM - Reference quantization implementation is sound.

The outlier selection using magnitude-based ranking and masking with -1.0f works correctly since fabsf() ensures non-negative values. The block-level processing and field copying are appropriate.


1371-1373: Bulk quantization uses _ref instead of _impl - verify intentional.

The weighted magnitude at line 1347 affects outlier selection, but quantize_row_q3_K_ref (not _impl) is called for bulk quantization. This means quant_weights don't influence the Q3_K portion. If weights should affect bulk quantization quality, consider:

```diff
-        quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
+        // Note: weights only affect outlier selection, not bulk Q3_K quantization
+        quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
```

Or use quantize_row_q3_K_impl if bulk should be weight-aware.


1411-1424: LGTM - Entry point follows established patterns.

The routing logic and row-by-row processing match other quantize_q*_K functions in the file.

ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp (3)

24-31: Barrier placement in non-uniform execution path.

The barrier at line 27 is inside a conditional block (if (!all_threads)). When all_threads is false, some invocations may skip the barrier while others execute it, which is undefined behavior in GLSL. However, looking at the logic:

  • When !all_threads, barrier is at line 27
  • When all_threads, barrier is at line 48

This appears correct since the condition is uniform across the workgroup (determined by loop bounds). The early continue at line 30 is after the barrier, which is safe.

Please verify that all_threads is indeed uniform across all invocations in the workgroup, as non-uniform barrier execution would cause undefined behavior.


81-122: LGTM - Standard mat-vec computation structure.

The work partitioning between full iterations (all_threads=true) and partial remainder (all_threads=false) correctly handles non-power-of-2 block counts.


124-135: LGTM - Boundary handling is correct.

The shader properly handles partial workgroups at the tensor boundary by computing the remaining rows when first_row + NUM_ROWS > p.stride_d.

ggml/src/ggml-vulkan/ggml-vulkan.cpp (3)

3583-3583: LGTM!

The pipeline creation for Q3_HIFI mul_mat_vec operations follows the established K-quant pattern correctly, using the appropriate reduc16 variants and workgroup configurations consistent with Q3_K.

Also applies to: 3608-3608


3705-3705: LGTM!

The dequantization and get_rows pipeline registrations for Q3_HIFI are correctly added with appropriate workgroup sizes matching the K-quant pattern. The code references the expected shader data arrays (dequant_q3_hifi_*, get_rows_q3_hifi_*) which should be generated from the corresponding shader files.

Also applies to: 3731-3731, 3757-3757


5407-5407: LGTM!

All switch statement additions correctly include GGML_TYPE_Q3_HIFI in the fallthrough blocks alongside other K-quant types. This ensures proper operator support validation and pipeline selection for the new quantization type. The ordering is consistent (always placed after Q3_K) across all locations.

Also applies to: 5479-5479, 5543-5543, 5634-5634, 5701-5701, 13850-13850, 13971-13971

ggml/src/ggml-metal/ggml-metal.metal (3)

9693-9693: Q3_HIFI get_rows wiring looks consistent but depends on dequantize_q3_hifi correctness

Registering kernel_get_rows_q3_hifi as kernel_get_rows_q<block_q3_hifi, QK_NL, dequantize_q3_hifi> mirrors the existing Q3_K path. Once dequantize_q3_hifi is fixed/confirmed, this instantiation should behave correctly and does not need further changes.


9756-9757: kernel_mul_mm_q3_hifi_{f32,f16} instantiations correctly mirror Q3_K path

The new mat‑mat kernels for Q3_HIFI (kernel_mul_mm_q3_hifi_f32 and _f16) hook into the generic kernel_mul_mm template using block_q3_hifi and dequantize_q3_hifi, consistent with the existing Q3_K registrations. Provided dequantize_q3_hifi matches the reference semantics, these registrations look good.

Also applies to: 9783-9784


9816-9817: Q3_HIFI indirect matmul/matvec registrations look fine but rely on core kernel correctness

  • kernel_mul_mm_id_q3_hifi_{f32,f16} reuse the same mul_mm_id template as other Q3_* types, with block_q3_hifi and dequantize_q3_hifi.
  • kernel_mul_mv_id_q3_hifi_f32 wires kernel_mul_mv_q3_hifi_f32_impl<N_R0_Q3_HIFI> through the existing mmv_fn indirection.

These are mechanically consistent with the other quant types. Once the base dequantize_q3_hifi and kernel_mul_mv_q3_hifi_f32_impl semantics are confirmed/fixed, these indirect paths should work as expected.

Also applies to: 9843-9844, 9999-10000

Comment on lines +196 to +221
```powershell
# Find the fastest model for comparison
$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum

# Detailed results table
Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
Write-Host "-" * 70

$TableHeader = "{0,-15} {1,10} {2,10} {3,10} {4,10} {5,10} {6,10}" -f "Model", "Mean", "StdDev", "Median", "Min", "Max", "vs Best"
Write-Host $TableHeader -ForegroundColor White
Write-Host "-" * 70

foreach ($model in $Models) {
    $stats = $AllStats[$model.Name]
    $vsBest = if ($stats.Mean -eq $FastestMean) { "FASTEST" } else {
        "-" + [math]::Round((1 - $stats.Mean / $FastestMean) * 100, 1) + "%"
    }

    $row = "{0,-15} {1,10:F2} {2,10:F2} {3,10:F2} {4,10:F2} {5,10:F2} {6,10}" -f `
        $model.Name, $stats.Mean, $stats.StdDev, $stats.Median, $stats.Min, $stats.Max, $vsBest

    if ($stats.Mean -eq $FastestMean) {
        Write-Host $row -ForegroundColor Green
    } else {
        Write-Host $row
    }
}
```

⚠️ Potential issue | 🔴 Critical

Fix potential division by zero.

If all benchmark runs fail (no speeds collected), $FastestMean will be 0, causing a division-by-zero error on Line 210. This will crash the script instead of showing that all benchmarks failed.

🔎 Proposed fix
```diff
 # Find the fastest model for comparison
-$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum
+$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum
+if ($FastestMean -eq 0) {
+    Write-Warning "All benchmarks failed - no valid speed data collected"
+    $FastestMean = 1  # Prevent division by zero
+}

 # Detailed results table
 Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
```

Comment on lines +10425 to 10427
"--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "q3_hifi", "auto"], default="f16",
help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, q3_hifi for Q3_HIFI (3-bit with outliers), and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)

⚠️ Potential issue | 🔴 Critical

--outtype q3_hifi currently leads to a runtime error (no quant mapping for MOSTLY_Q3_HIFI).

You’ve exposed q3_hifi in the CLI and ftype_map, but prepare_tensors() never maps gguf.LlamaFileType.MOSTLY_Q3_HIFI to a gguf.GGMLQuantizationType. With the current code, running:

python convert_hf_to_gguf.py --outtype q3_hifi ...

will hit the else: raise ValueError(f"Unknown file type: {self.ftype.name}") branch in prepare_tensors() when data_qtype is still False.

You probably want to wire Q3_HIFI through the quantization mapping so the new outtype is usable.

Proposed fix: map MOSTLY_Q3_HIFI to GGMLQuantizationType.Q3_HIFI

Add a case alongside the other file types in ModelBase.prepare_tensors:

                 if isinstance(data_qtype, bool):
                     if self.ftype == gguf.LlamaFileType.ALL_F32:
                         data_qtype = gguf.GGMLQuantizationType.F32
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_F16:
                         data_qtype = gguf.GGMLQuantizationType.F16
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_BF16:
                         data_qtype = gguf.GGMLQuantizationType.BF16
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0:
                         data_qtype = gguf.GGMLQuantizationType.Q8_0
+                    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q3_HIFI:
+                        data_qtype = gguf.GGMLQuantizationType.Q3_HIFI
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ1_0:
                         data_qtype = gguf.GGMLQuantizationType.TQ1_0
                     elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ2_0:
                         data_qtype = gguf.GGMLQuantizationType.TQ2_0
                     else:
                         raise ValueError(f"Unknown file type: {self.ftype.name}")

(Adjust the enum name if the quantization type is spelled differently in gguf.GGMLQuantizationType.)

Since this repo mirrors upstream, you may want to pass this concern along to the upstream PR author as well. Based on learnings, this should be relayed via ngxson to the original contributor.

Also applies to: 10583-10592

🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 10425-10427 (and similarly 10583-10592),
the new CLI outtype "q3_hifi" is exposed but prepare_tensors() never maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to a GGML quantization type, causing the
"Unknown file type" error; update the file-type -> quantization mapping in
ModelBase.prepare_tensors to include a branch that maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to gguf.GGMLQuantizationType.Q3_HIFI (or the
exact enum name used by gguf), mirroring the pattern used for the other file
types so q3_hifi runs without raising the ValueError.

Comment on lines +64 to +73
```
Perplexity Improvement (Q3_HIFI vs Q3_K_M)
═══════════════════════════════════════════════════════
0.6B ████████████████████████████████████ -16.4%
1.7B ██████████████████████████████████████████ -21.4%
4B ██████████████████ -7.3%
8B ███████████ -4.4%
14B ████ -1.6%
32B █████ -2.0%
```

⚠️ Potential issue | 🟡 Minor

Add a language tag to the fenced code block (MD040).

The fenced block starting at line 64 opens with ``` and no language, triggering markdownlint MD040. Consider using ```text (or ```none) to satisfy tooling and keep rendering consistent.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/quantization/Q3_HIFI.md around lines 64 to 73, the fenced code block
lacks a language tag which triggers markdownlint MD040; update the opening fence
to include a neutral language (e.g., "text" or "none") so the block is
explicitly marked as plain text, ensuring the linter passes and rendering stays
consistent.

Comment on lines +302 to +304
// === OUTLIER EXTENSION (18 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS]; // 6 bytes: outlier positions (0-255)
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values

⚠️ Potential issue | 🟡 Minor

Comments have incorrect byte sizes for outlier extension.

The comments don't match the actual sizes based on Q3_HIFI_OUTLIERS = 8:

  • Line 302: Comment says "18 bytes" but extension is actually 24 bytes (8 + 16)
  • Line 303: Comment says "6 bytes" but outlier_idx[8] is 8 bytes
  • Line 304: Comment says "12 bytes" but outlier_vals[8] × 2 bytes = 16 bytes

The code and static_assert are correct; only the comments need updating.

Suggested fix
-    // === OUTLIER EXTENSION (18 bytes) ===
-    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 6 bytes: outlier positions (0-255)
-    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values
+    // === OUTLIER EXTENSION (24 bytes) ===
+    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 8 bytes: outlier positions (0-255)
+    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 16 bytes: FP16 outlier values
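
For a quick sanity check, the size arithmetic can be reproduced in a standalone C sketch (the ggml_half typedef here is a stand-in assumption for ggml's 2-byte FP16 storage type, not the in-tree definition):

```c
#include <stdint.h>
#include <assert.h>

#define Q3_HIFI_OUTLIERS 8
typedef uint16_t ggml_half; // stand-in for ggml's 2-byte FP16 storage type

// Outlier extension: 8 one-byte indices + 8 two-byte values = 24 bytes,
// matching the static_assert rather than the stale "18 bytes" comment.
static_assert(sizeof(uint8_t)   * Q3_HIFI_OUTLIERS ==  8, "outlier_idx: 8 bytes");
static_assert(sizeof(ggml_half) * Q3_HIFI_OUTLIERS == 16, "outlier_vals: 16 bytes");
static_assert(8 + 16 == 24, "extension is 24 bytes, not 18");
```
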
🤖 Prompt for AI Agents
In ggml/src/ggml-common.h around lines 302 to 304, the inline comments for the
outlier extension show incorrect byte counts; update them to match
Q3_HIFI_OUTLIERS = 8 so the extension is 24 bytes, outlier_idx[Q3_HIFI_OUTLIERS]
is 8 bytes, and outlier_vals[Q3_HIFI_OUTLIERS] (ggml_half) is 16 bytes; leave
the code and static_assert untouched and only change the comment texts to the
correct sizes.

Comment on lines +2457 to +2477
// Q3_HIFI: Add outlier corrections
// Fully unrolled loop for 6 outliers - eliminates loop overhead
// Note: We tried branchless masking but the computation cost outweighs
// any branch misprediction savings for only 6 outliers per block.
for (int i = 0; i < nb; ++i) {
const float d_y = y[i].d;
const int8_t * GGML_RESTRICT q8 = y[i].qs;
const uint8_t * GGML_RESTRICT idx = x[i].outlier_idx;
const ggml_fp16_t * GGML_RESTRICT vals = x[i].outlier_vals;

// Unrolled: process all 8 outliers without loop overhead
// Using FMA-friendly pattern: accumulate (w * a) * d_y
sumf += GGML_FP16_TO_FP32(vals[0]) * (float)q8[idx[0]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[1]) * (float)q8[idx[1]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[2]) * (float)q8[idx[2]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[3]) * (float)q8[idx[3]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[4]) * (float)q8[idx[4]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[5]) * (float)q8[idx[5]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[6]) * (float)q8[idx[6]] * d_y;
sumf += GGML_FP16_TO_FP32(vals[7]) * (float)q8[idx[7]] * d_y;
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for Q3_HIFI_OUTLIERS constant definition
rg -n 'Q3_HIFI_OUTLIERS' --type c --type h -C2

Repository: ngxson/llama.cpp

Length of output: 5216


🏁 Script executed:

#!/bin/bash
# Search for outlier_idx and outlier_vals declarations to check array sizes
rg -n 'outlier_idx|outlier_vals' --type c --type h -B2 -A2

Repository: ngxson/llama.cpp

Length of output: 5851


🏁 Script executed:

#!/bin/bash
# Search for Q3_HIFI struct definition
rg -n 'struct.*Q3_HIFI|{.*outlier_idx.*outlier_vals' --type c --type h -C5

Repository: ngxson/llama.cpp

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Get the exact lines around 2457-2477 from the x86 quants.c file
sed -n '2457,2477p' ggml/src/ggml-cpu/arch/x86/quants.c | cat -n

Repository: ngxson/llama.cpp

Length of output: 1409


Comments reference "6 outliers" but code processes 8.

Lines 2458 and 2460 mention "6 outliers" while the loop unrolls exactly 8 iterations (lines 2469-2476) and Q3_HIFI_OUTLIERS is defined as 8. Update the comments:

-    // Fully unrolled loop for 6 outliers - eliminates loop overhead
-    // Note: We tried branchless masking but the computation cost outweighs
-    // any branch misprediction savings for only 6 outliers per block.
+    // Fully unrolled loop for 8 outliers - eliminates loop overhead
+    // Note: We tried branchless masking but the computation cost outweighs
+    // any branch misprediction savings for only 8 outliers per block.

The code is correct; only the comments need updating to match Q3_HIFI_OUTLIERS.

🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/arch/x86/quants.c around lines 2457 to 2477, the inline
comments incorrectly reference "6 outliers" while the loop unroll processes 8
outliers (and Q3_HIFI_OUTLIERS is defined as 8); update the comments to say "8
outliers" (or reference Q3_HIFI_OUTLIERS) and remove mention of 6 to match the
actual unrolled code and constant, e.g., change the comment lines that read "6
outliers" to "8 outliers" or "Q3_HIFI_OUTLIERS outliers" so the comments
accurately reflect the implementation.

Comment on lines +3 to +4
// Q3_HIFI dequantization shader
// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers

⚠️ Potential issue | 🟡 Minor

Comment states 6 outliers, but implementation uses 8.

The comment says "6 FP16 outliers" but line 49 loops over Q3_HIFI_OUTLIERS which is defined as 8 elsewhere in the codebase. Update the comment to match the actual implementation.

🔎 Suggested fix
 // Q3_HIFI dequantization shader
-// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers
+// Uses Q3_K-compatible layout (hmask + qs + scales) with 8 FP16 outliers
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp around lines 3 to 4,
the header comment states "6 FP16 outliers" but the implementation loops over
Q3_HIFI_OUTLIERS which is defined as 8; update the comment to reflect the actual
implementation (e.g., change "6 FP16 outliers" to "8 FP16 outliers" or remove
the hardcoded count) so the comment matches the code.

Comment on lines +4 to +6
// Q3_HIFI matrix-vector multiplication shader
// Uses Q3_K-compatible layout, outlier correction skipped on GPU for simplicity
// (outliers are still applied on CPU for full quality)

⚠️ Potential issue | 🟠 Major

GPU quality differs from CPU - outlier correction skipped.

The comment explicitly states outlier correction is skipped on GPU. This is a significant quality difference since outliers are the defining feature of Q3_HIFI that distinguishes it from Q3_K. Users may not expect different results between CPU and GPU backends.

Consider one of the following:

  1. Implementing outlier correction in the shader (adds ~8 FMA ops per block; a sketch follows this list)
  2. Documenting this limitation prominently in user-facing documentation
  3. Adding a warning when Q3_HIFI models are used on Vulkan
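
To make option 1 concrete, here is a minimal C sketch of the per-block adjustment the shader would mirror, modeled on the CPU reference path quoted earlier for the x86 dot product. The function name and the float-activation layout are illustrative assumptions, not the in-tree API; a GLSL port would express the same loop with unpackHalf2x16 or float16_t for the FP16 weights:

```c
#include <stdint.h>

// Per-block outlier correction the shader would add after the base Q3 dot
// product. w[k] is the k-th FP16 outlier weight already converted to float
// (GGML_FP16_TO_FP32 on CPU), idx[k] its position within the 256-wide block,
// and y points at this block's float activations.
static float q3_hifi_outlier_correction(const uint8_t idx[8],
                                        const float   w[8],
                                        const float * y) {
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        sum += w[k] * y[idx[k]]; // ~8 FMA ops per block
    }
    return sum;
}
```
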
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp around lines 4-6
the comment notes outlier correction is skipped on GPU causing Q3_HIFI results
to differ from CPU; resolve by either (A) implement outlier correction in the
shader by adding the per-block outlier adjustment: load outlier values per
block, apply the ~8 FMA operations per block after the core dot-product
accumulation and before writing results, taking care to preserve precision and
memory alignment and testing performance impact; or (B) if shader implementation
is unacceptable, add a prominent user-facing change: update documentation to
clearly state GPU backend skips outlier correction for Q3_HIFI and add a runtime
warning emitted when a Q3_HIFI model is loaded on Vulkan to notify users of
potential quality differences.

Comment on lines +298 to +301
else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_HIFI) {
// Adaptive Q3_HIFI: use Q3_HIFI for ALL attn_v layers (consistently sensitive)
new_type = GGML_TYPE_Q3_HIFI;
}

⚠️ Potential issue | 🟠 Major

Adaptive Q3_HIFI policy is coherent, but fallback handling for GGML_TYPE_Q3_HIFI is missing.

The new ftype wiring for LLAMA_FTYPE_MOSTLY_Q3_HIFI looks logically consistent:

  • attn_v.weight → always GGML_TYPE_Q3_HIFI (these are known to be sensitive).
  • ffn_down → first ~1/3 of layers Q3_HIFI, middle band Q4_K when use_more_bits is true, rest Q3_K.
  • attn_output.weight and attn_qkv.weight → treated as Q4_K.
  • Default type for this ftype is GGML_TYPE_Q3_K, matching the “Q3_K base, Q3_HIFI on sensitive layers” comment.

However, in the convert_incompatible_tensor block at the end of llama_tensor_get_type, there is no case for GGML_TYPE_Q3_HIFI. If a tensor assigned Q3_HIFI ends up with nx % ggml_blck_size(Q3_HIFI) != 0, the switch will hit the default branch and throw:

default: throw std::runtime_error("\nUnsupported tensor size encountered\n");

Existing K-quants (including GGML_TYPE_Q3_K) instead fall back to a more permissive format (e.g., IQ4_NL) in this situation.

To make Q3_HIFI robust to odd geometries and consistent with the rest of the K-quants, add an explicit fallback case for GGML_TYPE_Q3_HIFI, e.g.:

Proposed fallback mapping for GGML_TYPE_Q3_HIFI
-        switch (new_type) {
+        switch (new_type) {
             case GGML_TYPE_TQ1_0:
             case GGML_TYPE_TQ2_0:  new_type = GGML_TYPE_Q4_0; break;  // TODO: use a symmetric type instead
             case GGML_TYPE_IQ2_XXS:
@@
-            case GGML_TYPE_Q2_K:
-            case GGML_TYPE_Q3_K:
+            case GGML_TYPE_Q2_K:
+            case GGML_TYPE_Q3_K:
+            case GGML_TYPE_Q3_HIFI:
             case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;

This preserves the “Q3_HIFI as a K-quant” behavior for valid shapes, while giving you a graceful degradation path instead of a hard failure for incompatible tensor sizes. Based on learnings, this is exactly the sort of issue that should be relayed upstream.

Also applies to: 355-360, 404-405, 413-415, 586-587

🤖 Prompt for AI Agents
In src/llama-quant.cpp around lines 298-301 (and similarly at the other affected
ranges 355-360, 404-405, 413-415, 586-587), the convert_incompatible_tensor
switch lacks a case for GGML_TYPE_Q3_HIFI which causes a runtime throw when nx %
ggml_blck_size(GGML_TYPE_Q3_HIFI) != 0; add an explicit case for
GGML_TYPE_Q3_HIFI that mirrors the existing fallback used for K-quant types
(e.g., map it to the same permissive fallback format used for GGML_TYPE_Q3_K /
other K-quant fallbacks such as the IQ4_NL-equivalent branch), so tensors that
cannot use packed Q3_HIFI gracefully degrade instead of hitting the default
error path.

Comment on lines +23 to +26
import logging

# Configuration
PPL_THRESHOLD = 25.0 # Reasonable threshold for 3-bit quantization

⚠️ Potential issue | 🟠 Major

Missing logging.basicConfig() - info/debug messages won't appear.

"The handler's level is set to WARNING, so all events at this and greater severities will be output." Since there's no logging.basicConfig() call, all logging.info() and logging.debug() calls (lines 101, 147, 148, 154, 162, 183, 184, 187) will be silently dropped. The script will appear to run silently on success, only showing output on errors.

Suggested fix

Add configuration after the imports:

 import logging
 
+# Configure logging to show info-level messages
+logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
+
 # Configuration
 PPL_THRESHOLD = 25.0  # Reasonable threshold for 3-bit quantization
🤖 Prompt for AI Agents
In tests/test-q3-hifi.py around lines 23 to 26, logging is imported but no basic
configuration is set so info/debug messages are dropped; add a
logging.basicConfig() call immediately after the imports to set an appropriate
level (INFO or DEBUG) and a simple format so logging.info()/debug() calls (lines
101, 147, 148, 154, 162, 183, 184, 187) are emitted during successful runs.

Comment on lines +84 to +89
PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")

if [ -z "$PPL" ]; then
# Try alternate format: just look for the last PPL value
PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
fi

⚠️ Potential issue | 🟠 Major

grep -oP is not portable to macOS.

The -P (Perl regex) flag is a GNU grep extension and is not available on macOS's BSD grep by default. This will cause the test to fail on macOS.

Consider using a more portable approach:

🔎 Proposed portable fix
 # Extract final perplexity value
 # Format: "Final estimate: PPL = X.XXXX +/- Y.YYYY"
-PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")
+PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*Final estimate: PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)

 if [ -z "$PPL" ]; then
     # Try alternate format: just look for the last PPL value
-    PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
+    PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)
 fi

As per coding guidelines, cross-platform compatibility should be tested on Linux, macOS, and Windows.

🤖 Prompt for AI Agents
In tests/test-q3-hifi.sh around lines 84–89, the script uses grep -oP (Perl
regex) which is not available on macOS; replace the -P usage with a
POSIX-portable extraction using grep -E (or plain grep) piped to sed or awk to
capture the numeric PPL value (e.g., match "PPL =" then extract the following
float), and keep the fallback that picks the last match if empty; ensure quoting
and exit-on-failure behavior remain unchanged so the script works on Linux and
macOS.

@ngxson
Owner Author

ngxson commented Dec 22, 2025

@coderabbitai do a review with my "mirror PR" preset. Then answer: are there any signs this PR is AI-generated?

@coderabbitai

coderabbitai bot commented Dec 22, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for suspicious AI-generated comment patterns across the codebase
echo "=== Searching for suspicious AI-generated patterns ==="
echo ""
echo "--- Looking for 'CRITICAL:' comments ---"
rg -i "CRITICAL:" --type c --type cpp --type h --type py --type sh -C 2 || echo "None found"
echo ""
echo "--- Looking for 'You are correct' patterns ---"
rg -i "you are correct" --type c --type cpp --type h --type py --type sh -C 2 || echo "None found"
echo ""
echo "--- Looking for 'NOTE:' or 'IMPORTANT:' in unusual places ---"
rg -i "^\s*(//|#)\s*(NOTE|IMPORTANT|WARNING):" --type c --type cpp --type h --type py -C 1 | head -50 || echo "None found"
echo ""
echo "--- Looking for overly verbose or explanatory comments ---"
rg -i "(obviously|clearly|as we can see|it is important to note|it should be noted)" --type c --type cpp --type h -C 2 || echo "None found"
echo ""
echo "--- Checking for TODO comments with unusual formatting ---"
rg "TODO.*:" --type c --type cpp --type h -C 1 | head -30 || echo "None found"

Length of output: 8352


🏁 Script executed:

#!/bin/bash
# Check for duplicate function declarations or suspicious code patterns
echo "=== Checking for duplicate declarations ==="
echo ""
echo "--- Looking for duplicate quantize_row_q3_hifi declarations ---"
rg "quantize_row_q3_hifi" ggml/src/ggml-cpu/quants.h -n
echo ""
echo "--- Looking for duplicate ggml_vec_dot_q3_hifi_q8_K declarations ---"
rg "ggml_vec_dot_q3_hifi_q8_K" ggml/src/ggml-cpu/quants.h -n

Length of output: 1070


🏁 Script executed:

#!/bin/bash
# Check the Q3_HIFI implementation for obvious issues
echo "=== Checking Q3_HIFI block structure definition ==="
cat ggml/src/ggml-common.h | grep -A 20 "typedef struct.*block_q3_hifi"
echo ""
echo "=== Checking static_assert for Q3_HIFI ==="
rg "static_assert.*block_q3_hifi" ggml/src/ggml-common.h -A 1

Length of output: 248


🏁 Script executed:

#!/bin/bash
# Look for inconsistent outlier counts
echo "=== Checking Q3_HIFI_OUTLIERS usage ==="
rg "Q3_HIFI_OUTLIERS" -n --type c --type cpp --type h -C 1

Length of output: 3471


🌐 Web query:

llama.cpp pull request 18246 Q3_HIFI

💡 Result:

I couldn't find a public GitHub pull request numbered 18246 for llama.cpp nor any PR titled "Q3_HIFI". Do you want me to:

  • search a specific repo or fork (ggerganov/llama.cpp, ggml-org/llama.cpp, or a fork), or
  • look for a PR with a different number or name, or
  • fetch the PR URL if you paste it?

Tell me which repo or give the link/any extra detail and I'll look it up.


@ngxson I've completed the review with your mirror PR preset. Here are my findings:

🚨 AI-Generated Code Detected

Yes, there are clear signs this PR contains AI-generated code. I found suspicious "CRITICAL:" comments that match the pattern you asked me to watch for:

  1. ggml/src/ggml-cpu/arch/x86/quants.c (line ~1139):

    // CRITICAL: Use block_q3_hifi instead of block_q3_K for correct stride (128 bytes vs 110 bytes)
  2. ggml/src/ggml-cpu/arch/arm/quants.c (line ~95):

    // CRITICAL: Use block_q3_hifi for correct 128-byte stride

These "CRITICAL:" prefixed comments are a telltale sign of AI-generated code, especially when they state obvious implementation details.

❌ Obvious Mistakes Found

1. Duplicate Function Declarations (ggml/src/ggml-cpu/quants.h)

Every Q3_HIFI function is declared twice:

  • Lines 26-27: quantize_row_q3_hifi declared twice
  • Lines 50-51: ggml_vec_dot_q3_hifi_q8_K declared twice
  • Lines 86-87: ggml_vec_dot_q3_hifi_q8_K_generic declared twice

2. Critical Outlier Count Inconsistency ⚠️

There's a serious mismatch in the outlier count:

Header definition (ggml/src/ggml-common.h):

#define Q3_HIFI_OUTLIERS   8  // Defines 8 outliers

Comments claim (multiple files):

// === OUTLIER EXTENSION (18 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS];   // 6 bytes: outlier positions
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values

Actual hardcoded loops (ggml/src/ggml-cpu/arch/x86/quants.c):

// Outlier correction (unrolled for 6 outliers)
for (int k = 0; k < 6; ++k) {  // Hardcoded to 6!

The math doesn't work:

  • If Q3_HIFI_OUTLIERS = 8: extension would be 8 + (8×2) = 24 bytes, not 18
  • Comments say "6 outliers" and "18 bytes": 6 + (6×2) = 18 bytes
  • Static assert uses 8, but code loops use 6

This inconsistency could cause memory corruption or incorrect results depending on which value is actually used at runtime.

3. Missing Upstream PR Link Verification

The upstream PR link (ggml-org#18246) couldn't be verified - the repository or PR doesn't appear to exist publicly.

Recommendations

You should relay to the upstream contributor:

  1. Remove duplicate declarations in quants.h
  2. Remove AI-generated "CRITICAL:" comments
  3. Fix the outlier count inconsistency - decide whether it should be 6 or 8 and update ALL locations consistently (header, comments, loops, static asserts)
  4. Verify the correct upstream PR URL

The outlier count issue is the most serious as it could cause runtime bugs.


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
