[Mirror] New quantization type: Q3_HIFI #65
Conversation
- Refactor quantization logic to handle quant_weights for outlier selection and improve clarity in the quantization process.
- This document provides a comprehensive comparison of three 3-bit quantization strategies: Q3_HIFI, Q3_K_S, and Q3_K_M. It includes technical specifications, performance benchmarks, and recommendations for production use.
- This guide provides a comprehensive overview of importance matrix (imatrix) files, including their purpose, generation, usage during quantization, and best practices for effective implementation.
- Added Q3_HIFI type with quantization function and placeholder for dot product implementation.
- Updated the comparison of Q3 quantization formats, including detailed descriptions of Q3_HIFI (Pure and Hybrid), Q3_K_S, and Q3_K_M. Added performance benchmarks, recommendations, and updated conclusions based on file size, quality, speed, and memory usage.
- Implemented NEON-optimized dequantization for Q3_HIFI format, processing values in blocks for efficiency.
- Added AVX2-optimized dequantization function for Q3_HIFI.
- Q3_HIFI model ready to go
Walkthrough

Introduces Q3_HIFI, a new 3-bit quantization type with 6–8 FP16 outliers per block, implemented across CPU (x86/ARM), GPU (CUDA, Metal, Vulkan, SYCL), quantization infrastructure, and model integration layers. Includes quantization/dequantization kernels, backend dispatch logic, test utilities, and reference documentation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

The changes introduce a new quantization type across 40+ files with backend-specific optimizations (AVX2, NEON, CUDA, Metal, Vulkan, SYCL). While the additions follow existing patterns and maintain modularity, the scope spans multiple architectures with custom kernel implementations, outlier handling logic, and intricate bit-manipulation operations. CPU and GPU kernels require careful validation of correctness, and the adaptive quantization routing in …

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 15
🧹 Nitpick comments (21)
.gitignore (1)
140-153: Simplify patterns and relocate to the Models section. The ignore patterns are overly specific and verbose. Additionally, they're placed outside the Models section where similar entries belong:
- Lines 140–142 list three specific `.raw` files. Instead, ignore the entire directory.
- Lines 143–153 list 11 individual files under `Qwen3-1.7B/`. Instead, use a single directory pattern.

These appear to be datasets and models used for benchmarking or testing the Q3_HIFI quantization format.
🔎 Proposed refactor to simplify patterns
Consolidate to the Models section (around line 68–74) and use directory-level patterns:
```diff
 !/models/.editorconfig
 !/models/ggml-vocab-*.gguf*
 !/models/templates
+
+# Q3_HIFI benchmark & test datasets
+/wikitext-2-raw/
+/Qwen3-1.7B/
```

Then remove lines 140–153 entirely.
Clarify whether these datasets are:
- Generated dynamically at runtime (in which case they shouldn't be in `.gitignore` at all), or
- Downloaded/populated by setup scripts (in which case they should be documented in the repository root or build instructions).
benchmark_speed_test.ps1 (3)
15-19: Consider parameterizing hardcoded paths. The script uses hardcoded relative paths that assume execution from the repository root. This will fail if run from a different directory.
🔎 Suggested refactor to make paths configurable
```diff
 param(
     [int]$Iterations = 100,
     [int]$Threads = 4,
     [int]$Repeats = 3,
     [int]$PromptTokens = 0,
-    [int]$GenerateTokens = 20
+    [int]$GenerateTokens = 20,
+    [string]$LlamaBenchPath = ".\build\bin\Release\llama-bench.exe",
+    [string]$ModelsDir = "."
 )

 $ErrorActionPreference = "Stop"

 # Configuration
-$LlamaBench = ".\build\bin\Release\llama-bench.exe"
+$LlamaBench = $LlamaBenchPath
 $Models = @(
-    @{ Name = "Q3_K_S"; Path = ".\Qwen3-1.7B-f16-Q3_K_S.gguf" },
-    @{ Name = "Q3_K_M"; Path = ".\Qwen3-1.7B-f16-Q3_K_M.gguf" },
-    @{ Name = "Q3_HIFI"; Path = ".\Qwen3-1.7B-f16-Q3_HIFI.gguf" }
+    @{ Name = "Q3_K_S"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_S.gguf" },
+    @{ Name = "Q3_K_M"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_K_M.gguf" },
+    @{ Name = "Q3_HIFI"; Path = Join-Path $ModelsDir "Qwen3-1.7B-f16-Q3_HIFI.gguf" }
 )
```
76-115: Consider validating parsed speed values. The script parses speed values from llama-bench output but doesn't validate that the extracted values are positive and reasonable. Invalid or zero values could skew statistics.
🔎 Suggested validation
```diff
 if ($lineStr -match "tg\d+\s*\|\s*([\d.]+)\s*±\s*([\d.]+)") {
     $speed = [double]$Matches[1]
-    [void]$Results[$model.Name].Speeds.Add($speed)
-    $found = $true
-    break
+    if ($speed -gt 0) {
+        [void]$Results[$model.Name].Speeds.Add($speed)
+        $found = $true
+        break
+    }
 }
 # Alternative pattern: just numbers at end of line
 elseif ($lineStr -match "\|\s*tg\d+\s*\|\s*([\d.]+)") {
     $speed = [double]$Matches[1]
-    [void]$Results[$model.Name].Speeds.Add($speed)
-    $found = $true
-    break
+    if ($speed -gt 0) {
+        [void]$Results[$model.Name].Speeds.Add($speed)
+        $found = $true
+        break
+    }
 }
```
180-243: Consider adding an error summary to the report. Errors are counted and included in the CSV export, but there's no prominent summary in the console output. Users might miss that benchmarks had failures, leading to potentially misleading statistics.
🔎 Suggested addition
Add after line 188:
```diff
 Write-Host "Total iterations per model: $Iterations"
+
+# Error summary
+$TotalErrors = ($Results.Values | ForEach-Object { $_.Errors } | Measure-Object -Sum).Sum
+if ($TotalErrors -gt 0) {
+    Write-Host ""
+    Write-Host "WARNING: $TotalErrors total benchmark failures detected" -ForegroundColor Red
+    foreach ($model in $Models) {
+        $errors = $Results[$model.Name].Errors
+        if ($errors -gt 0) {
+            Write-Host "  - $($model.Name): $errors failed runs" -ForegroundColor Yellow
+        }
+    }
+}
 Write-Host ""
```

tests/test-q3-hifi.sh (1)
102-102: Dependency on `bc` may cause failures on minimal systems. The `bc` utility may not be installed on all systems (e.g., minimal Docker containers, some CI environments). Consider adding a check or using an alternative.

🔎 Option 1: Add dependency check

```diff
+# Check for required commands
+if ! command -v bc &> /dev/null; then
+    echo "Error: 'bc' is required but not installed."
+    exit 1
+fi
+
 # Check if PPL is reasonable (less than threshold)
 if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
```

🔎 Option 2: Use awk instead (more portable)

```diff
 # Check if PPL is reasonable (less than threshold)
-if (( $(echo "$PPL < $PPL_THRESHOLD" | bc -l) )); then
+if awk "BEGIN {exit !($PPL < $PPL_THRESHOLD)}"; then
     echo "✅ Test PASSED: PPL ($PPL) is below threshold ($PPL_THRESHOLD)"
```

ggml/src/ggml-sycl/dequantize.hpp (2)
383-388: Consider performance implications of per-element outlier scanning. The outlier check loop runs for every element (4 elements per thread × 6 outlier checks = 24 comparisons). While functionally correct, this linear scan could be a bottleneck for GPU execution.
For a future optimization, consider:
- Pre-sorting outlier indices to enable binary search
- Using a bitmask or lookup table if Q3_HIFI_OUTLIERS is small and fixed
That said, with only 6 outliers per block and the break statement, this is likely acceptable.
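To illustrate the bitmask idea, here is a minimal sketch (assumptions: a 256-value block, and the helper names `build_outlier_mask`/`has_outlier` are hypothetical, not from this PR):

```c
#include <stdint.h>

/* Hypothetical sketch: build a 256-bit membership mask once per block so the
 * per-element outlier test becomes a single shift/AND instead of scanning
 * all outlier indices for every value. */
typedef struct { uint64_t bits[4]; } outlier_mask_t;

static outlier_mask_t build_outlier_mask(const uint8_t * outlier_idx, int n_outliers) {
    outlier_mask_t m = {{0, 0, 0, 0}};
    for (int k = 0; k < n_outliers; ++k) {
        const uint8_t idx = outlier_idx[k];      /* position in [0, 255] */
        m.bits[idx >> 6] |= 1ull << (idx & 63);  /* set bit idx */
    }
    return m;
}

static int has_outlier(const outlier_mask_t * m, int idx) {
    return (int)((m->bits[idx >> 6] >> (idx & 63)) & 1); /* test bit idx */
}
```

Only elements whose bit is set would then fall back to the (rare) index scan to fetch the FP16 value, keeping the common path branch-light.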
415-418: Missing early break in the else-branch outlier loop. Unlike the QK_K==256 branch (line 387), this branch doesn't use `break` after finding a match. While functionally correct (it just overwrites the same value), adding early termination would improve consistency and performance.

🔎 Proposed fix

```diff
 // Check for outliers
 int idx0 = 16*is + il;
 int idx1 = 16*is + il + 32;
+bool found0 = false, found1 = false;
 for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-    if (x[i].outlier_idx[k] == idx0) val0 = x[i].outlier_vals[k];
-    if (x[i].outlier_idx[k] == idx1) val1 = x[i].outlier_vals[k];
+    if (!found0 && x[i].outlier_idx[k] == idx0) { val0 = x[i].outlier_vals[k]; found0 = true; }
+    if (!found1 && x[i].outlier_idx[k] == idx1) { val1 = x[i].outlier_vals[k]; found1 = true; }
+    if (found0 && found1) break;
 }
```

IMatrix_Guide.md (1)
345-348: Consider qualifying the perplexity improvement claim. The statement "Typically 5-15% better perplexity" is a helpful ballpark figure, but perplexity improvements can vary significantly based on model architecture, calibration data quality, and other factors.
Consider adding a qualifier such as "results may vary based on model and calibration data" to set appropriate expectations.
ggml/src/ggml-cuda/vecdotq.cuh (1)
775-847: Defensively bound-check `outlier_idx[k]` before using it. The Q3_HIFI outlier loop assumes every `outlier_idx[k]` is a valid index into the 256-weight block. If any unused slots are left uninitialized or use a sentinel outside `[0, QK_K)`, this can lead to undefined behavior when indexing `bq8_1[idx_bq8]`/`qs[idx_in_bq8]`.

Given the loop runs for all `k < Q3_HIFI_OUTLIERS` regardless of how many real outliers exist, it would be safer to explicitly skip out-of-range indices:

Proposed defensive guard for outlier indices

```diff
 #pragma unroll
     for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
         const int idx = bq3_hifi->outlier_idx[k];
+        // Skip unused/sentinel entries and out-of-range indices
+        if (idx < 0 || idx >= QK_K) {
+            continue;
+        }
```

If the quantization path already guarantees that unused slots have in-range indices and zero values, this is purely defensive, but it makes the assumptions explicit and hardens the kernel against future layout changes.
include/llama.h (1)
155-158: Public enum change: Q3_HIFI_OLD/UNIFORM removal is a source-level API break. Removing `LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD` (39) and `LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM` (40) and only exposing `LLAMA_FTYPE_MOSTLY_Q3_HIFI = 41` means any external code that referenced the old names will no longer compile, and older GGUF models using 39/40 will now be reported as "unknown" by `llama_model_ftype_name()`.

Given `include/llama.h` is the public API surface, it's worth double-checking that:
- Those legacy ftypes were never documented/stabilized, or
- Downstream users are expected to migrate and this change is communicated in release notes.
If backward compatibility is a concern, one option is to keep the old enumerators (possibly marked as legacy in comments) and explicitly map them at load time to the new Q3_HIFI handling.
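A minimal sketch of that option, assuming the legacy values 39/40 named above (the `remap_legacy_ftype` helper is hypothetical, not part of this PR):

```c
/* Hypothetical sketch: keep the legacy enumerators and fold them into the new
 * Q3_HIFI handling when a model's file type is read at load time. */
enum llama_ftype_compat {
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD     = 39, /* legacy, remapped on load */
    LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM = 40, /* legacy, remapped on load */
    LLAMA_FTYPE_MOSTLY_Q3_HIFI         = 41,
};

static enum llama_ftype_compat remap_legacy_ftype(enum llama_ftype_compat ft) {
    switch (ft) {
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_OLD:
        case LLAMA_FTYPE_MOSTLY_Q3_HIFI_UNIFORM:
            return LLAMA_FTYPE_MOSTLY_Q3_HIFI; /* treat legacy ids as Q3_HIFI */
        default:
            return ft;
    }
}
```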
ggml/src/ggml.c (1)
735-742: Q3_HIFI type traits and quantize wiring look consistent with existing formats. The `type_traits[GGML_TYPE_Q3_HIFI]` entry and the new `GGML_TYPE_Q3_HIFI` branch in `ggml_quantize_chunk` are hooked up in the same way as other block formats (matching block type, block size macro, and `*_ref` quant/dequant helpers), so the core type system and CPU quantization path look correctly extended.

Minor, non-blocking nit: most other `type_name` strings use lowercase (e.g. `"q3_K"`), while `"Q3_HIFI"` is uppercase; keep as-is if it's intentional branding, otherwise consider normalizing for consistency.

Based on learnings, treating this as upstream-facing feedback rather than a strict change request for the mirror.
Also applies to: 7548-7549
ggml/src/ggml-cuda/dequantize.cuh (1)
79-130: Q3_HIFI CUDA dequantization matches the Q3_K layout with sensible outlier handling. The Q3 bit unpacking (`qs`/`hmask` math, index ranges, and the `-4` offset) is consistent with the documented Q3_K layout, and the small unrolled loop over `Q3_HIFI_OUTLIERS` to restore FP16 outliers is straightforward and cheap on device. From a CUDA side this looks sound.

Two minor follow-ups you may want to sanity-check against the CPU reference:
- Ensure `block_q3_hifi`'s fields (`d`, `qs`, `hmask`, `outlier_idx`, `outlier_vals`) and `Q3_HIFI_OUTLIERS` exactly match these assumptions (Q3_K-compatible packing plus a fixed outlier tail).
- The comments disagree on the outlier count ("6 FP16 outliers" vs "only 8 per 256 weights"); consider aligning them to whatever `Q3_HIFI_OUTLIERS` is actually set to, just to avoid confusion.

ggml/src/ggml-cpu/quants.c (2)
582-582: Inconsistent FP16 conversion macro. This line uses `GGML_FP16_TO_FP32` while other functions in this file (e.g., `ggml_vec_dot_q3_K_q8_K_generic` at line 549) use `GGML_CPU_FP16_TO_FP32`. Use the CPU-specific macro for consistency.

🔎 Suggested fix

```diff
-    const float d = GGML_FP16_TO_FP32(xb->d) * yb->d;
+    const float d = GGML_CPU_FP16_TO_FP32(xb->d) * yb->d;
```
631-638: Same FP16 conversion inconsistency in the outlier corrections. These lines also use `GGML_FP16_TO_FP32` instead of `GGML_CPU_FP16_TO_FP32`. Apply the same fix throughout the function.

🔎 Suggested fix

```diff
-    total_sum += GGML_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
-    total_sum += GGML_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[0]) * yb->qs[o_idx[0]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[1]) * yb->qs[o_idx[1]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[2]) * yb->qs[o_idx[2]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[3]) * yb->qs[o_idx[3]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[4]) * yb->qs[o_idx[4]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[5]) * yb->qs[o_idx[5]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[6]) * yb->qs[o_idx[6]] * yd;
+    total_sum += GGML_CPU_FP16_TO_FP32(o_vals[7]) * yb->qs[o_idx[7]] * yd;
```

ggml/src/ggml-sycl/convert.cpp (1)
117-147: SYCL Q3_HIFI wiring matches the existing Q3_K path; align the outlier-count comment. The new `dequantize_row_q3_hifi_sycl` wrapper and its use in both `ggml_get_to_fp16_sycl` and `ggml_get_to_fp32_sycl` mirror the existing `q3_K` implementation and look correct.

One minor nit: the comment at line 117 says "Q3_HIFI: Q3_K-compatible layout with 6 FP16 outliers", while the shared layout (via `block_q3_hifi`/`Q3_HIFI_OUTLIERS`) and Python constants describe Q3_HIFI in terms of the generic `Q3_HIFI_OUTLIERS` count (documented as 8 outliers in other places). To avoid confusion if the constant ever changes, consider rephrasing the comment to reference `Q3_HIFI_OUTLIERS` or to match the canonical outlier count used in the block definition.

Also applies to: 574-576, 640-642
ggml/src/ggml-cuda/convert.cu (1)
555-566: Consider distributing outlier restoration across threads. Currently, only thread 0 handles all outliers while 63 threads idle after the barrier. For 6-8 outliers this overhead is minimal, but if `Q3_HIFI_OUTLIERS` increases, consider having multiple threads participate:

```diff
 // Synchronize before overwriting outliers
 __syncthreads();

-    // Thread 0 handles outlier restoration
-    if (threadIdx.x == 0) {
-        dst_t * yb = yy + i*QK_K;
-        #pragma unroll
-        for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
-            const int idx = x[i].outlier_idx[k];
-            yb[idx] = __half2float(x[i].outlier_vals[k]);
-        }
-    }
+    // Distribute outlier restoration across threads
+    if (threadIdx.x < Q3_HIFI_OUTLIERS) {
+        dst_t * yb = yy + i*QK_K;
+        const int k = threadIdx.x;
+        const int idx = x[i].outlier_idx[k];
+        yb[idx] = __half2float(x[i].outlier_vals[k]);
+    }
```

This avoids thread divergence and utilizes available parallelism.
tests/test-q3-hifi.py (1)
140-145: Consider using `logging.exception()` in except blocks for better diagnostics. Per the static analysis hint, `logging.exception()` automatically includes the stack trace, which can aid debugging. However, for straightforward errors like these, the current approach is acceptable.

Optional improvement

```diff
 try:
     perplexity_exe = find_executable("llama-perplexity", build_dir)
 except FileNotFoundError as e:
-    logging.error("Error: %s", e)
+    logging.exception("Could not find perplexity executable")
     logging.info("Make sure you've built llama.cpp first.")
     return 1
```

ggml/src/ggml-cpu/arch/arm/quants.c (1)
4219-4219: Use a symbolic constant instead of the magic number 96. The hardcoded constant `96` at line 4219 should be replaced with `(Q3_HIFI_BLOCK_SIZE * 3 / 8)` for clarity and maintainability. The value 96 represents the number of bytes needed to store 256 3-bit values.

🔎 Proposed refactor

```diff
-    if (bit_offset > 5 && byte_idx + 1 < 96) {
+    if (bit_offset > 5 && byte_idx + 1 < (Q3_HIFI_BLOCK_SIZE * 3 / 8)) {
         bits |= (qs[byte_idx + 1] << (8 - bit_offset)) & 7;
     }
```

ggml/src/ggml-quants.c (3)
1398-1401: Layout compatibility assumption lacks compile-time verification. The cast from `block_q3_hifi*` to `block_q3_K*` relies on the comment at lines 1399-1400 stating "first 110 bytes are identical layout." Consider adding a compile-time assertion in the header to catch layout drift:

```c
static_assert(offsetof(block_q3_hifi, hmask)  == offsetof(block_q3_K, hmask) &&
              offsetof(block_q3_hifi, qs)     == offsetof(block_q3_K, qs) &&
              offsetof(block_q3_hifi, scales) == offsetof(block_q3_K, scales) &&
              offsetof(block_q3_hifi, d)      == offsetof(block_q3_K, d),
              "block_q3_hifi must be layout-compatible with block_q3_K");
```
5148-5151: Empty section - appears to be vestigial. These lines add a comment block but no implementation follows. If this was intended as a placeholder for future code, consider removing it or adding the intended content.
5463-5466: Validation only checks the `d` field, not the outlier FP16 values. The `VALIDATE_ROW_DATA_D_F16_IMPL` macro validates the block's `d` field, but `block_q3_hifi` also contains `outlier_vals[Q3_HIFI_OUTLIERS]`, which are FP16 values that could contain NaN/Inf. Consider extending validation:

```diff
 case GGML_TYPE_Q3_HIFI:
     {
-        VALIDATE_ROW_DATA_D_F16_IMPL(block_q3_hifi, data, nb);
+        const block_q3_hifi * q = (const block_q3_hifi *) data;
+        for (size_t i = 0; i < nb; ++i) {
+            if (!validate_fp16(q[i].d, i)) {
+                return false;
+            }
+            for (int k = 0; k < Q3_HIFI_OUTLIERS; ++k) {
+                if (!validate_fp16(q[i].outlier_vals[k], i)) {
+                    return false;
+                }
+            }
+        }
     } break;
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
- `ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl` is excluded by `!**/*.glsl`
- `ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.glsl` is excluded by `!**/*.glsl`
- `ggml/src/ggml-vulkan/vulkan-shaders/types.glsl` is excluded by `!**/*.glsl`
📒 Files selected for processing (43)
- `.gitignore`
- `IMatrix_Guide.md`
- `benchmark_speed_test.ps1`
- `convert_hf_to_gguf.py`
- `docs/quantization/Q3_HIFI.md`
- `ggml/include/ggml.h`
- `ggml/src/ggml-common.h`
- `ggml/src/ggml-cpu/arch/arm/quants.c`
- `ggml/src/ggml-cpu/arch/x86/quants.c`
- `ggml/src/ggml-cpu/ggml-cpu.c`
- `ggml/src/ggml-cpu/ops.cpp`
- `ggml/src/ggml-cpu/quants.c`
- `ggml/src/ggml-cpu/quants.h`
- `ggml/src/ggml-cuda/common.cuh`
- `ggml/src/ggml-cuda/convert.cu`
- `ggml/src/ggml-cuda/dequantize.cuh`
- `ggml/src/ggml-cuda/ggml-cuda.cu`
- `ggml/src/ggml-cuda/mmq.cu`
- `ggml/src/ggml-cuda/mmvq.cu`
- `ggml/src/ggml-cuda/vecdotq.cuh`
- `ggml/src/ggml-metal/ggml-metal-device.cpp`
- `ggml/src/ggml-metal/ggml-metal-impl.h`
- `ggml/src/ggml-metal/ggml-metal.metal`
- `ggml/src/ggml-quants.c`
- `ggml/src/ggml-quants.h`
- `ggml/src/ggml-sycl/convert.cpp`
- `ggml/src/ggml-sycl/dequantize.hpp`
- `ggml/src/ggml-sycl/mmvq.cpp`
- `ggml/src/ggml-sycl/vecdotq.hpp`
- `ggml/src/ggml-vulkan/ggml-vulkan.cpp`
- `ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp`
- `ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp`
- `ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp`
- `ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp`
- `ggml/src/ggml.c`
- `gguf-py/gguf/constants.py`
- `include/llama.h`
- `src/llama-model-loader.cpp`
- `src/llama-quant.cpp`
- `tests/test-q3-hifi-text.txt`
- `tests/test-q3-hifi.py`
- `tests/test-q3-hifi.sh`
- `tools/quantize/quantize.cpp`
🧰 Additional context used
📓 Path-based instructions (6)
.gitignore
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration
Files:
.gitignore
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
`**/*.{cpp,h,hpp}`: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
`ggml/src/ggml-vulkan/ggml-vulkan.cpp`, `src/llama-model-loader.cpp`, `ggml/src/ggml-metal/ggml-metal-impl.h`, `include/llama.h`, `ggml/include/ggml.h`, `tools/quantize/quantize.cpp`, `ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp`, `ggml/src/ggml-sycl/convert.cpp`, `ggml/src/ggml-cpu/ops.cpp`, `ggml/src/ggml-common.h`, `ggml/src/ggml-quants.h`, `src/llama-quant.cpp`, `ggml/src/ggml-cpu/quants.h`, `ggml/src/ggml-metal/ggml-metal-device.cpp`, `ggml/src/ggml-sycl/dequantize.hpp`, `ggml/src/ggml-sycl/mmvq.cpp`, `ggml/src/ggml-sycl/vecdotq.hpp`
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
`ggml/src/ggml-vulkan/ggml-vulkan.cpp`, `src/llama-model-loader.cpp`, `ggml/src/ggml-metal/ggml-metal-impl.h`, `include/llama.h`, `ggml/include/ggml.h`, `gguf-py/gguf/constants.py`, `tools/quantize/quantize.cpp`, `ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp`, `ggml/src/ggml-sycl/convert.cpp`, `ggml/src/ggml-cpu/ops.cpp`, `ggml/src/ggml-common.h`, `ggml/src/ggml-quants.h`, `src/llama-quant.cpp`, `ggml/src/ggml-cpu/quants.h`, `tests/test-q3-hifi.py`, `ggml/src/ggml-metal/ggml-metal-device.cpp`, `ggml/src/ggml-sycl/dequantize.hpp`, `convert_hf_to_gguf.py`, `ggml/src/ggml-sycl/mmvq.cpp`, `ggml/src/ggml-sycl/vecdotq.hpp`
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
`src/llama-model-loader.cpp`, `src/llama-quant.cpp`
include/llama.h
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Files:
include/llama.h
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
`**/*.py`: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
`gguf-py/gguf/constants.py`, `tests/test-q3-hifi.py`, `convert_hf_to_gguf.py`
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to .gitignore : Never commit build artifacts (build/, .ccache/, *.o, *.gguf) or model files; ensure proper .gitignore configuration
Applied to files:
.gitignore
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility
Applied to files:
`src/llama-model-loader.cpp`, `include/llama.h`, `src/llama-quant.cpp`
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical
Applied to files:
include/llama.h
🧬 Code graph analysis (9)
ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (2)
- common/common.cpp (2)
  - `string_ends_with` (459-461)
  - `string_ends_with` (459-459)
- common/common.h (1)
  - `string_starts_with` (627-630)

ggml/src/ggml-sycl/convert.cpp (1)
- ggml/src/ggml-sycl/dequantize.hpp (3)
  - `vx` (18-18)
  - `dequantize_block_q3_hifi` (350-423)
  - `dequantize_block_q3_hifi` (350-351)

ggml/src/ggml-quants.h (2)
- ggml/src/ggml-quants.c (3)
  - `quantize_row_q3_hifi_ref` (1281-1333)
  - `dequantize_row_q3_hifi` (1390-1409)
  - `quantize_q3_hifi` (1411-1424)
- ggml/src/ggml-cpu/arch/arm/quants.c (1)
  - `dequantize_row_q3_hifi` (4197-4256)

ggml/src/ggml-cuda/mmvq.cu (1)
- ggml/src/ggml-sycl/vecdotq.hpp (1)
  - `vec_dot_q3_hifi_q8_1` (805-806)

ggml/src/ggml-cpu/quants.h (3)
- ggml/src/ggml-cpu/quants.c (2)
  - `quantize_row_q3_hifi` (69-73)
  - `ggml_vec_dot_q3_hifi_q8_K_generic` (558-642)
- ggml/src/ggml-cpu/arch/x86/quants.c (1)
  - `ggml_vec_dot_q3_hifi_q8_K` (2336-2485)
- ggml/src/ggml-cpu/arch/arm/quants.c (1)
  - `ggml_vec_dot_q3_hifi_q8_K` (2049-2187)

ggml/src/ggml-cpu/arch/arm/quants.c (1)
- ggml/src/ggml-cpu/quants.c (1)
  - `ggml_vec_dot_q3_hifi_q8_K_generic` (558-642)

ggml/src/ggml-quants.c (1)
- ggml/src/ggml.c (1)
  - `ggml_row_size` (1281-1284)

ggml/src/ggml-cpu/quants.c (1)
- ggml/src/ggml-quants.c (1)
  - `quantize_row_q3_hifi_ref` (1281-1333)

convert_hf_to_gguf.py (1)
- gguf-py/gguf/constants.py (1)
  - `LlamaFileType` (3233-3275)
🪛 LanguageTool
IMatrix_Guide.md
[style] ~361-~361: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...--no-ppl` ### Problem: IMatrix file is very large Solutions: - This is normal (can b...
(EN_WEAK_ADJECTIVE)
tests/test-q3-hifi-text.txt
[style] ~26-~26: Consider using a synonym to be more concise.
Context: ...startups have made significant progress in recent years. In 2019, Google announced quantum supr...
(IN_RECENT_STYLE)
🪛 markdownlint-cli2 (0.18.1)
docs/quantization/Q3_HIFI.md
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.10)
tests/test-q3-hifi.py
96-96: Avoid specifying long messages outside the exception class
(TRY003)
102-102: subprocess call: check for execution of untrusted input
(S603)
122-122: Avoid specifying long messages outside the exception class
(TRY003)
143-143: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
181-181: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (56)
benchmark_speed_test.ps1 (2)
22-33: LGTM! Good defensive validation to fail fast if required files are missing.
268-296: LGTM! Export logic properly handles both summary statistics (CSV) and raw data (JSON) with timestamped filenames to prevent overwrites.

ggml/src/ggml-metal/ggml-metal-impl.h (1)
ggml/src/ggml-metal/ggml-metal-impl.h (1)
35-37: LGTM!The new constants
N_R0_Q3_HIFIandN_SG_Q3_HIFIare correctly positioned after the Q3_K definitions and follow the existing naming conventions. The values (2, 2) match Q3_K, which makes sense given Q3_HIFI is described as Q3_K-compatible.ggml/include/ggml.h (2)
379-381: Clear documentation for the new type.The comments clearly document that Q3_HIFI uses a Q3_K-compatible layout with 6 FP16 outliers, and that the block structure is defined in ggml-common.h for GPU backend compatibility. This is helpful for developers working with this format.
428-429: Correctly added at the end of the enum.The new
GGML_TYPE_Q3_HIFI = 40is properly added at the end of the enum as recommended by the comment on line 386, maintaining backward compatibility.GGML_TYPE_COUNTis correctly incremented to 41.Note: The
ggml_ftypeenum (lines 439-465) doesn't have a correspondingGGML_FTYPE_MOSTLY_Q3_HIFIentry. If Q3_HIFI models are intended to be saved as GGUF files with a file type indicator, you may need to add this entry. However, this might be handled elsewhere (e.g., in llama.cpp's file type mappings).ggml/src/ggml-sycl/dequantize.hpp (1)
348-423: LGTM - Q3_HIFI dequantization implementation.The implementation correctly follows the Q3_K dequantization pattern and adds outlier restoration logic. The two code paths for QK_K == 256 and other values are consistent with the existing Q3_K implementation.
IMatrix_Guide.md (1)
1-425: Excellent documentation for imatrix workflow.This is a well-structured and comprehensive guide covering the complete imatrix workflow. The sections on Q3_HIFI integration (lines 324-349) are particularly valuable for users of the new quantization format.
ggml/src/ggml-cuda/mmq.cu (1)
254-258: Comment-only clarification about Q3_HIFI routing looks correct.Annotating that Q3_HIFI is excluded from the MMQ path and handled via MMVQ/dequant keeps the control-flow intent clear without changing behavior.
ggml/src/ggml-cuda/ggml-cuda.cu (1)
4375-4401: Q3_HIFI enablement for CUDA matmul is consistent with other K-quant types.Including
GGML_TYPE_Q3_HIFIin the supportedMUL_MAT/MUL_MAT_IDweight types matches how other Q3/Q4_K variants are handled and aligns with the dedicated MMVQ/dequant path for Q3_HIFI.Please make sure end‑to‑end CUDA tests cover Q3_HIFI matmuls (both plain and
_ID) so this wiring is exercised across shapes/backends.tools/quantize/quantize.cpp (1)
46-47: New Q3_HIFI quant option wiring looks correct.The
QUANT_OPTIONSentry for"Q3_HIFI"cleanly maps toLLAMA_FTYPE_MOSTLY_Q3_HIFIand matches the descriptive intent of an adaptive Q3_HIFI/Q3_K/Q4_K mix. Just keep the bpw/behavior string aligned with the Q3_HIFI documentation as upstream numbers evolve.ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp (1)
45-70: Q3_HIFI wiring into Vulkan shader generator matches existing per-type patternsAdding
"q3_hifi"totype_namesand routing it through the specializedmul_mat_vec_q3_hifi.comppath mirrors how_kandiq*types are handled. It should participate correctly in dequant, matmul, andarr_dmmv_*generation without affecting existing types.Also applies to: 669-679
ggml/src/ggml-metal/ggml-metal-device.cpp (1)
637-646: Metal mul_mv/mul_mv_id correctly extend dispatch to Q3_HIFI. The new `GGML_TYPE_Q3_HIFI` cases mirror the existing `GGML_TYPE_Q3_K` mapping (setting `nsg`/`nr0` via `N_SG_Q3_HIFI`/`N_R0_Q3_HIFI`) and leave `smem` at the default, which matches nearby patterns. As long as these constants and kernels are defined consistently in the Metal backend, this wiring looks correct.

Also applies to: 855-863
src/llama-model-loader.cpp (1)
29-64: Loader/display integration for LLAMA_FTYPE_MOSTLY_Q3_HIFI is coherent. The new `LLAMA_FTYPE_MOSTLY_Q3_HIFI` case in both `llama_model_ftype_name()` and the `type_max` → `ftype` switch keeps the guessed file type and the human-readable label in sync with `GGML_TYPE_Q3_HIFI`. Since GGUF's `general.file_type` can still override the guess, this is a safe, backwards-compatible extension.

Also applies to: 641-672
ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_iq1_s.comp (1)
10-51: IQ1_S Vulkan mat-vec shader changes are cosmetic/optimization-only. The added `[[unroll]]` attributes and minor reshuffling of temporaries in `calc_superblock()` preserve the original math (same bit patterns, fma/dot usage, and accumulation into `temp[j][n]`). Functionally this should be a no-op, with potential small perf wins depending on the driver.
718-739: *SYCL MMVQ: Q3_HIFI mat‑vec path is consistent with existing _K implementations
mul_mat_vec_q3_hifi_q8_1_sycl()reuses the genericmul_mat_vec_qwithQK_K/QI3_K, a dedicatedblock_q3_hifi, andvec_dot_q3_hifi_q8_1, and the launcher mirrors the other *_K variants. The dispatcher’s newGGML_TYPE_Q3_HIFIcase correctly routes Q3_HIFI tensors here, so the SYCL backend should now handle this quantization type end‑to‑end.Also applies to: 1099-1101
ggml/src/ggml-cuda/common.cuh (1)
828-833: Q3_HIFI CUDA traits correctly mirror Q3_K geometry. `ggml_cuda_type_traits<GGML_TYPE_Q3_HIFI>` reuses `QK_K`/`QR3_K`/`QI3_K`, which keeps the CUDA block geometry aligned with existing Q3_K kernels and matches the "Q3_K-compatible" layout intent. No issues from this specialization alone.

ggml/src/ggml-cpu/ggml-cpu.c (1)
282-287: LGTM! Q3_HIFI type trait correctly registered. The new type trait entry for `GGML_TYPE_Q3_HIFI` follows the established pattern and correctly references the quantization and vector dot-product functions. The configuration (using `GGML_TYPE_Q8_K` as vec_dot_type and `nrows = 1`) is consistent with the similar `Q3_K` type.

ggml/src/ggml-cuda/mmvq.cu (3)
20-20: LGTM! Q3_HIFI vector dot dispatch added correctly. The dispatch for `vec_dot_q3_hifi_q8_1` is properly placed and follows the established pattern for other quantization types.
47-47: Verify that Q3_HIFI can safely reuse Q3_K's VDR constant. Line 47 returns `VDR_Q3_K_Q8_1_MMVQ` for `GGML_TYPE_Q3_HIFI` with a comment indicating it's the same as Q3_K. While this may be intentional, please confirm that Q3_HIFI's block structure and quantization layout are compatible with Q3_K's vector dot reduction parameter. If they differ structurally (e.g., different numbers of outliers or quant layouts), a dedicated `VDR_Q3_HIFI_Q8_1_MMVQ` constant might be needed.

Based on learnings, please note: This is a mirrored PR, so any issues should be reported to ngxson to relay to the upstream contributor.
529-534: LGTM! Q3_HIFI switch case properly implemented. The switch case for `GGML_TYPE_Q3_HIFI` correctly follows the established pattern and will properly dispatch to the Q3_HIFI-specific matrix-vector multiplication path.

ggml/src/ggml-cpu/ops.cpp (7)
667-692: Add: Q3_HIFI wired into the quantized add path correctly. Including `GGML_TYPE_Q3_HIFI` in the quantized branch reuses `ggml_compute_forward_add_q_f32`, aligned with other Q* and IQ* formats; no issues spotted.

1116-1142: Add1: Q3_HIFI shares the existing quantized scalar-add implementation. `GGML_TYPE_Q3_HIFI` is dispatched to `ggml_compute_forward_add1_q_f32` alongside other quantized types, which is consistent with the existing design.

1244-1273: Acc: Q3_HIFI explicitly marked unsupported like other quantized/low-precision types. Adding `GGML_TYPE_Q3_HIFI` to the grouped cases that end in `GGML_ABORT("fatal error")` keeps behavior consistent with other non-F32 types for `ggml_compute_forward_acc`.

4269-4295: Out_prod: Q3_HIFI routed through the quantized out_prod_q_f32. Dispatching `GGML_TYPE_Q3_HIFI` to `ggml_compute_forward_out_prod_q_f32` mirrors handling of `Q3_K` and other quantized types and relies on the new `to_float` trait; looks correct.

4544-4572: Set: Q3_HIFI added to the unsupported quantized/set group. `GGML_TYPE_Q3_HIFI` is listed among the quantized types that currently cause a fatal abort for `ggml_compute_forward_set`, matching existing behavior for other q-types.

4767-4794: Get_rows: Q3_HIFI enabled for quantized get_rows_q dequantization. Including `GGML_TYPE_Q3_HIFI` in the quantized branch so it uses `ggml_compute_forward_get_rows_q` is consistent with the rest of the quantized formats.

5493-5527: Clamp: Q3_HIFI explicitly treated as unsupported in the clamp switch. Adding `GGML_TYPE_Q3_HIFI` to the clamp switch's quantized/unsupported group ensures it's rejected the same way as other non-F32/F16 types.

tests/test-q3-hifi-text.txt (1)
1-46: LGTM - Good test data variety. The combination of simple narrative (children's story) and complex technical content (quantum computing exposition) provides a good range of text complexity for perplexity validation testing of the Q3_HIFI quantization format.
ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp (1)
41-56: LGTM - Outlier handling is GPU-appropriate. The fully unrolled outlier check without early-exit is the correct approach for GPU shaders to avoid divergent control flow. The Q3_K dequantization logic followed by outlier substitution matches the CPU implementation pattern.
ggml/src/ggml-cpu/quants.c (2)
69-73: LGTM! The `quantize_row_q3_hifi` wrapper follows the established pattern of other quantization functions in this file, correctly delegating to the reference implementation after validating block alignment.

556-564: LGTM - Function structure follows established patterns. The generic Q3_HIFI dot-product implementation correctly follows the Q3_K pattern with appropriate assertions, setup, and the addition of outlier corrections. The unrolled outlier handling is a reasonable performance choice for the generic fallback.
ggml/src/ggml-sycl/vecdotq.hpp (1)
801-855: Outlier-aware Q3_HIFI dot path looks consistent; please double-check the per-thread outlier gating. The structure of `vec_dot_q3_hifi_q8_1` is good: it reuses the Q3_K bulk kernel and then adds an explicit outlier correction term based on `outlier_idx`/`outlier_vals`, which is exactly what we want for Q3_HIFI.

The only part that's hard to validate locally is the per-thread gating logic for outliers:

```cpp
const int idx_bq8    = idx / QK8_1;
const int idx_in_bq8 = idx % QK8_1;
...
if (idx_bq8 >= bq8_offset && idx_bq8 < bq8_offset + QR3_K) {
    const int thread_q8_offset = iqs % QI8_1;
    const int pos_in_q8_group  = idx_in_bq8 / 4;
    if (pos_in_q8_group == thread_q8_offset) {
        ...
        sum += outlier_val * q8_val * d8_val;
    }
}
```

Given how `iqs`, `QR3_K`, `QI3_K`, `QK8_1`, and `QI8_1` interact in the MMVQ kernels, it would be good to confirm that:
- Every valid outlier index is visited by exactly one `(block, iqs)` thread.
- No outlier is silently skipped because `pos_in_q8_group`/`thread_q8_offset` don't line up with the actual packing in `bq8_1[*].qs`.

You can verify this by diffing this function against the upstream CUDA `vec_dot_q3_hifi_q8_1` implementation and by running the Q3_HIFI tests/benchmarks from the upstream PR (including any SYCL-specific tests) to ensure dot products match the reference CPU path across a range of shapes and random seeds. If there is any discrepancy, the bug is likely in the `pos_in_q8_group` vs `thread_q8_offset` mapping.

gguf-py/gguf/constants.py (1)
3220-3220: Q3_HIFI enum and size wiring look consistent; keep them in sync with the ggml headers. The additions

- `GGMLQuantizationType.Q3_HIFI = 41`
- `LlamaFileType.MOSTLY_Q3_HIFI = 41`
- `GGML_QUANT_SIZES[Q3_HIFI] = (256, 134)  # Q3_K (110 bytes) + outlier_idx[8] + outlier_vals[16]`

are internally consistent and match the documented Q3_HIFI layout (Q3_K base + 8 outliers).

The only follow-up I'd suggest is to double-check that:
- `GGML_TYPE_Q3_HIFI` in `ggml.h` also uses id 41, and
- `block_q3_hifi`'s sizeof matches 134 bytes,

to avoid silent GGUF/C-side drift when upstream tweaks the layout.

You can confirm this quickly by checking the upstream `ggml.h` and `ggml-common.h` for the `GGML_TYPE_Q3_HIFI` enum value and the definition of `block_q3_hifi`, and ensuring the size expression evaluates to 134 for `QK_K = 256`. Based on learnings, please relay any mismatch upstream via ngxson.

Also applies to: 3272-3274, 3370-3371
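As a worked check of the arithmetic behind that 134-byte figure (a sketch only; the constant names here are hypothetical, not from `ggml-common.h`):

```c
/* Q3_HIFI block size: a 110-byte Q3_K base block plus an outlier tail of
 * 8 one-byte indices and 8 two-byte FP16 values = 110 + 8 + 16 = 134 bytes. */
enum { Q3K_BASE_BYTES = 110, N_OUTLIERS = 8 };

_Static_assert(Q3K_BASE_BYTES + N_OUTLIERS * 1 + N_OUTLIERS * 2 == 134,
               "Q3_HIFI block size must match gguf-py's GGML_QUANT_SIZES entry");
```

A static assertion like this on the real `sizeof(block_q3_hifi)` in the C headers would catch drift between the two sides at compile time.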
ggml/src/ggml-quants.h (1)
33-34: Q3_HIFI quant/dequant API surface is consistent with the existing K-quants. The new declarations for

- `quantize_row_q3_hifi_ref(...)`
- `dequantize_row_q3_hifi(...)`
- `quantize_q3_hifi(...)`

are shaped and placed consistently with the existing K-quant APIs (`q3_K`, `q4_K`, etc.), which keeps the CPU backend and quantization dispatcher uniform.

As long as the implementations in `ggml-quants.c` and the arch-specific backends (AVX2, NEON, etc.) match these signatures, this header change looks good.

If you haven't already, it's worth grepping for `quantize_row_q3_hifi_ref`, `dequantize_row_q3_hifi`, and `quantize_q3_hifi` across the repo to ensure:
- All declarations and definitions agree on parameter types and order.
- No backend (e.g. CUDA/Metal/SYCL callers) still assumes only the older Q3_K entry points.
Also applies to: 106-108
ggml/src/ggml-cuda/convert.cu (2)
569-573: LGTM! The launcher follows the established Q3_K pattern correctly.

732-733: LGTM! Dispatch integration follows the established pattern and is correctly placed after Q3_K.
Also applies to: 785-786
ggml/src/ggml-cpu/arch/x86/quants.c (4)
2334-2351: LGTM! Function signature and setup follow the established Q3_K pattern correctly.

2353-2455: LGTM! The AVX2 implementation correctly mirrors the Q3_K kernel while properly handling the `ggml_fp16_t` scale field specific to Q3_HIFI.

2481-2484: LGTM! The fallback to the generic implementation follows the established pattern.
3974-3975: The comment references a non-existent function. The function `dequantize_row_q3_hifi` does not exist. The correct function name for Q3_K dequantization is `dequantize_row_q3_K`, not `dequantize_row_q3_hifi`. Update the comment to reference the correct function name.

Likely an incorrect or invalid review comment.
ggml/src/ggml-common.h (1)
291-306: Well-designed struct with Q3_K compatibility. The approach of keeping the first 110 bytes identical to `block_q3_K` and appending the outlier extension is sound for kernel reuse. The compile-time `static_assert` provides good validation of the expected layout. The "DO NOT REORDER" comment is helpful for future maintainers.

tests/test-q3-hifi.py (1)
161-175: Sensible test configuration for quick validation. Using `-c 256` and `--chunks 2` keeps the test fast for CI while still validating that the Q3_HIFI format works correctly. Combining stdout/stderr ensures PPL extraction works regardless of where the output goes.

ggml/src/ggml-cpu/arch/arm/quants.c (2)
2158-2173: Update comment to reflect 8 outliers, not 6. Line 2158 incorrectly states "fully unrolled for 6 outliers" while the code processes 8 (indices 0-7 at lines 2166-2173), matching the Q3_HIFI_OUTLIERS constant. Also verify the macro inconsistency: the main loop uses `GGML_CPU_FP16_TO_FP32` (line 2086) while the outlier correction uses `GGML_FP16_TO_FP32` (lines 2166-2173); both sit within the same ARM NEON function, suggesting a potential copy-paste error.

4195-4257: The ARM NEON implementation already has fallback coverage. The generic implementation in `ggml/src/ggml-quants.c` is compiled into the main library and automatically available on all architectures, including non-NEON ARM systems. The absence of an explicit `#else` clause in the ARM-specific file is not problematic for linking. However, adding a comment clarifying that the generic implementation provides the fallback would improve code clarity.

ggml/src/ggml-quants.c (3)
1281-1333: LGTM - Reference quantization implementation is sound. The outlier selection using magnitude-based ranking and masking with `-1.0f` works correctly since `fabsf()` ensures non-negative values. The block-level processing and field copying are appropriate.
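For readers skimming the diff, the selection scheme amounts to the following (an illustrative sketch, not the actual `quantize_row_q3_hifi_ref` code; `pick_outliers` is a hypothetical name):

```c
#include <math.h>
#include <stdint.h>

/* Pick the n_out largest-magnitude weights of one block. Masking a chosen
 * slot with -1.0f is safe because fabsf() makes every candidate >= 0, so a
 * masked slot can never win again. */
static void pick_outliers(const float * x, int n, int n_out,
                          uint8_t * out_idx, float * out_val) {
    float mag[256]; /* assumes n <= 256 (one Q3_HIFI block) */
    for (int j = 0; j < n; ++j) {
        mag[j] = fabsf(x[j]);
    }
    for (int k = 0; k < n_out; ++k) {
        int best = 0;
        for (int j = 1; j < n; ++j) {
            if (mag[j] > mag[best]) { best = j; }
        }
        out_idx[k] = (uint8_t) best; /* position within the block    */
        out_val[k] = x[best];        /* original value (stored FP16) */
        mag[best]  = -1.0f;          /* mask: never re-selected      */
    }
}
```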
1371-1373: Bulk quantization uses `_ref` instead of `_impl` - verify this is intentional. The weighted magnitude at line 1347 affects outlier selection, but `quantize_row_q3_K_ref` (not `_impl`) is called for bulk quantization. This means `quant_weights` don't influence the Q3_K portion. If weights should affect bulk quantization quality, consider:

```diff
-    quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
+    // Note: weights only affect outlier selection, not bulk Q3_K quantization
+    quantize_row_q3_K_ref(tmp, &q3k_block, Q3_HIFI_BLOCK_SIZE);
```

Or use `quantize_row_q3_K_impl` if the bulk path should be weight-aware.
1411-1424: LGTM - Entry point follows established patterns. The routing logic and row-by-row processing match other `quantize_q*_K` functions in the file.

ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp (3)
24-31: Barrier placement in a non-uniform execution path. The barrier at line 27 is inside a conditional block (`if (!all_threads)`). When `all_threads` is false, some invocations may skip the barrier while others execute it, which is undefined behavior in GLSL. However, looking at the logic:
- When `!all_threads`, the barrier is at line 27
- When `all_threads`, the barrier is at line 48

This appears correct since the condition is uniform across the workgroup (determined by loop bounds). The early `continue` at line 30 is after the barrier, which is safe.

Please verify that `all_threads` is indeed uniform across all invocations in the workgroup, as non-uniform barrier execution would cause undefined behavior.
81-122: LGTM - Standard mat-vec computation structure. The work partitioning between full iterations (`all_threads=true`) and the partial remainder (`all_threads=false`) correctly handles non-power-of-2 block counts.
124-135: LGTM - Boundary handling is correct. The shader properly handles partial workgroups at the tensor boundary by computing the remaining rows when `first_row + NUM_ROWS > p.stride_d`.

ggml/src/ggml-vulkan/ggml-vulkan.cpp (3)
3583-3583: LGTM! The pipeline creation for Q3_HIFI mul_mat_vec operations follows the established K-quant pattern correctly, using the appropriate `reduc16` variants and workgroup configurations consistent with Q3_K.

Also applies to: 3608-3608

3705-3705: LGTM! The dequantization and get_rows pipeline registrations for Q3_HIFI are correctly added with appropriate workgroup sizes matching the K-quant pattern. The code references the expected shader data arrays (`dequant_q3_hifi_*`, `get_rows_q3_hifi_*`) which should be generated from the corresponding shader files.

Also applies to: 3731-3731, 3757-3757

5407-5407: LGTM! All switch statement additions correctly include `GGML_TYPE_Q3_HIFI` in the fallthrough blocks alongside other K-quant types. This ensures proper operator support validation and pipeline selection for the new quantization type. The ordering is consistent (always placed after Q3_K) across all locations.

Also applies to: 5479-5479, 5543-5543, 5634-5634, 5701-5701, 13850-13850, 13971-13971
ggml/src/ggml-metal/ggml-metal.metal (3)
9693-9693: Q3_HIFI `get_rows` wiring looks consistent but depends on `dequantize_q3_hifi` correctness. Registering `kernel_get_rows_q3_hifi` as `kernel_get_rows_q<block_q3_hifi, QK_NL, dequantize_q3_hifi>` mirrors the existing Q3_K path. Once `dequantize_q3_hifi` is fixed/confirmed, this instantiation should behave correctly and does not need further changes.

9756-9757: `kernel_mul_mm_q3_hifi_{f32,f16}` instantiations correctly mirror the Q3_K path. The new mat-mat kernels for Q3_HIFI (`kernel_mul_mm_q3_hifi_f32` and `_f16`) hook into the generic `kernel_mul_mm` template using `block_q3_hifi` and `dequantize_q3_hifi`, consistent with the existing Q3_K registrations. Provided `dequantize_q3_hifi` matches the reference semantics, these registrations look good.

Also applies to: 9783-9784

9816-9817: Q3_HIFI indirect matmul/matvec registrations look fine but rely on core kernel correctness.
- `kernel_mul_mm_id_q3_hifi_{f32,f16}` reuse the same `mul_mm_id` template as other Q3_* types, with `block_q3_hifi` and `dequantize_q3_hifi`.
- `kernel_mul_mv_id_q3_hifi_f32` wires `kernel_mul_mv_q3_hifi_f32_impl<N_R0_Q3_HIFI>` through the existing `mmv_fn` indirection.

These are mechanically consistent with the other quant types. Once the base `dequantize_q3_hifi` and `kernel_mul_mv_q3_hifi_f32_impl` semantics are confirmed/fixed, these indirect paths should work as expected.

Also applies to: 9843-9844, 9999-10000
```powershell
# Find the fastest model for comparison
$FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum

# Detailed results table
Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
Write-Host "-" * 70

$TableHeader = "{0,-15} {1,10} {2,10} {3,10} {4,10} {5,10} {6,10}" -f "Model", "Mean", "StdDev", "Median", "Min", "Max", "vs Best"
Write-Host $TableHeader -ForegroundColor White
Write-Host "-" * 70

foreach ($model in $Models) {
    $stats = $AllStats[$model.Name]
    $vsBest = if ($stats.Mean -eq $FastestMean) { "FASTEST" } else {
        "-" + [math]::Round((1 - $stats.Mean / $FastestMean) * 100, 1) + "%"
    }

    $row = "{0,-15} {1,10:F2} {2,10:F2} {3,10:F2} {4,10:F2} {5,10:F2} {6,10}" -f `
        $model.Name, $stats.Mean, $stats.StdDev, $stats.Median, $stats.Min, $stats.Max, $vsBest

    if ($stats.Mean -eq $FastestMean) {
        Write-Host $row -ForegroundColor Green
    } else {
        Write-Host $row
    }
}
```
Fix potential division by zero.
If all benchmark runs fail (no speeds collected), $FastestMean will be 0, causing a division-by-zero error on Line 210. This will crash the script instead of showing that all benchmarks failed.
🔎 Proposed fix

```diff
 # Find the fastest model for comparison
 $FastestMean = ($AllStats.Values | ForEach-Object { $_.Mean } | Measure-Object -Maximum).Maximum
+if ($FastestMean -eq 0) {
+    Write-Warning "All benchmarks failed - no valid speed data collected"
+    $FastestMean = 1 # Prevent division by zero
+}

 # Detailed results table
 Write-Host "SPEED COMPARISON (tokens/second - higher is better)" -ForegroundColor Yellow
```
```python
    "--outtype", type=str, choices=["f32", "f16", "bf16", "q8_0", "tq1_0", "tq2_0", "q3_hifi", "auto"], default="f16",
    help="output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, q3_hifi for Q3_HIFI (3-bit with outliers), and auto for the highest-fidelity 16-bit float type depending on the first loaded tensor type",
)
```
--outtype q3_hifi currently leads to a runtime error (no quant mapping for MOSTLY_Q3_HIFI).
You've exposed `q3_hifi` in the CLI and `ftype_map`, but `prepare_tensors()` never maps `gguf.LlamaFileType.MOSTLY_Q3_HIFI` to a `gguf.GGMLQuantizationType`. With the current code, running:

```
python convert_hf_to_gguf.py --outtype q3_hifi ...
```

will hit the `else: raise ValueError(f"Unknown file type: {self.ftype.name}")` branch in `prepare_tensors()` when `data_qtype` is still `False`.
You probably want to wire Q3_HIFI through the quantization mapping so the new outtype is usable.
Proposed fix: map MOSTLY_Q3_HIFI to GGMLQuantizationType.Q3_HIFI
Add a case alongside the other file types in ModelBase.prepare_tensors:
if isinstance(data_qtype, bool):
if self.ftype == gguf.LlamaFileType.ALL_F32:
data_qtype = gguf.GGMLQuantizationType.F32
elif self.ftype == gguf.LlamaFileType.MOSTLY_F16:
data_qtype = gguf.GGMLQuantizationType.F16
elif self.ftype == gguf.LlamaFileType.MOSTLY_BF16:
data_qtype = gguf.GGMLQuantizationType.BF16
elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0:
data_qtype = gguf.GGMLQuantizationType.Q8_0
+ elif self.ftype == gguf.LlamaFileType.MOSTLY_Q3_HIFI:
+ data_qtype = gguf.GGMLQuantizationType.Q3_HIFI
elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ1_0:
data_qtype = gguf.GGMLQuantizationType.TQ1_0
elif self.ftype == gguf.LlamaFileType.MOSTLY_TQ2_0:
data_qtype = gguf.GGMLQuantizationType.TQ2_0
else:
raise ValueError(f"Unknown file type: {self.ftype.name}")(Adjust the enum name if the quantization type is spelled differently in gguf.GGMLQuantizationType.)
Since this repo mirrors upstream, you may want to pass this concern along to the upstream PR author as well. Based on learnings, this should be relayed via ngxson to the original contributor.
Also applies to: 10583-10592
🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 10425-10427 (and similarly 10583-10592),
the new CLI outtype "q3_hifi" is exposed but prepare_tensors() never maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to a GGML quantization type, causing the
"Unknown file type" error; update the file-type -> quantization mapping in
ModelBase.prepare_tensors to include a branch that maps
gguf.LlamaFileType.MOSTLY_Q3_HIFI to gguf.GGMLQuantizationType.Q3_HIFI (or the
exact enum name used by gguf), mirroring the pattern used for the other file
types so q3_hifi runs without raising the ValueError.
```
Perplexity Improvement (Q3_HIFI vs Q3_K_M)
═══════════════════════════════════════════════════════
0.6B  ████████████████████████████████████        -16.4%
1.7B  ██████████████████████████████████████████  -21.4%
4B    ██████████████████                           -7.3%
8B    ███████████                                  -4.4%
14B   ████                                         -1.6%
32B   █████                                        -2.0%
```
Add a language tag to the fenced code block (MD040).
The fenced block starting at line 64 opens without a language, triggering markdownlint MD040. Consider specifying a language such as `text` or `none` on the opening fence to satisfy tooling and keep rendering consistent.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/quantization/Q3_HIFI.md around lines 64 to 73, the fenced code block
lacks a language tag which triggers markdownlint MD040; update the opening fence
to include a neutral language (e.g., "text" or "none") so the block is
explicitly marked as plain text, ensuring the linter passes and rendering stays
consistent.
```c
// === OUTLIER EXTENSION (18 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS];    // 6 bytes: outlier positions (0-255)
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values
```
Comments have incorrect byte sizes for outlier extension.
The comments don't match the actual sizes based on Q3_HIFI_OUTLIERS = 8:
- Line 302: Comment says "18 bytes" but the extension is actually 24 bytes (8 + 16)
- Line 303: Comment says "6 bytes" but `outlier_idx[8]` is 8 bytes
- Line 304: Comment says "12 bytes" but `outlier_vals[8]` × 2 bytes = 16 bytes
The code and static_assert are correct; only the comments need updating.
Suggested fix
```diff
-    // === OUTLIER EXTENSION (18 bytes) ===
-    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];    // 6 bytes: outlier positions (0-255)
-    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 12 bytes: FP16 outlier values
+    // === OUTLIER EXTENSION (24 bytes) ===
+    uint8_t outlier_idx[Q3_HIFI_OUTLIERS];    // 8 bytes: outlier positions (0-255)
+    ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 16 bytes: FP16 outlier values
```
+ ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 16 bytes: FP16 outlier values📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```c
// === OUTLIER EXTENSION (24 bytes) ===
uint8_t outlier_idx[Q3_HIFI_OUTLIERS];    // 8 bytes: outlier positions (0-255)
ggml_half outlier_vals[Q3_HIFI_OUTLIERS]; // 16 bytes: FP16 outlier values
```
🤖 Prompt for AI Agents
In ggml/src/ggml-common.h around lines 302 to 304, the inline comments for the
outlier extension show incorrect byte counts; update them to match
Q3_HIFI_OUTLIERS = 8 so the extension is 24 bytes, outlier_idx[Q3_HIFI_OUTLIERS]
is 8 bytes, and outlier_vals[Q3_HIFI_OUTLIERS] (ggml_half) is 16 bytes; leave
the code and static_assert untouched and only change the comment texts to the
correct sizes.
```c
// Q3_HIFI: Add outlier corrections
// Fully unrolled loop for 6 outliers - eliminates loop overhead
// Note: We tried branchless masking but the computation cost outweighs
// any branch misprediction savings for only 6 outliers per block.
for (int i = 0; i < nb; ++i) {
    const float d_y = y[i].d;
    const int8_t      * GGML_RESTRICT q8   = y[i].qs;
    const uint8_t     * GGML_RESTRICT idx  = x[i].outlier_idx;
    const ggml_fp16_t * GGML_RESTRICT vals = x[i].outlier_vals;

    // Unrolled: process all 8 outliers without loop overhead
    // Using FMA-friendly pattern: accumulate (w * a) * d_y
    sumf += GGML_FP16_TO_FP32(vals[0]) * (float)q8[idx[0]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[1]) * (float)q8[idx[1]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[2]) * (float)q8[idx[2]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[3]) * (float)q8[idx[3]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[4]) * (float)q8[idx[4]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[5]) * (float)q8[idx[5]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[6]) * (float)q8[idx[6]] * d_y;
    sumf += GGML_FP16_TO_FP32(vals[7]) * (float)q8[idx[7]] * d_y;
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Search for Q3_HIFI_OUTLIERS constant definition
rg -n 'Q3_HIFI_OUTLIERS' --type c --type h -C2
```

Repository: ngxson/llama.cpp
Length of output: 5216

🏁 Script executed:

```bash
#!/bin/bash
# Search for outlier_idx and outlier_vals declarations to check array sizes
rg -n 'outlier_idx|outlier_vals' --type c --type h -B2 -A2
```

Repository: ngxson/llama.cpp
Length of output: 5851

🏁 Script executed:

```bash
#!/bin/bash
# Search for Q3_HIFI struct definition
rg -n 'struct.*Q3_HIFI|{.*outlier_idx.*outlier_vals' --type c --type h -C5
```

Repository: ngxson/llama.cpp
Length of output: 195

🏁 Script executed:

```bash
#!/bin/bash
# Get the exact lines around 2457-2477 from the x86 quants.c file
sed -n '2457,2477p' ggml/src/ggml-cpu/arch/x86/quants.c | cat -n
```

Repository: ngxson/llama.cpp
Length of output: 1409
Comments reference "6 outliers" but code processes 8.
Lines 2458 and 2460 mention "6 outliers" while the loop unrolls exactly 8 iterations (lines 2469-2476) and Q3_HIFI_OUTLIERS is defined as 8. Update the comments:
- // Fully unrolled loop for 6 outliers - eliminates loop overhead
- // Note: We tried branchless masking but the computation cost outweighs
- // any branch misprediction savings for only 6 outliers per block.
+ // Fully unrolled loop for 8 outliers - eliminates loop overhead
+ // Note: We tried branchless masking but the computation cost outweighs
+ // any branch misprediction savings for only 8 outliers per block.The code is correct; only the comments need updating to match Q3_HIFI_OUTLIERS.
🤖 Prompt for AI Agents
In ggml/src/ggml-cpu/arch/x86/quants.c around lines 2457 to 2477, the inline
comments incorrectly reference "6 outliers" while the loop unroll processes 8
outliers (and Q3_HIFI_OUTLIERS is defined as 8); update the comments to say "8
outliers" (or reference Q3_HIFI_OUTLIERS) and remove mention of 6 to match the
actual unrolled code and constant, e.g., change the comment lines that read "6
outliers" to "8 outliers" or "Q3_HIFI_OUTLIERS outliers" so the comments
accurately reflect the implementation.
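One way to prevent this kind of comment drift entirely is to keep the count symbolic. A minimal sketch, assuming the same block fields and ggml FP16 helpers as the kernel quoted above; compilers typically unroll a fixed trip count this small at -O2, so the generated code should match the hand-unrolled version:

```c
// Sketch only: the same outlier correction as the unrolled code above, but
// the count lives in one place (Q3_HIFI_OUTLIERS), so comments and code
// cannot disagree about it.
static inline float q3_hifi_outlier_sum(const uint8_t     * idx,
                                        const ggml_fp16_t * vals,
                                        const int8_t      * q8,
                                        float               d_y) {
    float s = 0.0f;
    for (int j = 0; j < Q3_HIFI_OUTLIERS; ++j) {
        s += GGML_FP16_TO_FP32(vals[j]) * (float)q8[idx[j]] * d_y;
    }
    return s;
}
```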
```glsl
// Q3_HIFI dequantization shader
// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers
```
Comment states 6 outliers, but implementation uses 8.
The comment says "6 FP16 outliers" but line 49 loops over Q3_HIFI_OUTLIERS which is defined as 8 elsewhere in the codebase. Update the comment to match the actual implementation.
🔎 Suggested fix

```diff
 // Q3_HIFI dequantization shader
-// Uses Q3_K-compatible layout (hmask + qs + scales) with 6 FP16 outliers
+// Uses Q3_K-compatible layout (hmask + qs + scales) with 8 FP16 outliers
```
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/dequant_q3_hifi.comp around lines 3 to 4,
the header comment states "6 FP16 outliers" but the implementation loops over
Q3_HIFI_OUTLIERS which is defined as 8; update the comment to reflect the actual
implementation (e.g., change "6 FP16 outliers" to "8 FP16 outliers" or remove
the hardcoded count) so the comment matches the code.
```glsl
// Q3_HIFI matrix-vector multiplication shader
// Uses Q3_K-compatible layout, outlier correction skipped on GPU for simplicity
// (outliers are still applied on CPU for full quality)
```
GPU quality differs from CPU: outlier correction is skipped.
The comment explicitly states that outlier correction is skipped on GPU. This is a significant quality difference, since outliers are the defining feature of Q3_HIFI that distinguishes it from Q3_K; users may not expect different results between the CPU and GPU backends.
Consider one of the following:
- Implementing outlier correction in the shader (adds ~8 FMA ops per block)
- Documenting this limitation prominently in user-facing documentation
- Adding a warning when Q3_HIFI models are used on Vulkan
🤖 Prompt for AI Agents
In ggml/src/ggml-vulkan/vulkan-shaders/mul_mat_vec_q3_hifi.comp around lines 4-6
the comment notes outlier correction is skipped on GPU causing Q3_HIFI results
to differ from CPU; resolve by either (A) implement outlier correction in the
shader by adding the per-block outlier adjustment: load outlier values per
block, apply the ~8 FMA operations per block after the core dot-product
accumulation and before writing results, taking care to preserve precision and
memory alignment and testing performance impact; or (B) if shader implementation
is unacceptable, add a prominent user-facing change: update documentation to
clearly state GPU backend skips outlier correction for Q3_HIFI and add a runtime
warning emitted when a Q3_HIFI model is loaded on Vulkan to notify users of
potential quality differences.
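If option (B) is the chosen stopgap, the warning only needs a few lines at model-load time. A rough sketch in C, where the backend_is_vulkan flag and the exact hook point are assumptions rather than the actual loader API:

```c
// Hypothetical placement: wherever tensor types are validated at load time.
// LLAMA_LOG_WARN follows llama.cpp's logging style; the surrounding
// condition is illustrative only, not the real loader code.
if (backend_is_vulkan && tensor->type == GGML_TYPE_Q3_HIFI) {
    LLAMA_LOG_WARN("%s: Q3_HIFI outlier correction is skipped on the Vulkan "
                   "backend; output may differ slightly from CPU\n", __func__);
}
```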
```cpp
    else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_HIFI) {
        // Adaptive Q3_HIFI: use Q3_HIFI for ALL attn_v layers (consistently sensitive)
        new_type = GGML_TYPE_Q3_HIFI;
    }
```
Adaptive Q3_HIFI policy is coherent, but fallback handling for GGML_TYPE_Q3_HIFI is missing.
The new ftype wiring for LLAMA_FTYPE_MOSTLY_Q3_HIFI looks logically consistent:

- attn_v.weight → always GGML_TYPE_Q3_HIFI (these are known to be sensitive).
- ffn_down → first ~1/3 of layers Q3_HIFI, middle band Q4_K when use_more_bits is true, rest Q3_K.
- attn_output.weight and attn_qkv.weight → treated as Q4_K.
- The default type for this ftype is GGML_TYPE_Q3_K, matching the "Q3_K base, Q3_HIFI on sensitive layers" comment.

However, in the convert_incompatible_tensor block at the end of llama_tensor_get_type, there is no case for GGML_TYPE_Q3_HIFI. If a tensor assigned Q3_HIFI ends up with nx % ggml_blck_size(Q3_HIFI) != 0, the switch will hit the default branch and throw:

```cpp
default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
```

Existing K-quants (including GGML_TYPE_Q3_K) instead fall back to a more permissive format (e.g., IQ4_NL) in this situation.
To make Q3_HIFI robust to odd geometries and consistent with the rest of the K-quants, add an explicit fallback case for GGML_TYPE_Q3_HIFI, e.g.:
Proposed fallback mapping for GGML_TYPE_Q3_HIFI:

```diff
     switch (new_type) {
         case GGML_TYPE_TQ1_0:
         case GGML_TYPE_TQ2_0: new_type = GGML_TYPE_Q4_0; break; // TODO: use a symmetric type instead
         case GGML_TYPE_IQ2_XXS:
@@
         case GGML_TYPE_Q2_K:
         case GGML_TYPE_Q3_K:
+        case GGML_TYPE_Q3_HIFI:
         case GGML_TYPE_IQ4_XS: new_type = GGML_TYPE_IQ4_NL; break;
```

This preserves the "Q3_HIFI as a K-quant" behavior for valid shapes, while giving you a graceful degradation path instead of a hard failure for incompatible tensor sizes. Based on learnings, this is exactly the sort of issue that should be relayed upstream.

Also applies to: 355-360, 404-405, 413-415, 586-587
🤖 Prompt for AI Agents
In src/llama-quant.cpp around lines 298-301 (and similarly at the other affected
ranges 355-360, 404-405, 413-415, 586-587), the convert_incompatible_tensor
switch lacks a case for GGML_TYPE_Q3_HIFI which causes a runtime throw when nx %
ggml_blck_size(GGML_TYPE_Q3_HIFI) != 0; add an explicit case for
GGML_TYPE_Q3_HIFI that mirrors the existing fallback used for K-quant types
(e.g., map it to the same permissive fallback format used for GGML_TYPE_Q3_K /
other K-quant fallbacks such as the IQ4_NL-equivalent branch), so tensors that
cannot use packed Q3_HIFI gracefully degrade instead of hitting the default
error path.
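The condition that routes a tensor into this fallback path is worth spelling out. A hedged sketch, assuming nx is the row size as used inside llama_tensor_get_type:

```c
// Sketch of the guard behind convert_incompatible_tensor: if the row length
// is not a multiple of the block size, Q3_HIFI cannot pack the row, and a
// graceful fallback (mirroring the existing K-quant mapping) beats a throw.
if (nx % ggml_blck_size(GGML_TYPE_Q3_HIFI) != 0) {
    new_type = GGML_TYPE_IQ4_NL;
}
```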
```python
import logging


# Configuration
PPL_THRESHOLD = 25.0  # Reasonable threshold for 3-bit quantization
```
Missing logging.basicConfig(): info/debug messages won't appear.
Per the Python logging docs, "the handler's level is set to WARNING, so all events at this and greater severities will be output." Since there is no logging.basicConfig() call, all logging.info() and logging.debug() calls (lines 101, 147, 148, 154, 162, 183, 184, 187) will be silently dropped. The script will appear to run silently on success, only showing output on errors.
Suggested fix

Add configuration after the imports:

```diff
 import logging

+# Configure logging to show info-level messages
+logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
+
 # Configuration
 PPL_THRESHOLD = 25.0  # Reasonable threshold for 3-bit quantization
```
PPL_THRESHOLD = 25.0 # Reasonable threshold for 3-bit quantization🤖 Prompt for AI Agents
In tests/test-q3-hifi.py around lines 23 to 26, logging is imported but no basic
configuration is set so info/debug messages are dropped; add a
logging.basicConfig() call immediately after the imports to set an appropriate
level (INFO or DEBUG) and a simple format so logging.info()/debug() calls (lines
101, 147, 148, 154, 162, 183, 184, 187) are emitted during successful runs.
```bash
PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")

if [ -z "$PPL" ]; then
    # Try alternate format: just look for the last PPL value
    PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
fi
```
grep -oP is not portable to macOS.
The -P (Perl regex) flag is a GNU grep extension and is not available on macOS's BSD grep by default. This will cause the test to fail on macOS.
Consider using a more portable approach:
🔎 Proposed portable fix

```diff
 # Extract final perplexity value
 # Format: "Final estimate: PPL = X.XXXX +/- Y.YYYY"
-PPL=$(echo "$PPL_OUTPUT" | grep -oP "Final estimate: PPL = \K[0-9]+\.[0-9]+" || echo "")
+PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*Final estimate: PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)
 if [ -z "$PPL" ]; then
     # Try alternate format: just look for the last PPL value
-    PPL=$(echo "$PPL_OUTPUT" | grep -oP "PPL = \K[0-9]+\.[0-9]+" | tail -1 || echo "")
+    PPL=$(echo "$PPL_OUTPUT" | sed -n 's/.*PPL = \([0-9]*\.[0-9]*\).*/\1/p' | tail -1)
 fi
```

As per coding guidelines, cross-platform compatibility should be tested on Linux, macOS, and Windows.
🤖 Prompt for AI Agents
In tests/test-q3-hifi.sh around lines 84–89, the script uses grep -oP (Perl
regex) which is not available on macOS; replace the -P usage with a
POSIX-portable extraction using grep -E (or plain grep) piped to sed or awk to
capture the numeric PPL value (e.g., match "PPL =" then extract the following
float), and keep the fallback that picks the last match if empty; ensure quoting
and exit-on-failure behavior remain unchanged so the script works on Linux and
macOS.
@coderabbitai do a review with my "mirror PR" preset. then answer: are there any signs this PR is AI-generated?
🧩 Analysis chain

🏁 Script executed (length of output: 8352)
🏁 Script executed (length of output: 1070)
🏁 Script executed (length of output: 248)
🏁 Script executed (length of output: 3471)

🌐 Web query result: I couldn't find a public GitHub pull request numbered 18246 for llama.cpp nor any PR titled "Q3_HIFI". Do you want me to:

Tell me which repo or give the link/any extra detail and I'll look it up.

@ngxson I've completed the review with your mirror PR preset. Here are my findings:

🚨 AI-Generated Code Detected

Yes, there are clear signs this PR contains AI-generated code. I found suspicious "CRITICAL:" comments that match the pattern you asked me to watch for:

These "CRITICAL:" prefixed comments are a telltale sign of AI-generated code, especially when they state obvious implementation details.

❌ Obvious Mistakes Found

1. Duplicate Function Declarations (
Mirror from upstream PR: ggml-org#18246
Summary by CodeRabbit

- New Features
- Documentation
- Tests
- Chores