Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native #151259
Conversation
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings. The previous implementation used a counting loop that LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled, LLVM uses k-registers and extracts bits individually with ~31 `kshiftrd` instructions.

This fix replaces the counting loop with explicit SSE2 intrinsics (`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64. `_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen regardless of CPU features.

Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):

- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)

The loongarch64 implementation retains the original counting loop since it doesn't have this issue.

Regression from: rust-lang#130733
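For illustration, a minimal stand-alone sketch of that approach (simplified: the merged change also ORs several 16-byte loads together with `_mm_or_si128` before a single `pmovmskb`, and handles small inputs separately):

```rust
/// Illustrative sketch, not the exact library code.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let mut i = 0;
    while i + 16 <= bytes.len() {
        // SAFETY: SSE2 is statically guaranteed by the `cfg` above, and the
        // loop condition guarantees 16 readable bytes starting at `bytes[i]`.
        let v = unsafe { _mm_loadu_si128(bytes.as_ptr().add(i).cast::<__m128i>()) };
        // `_mm_movemask_epi8` compiles to `pmovmskb`: it gathers the high bit
        // of each byte, so a non-zero mask means a non-ASCII byte.
        // SAFETY: same SSE2 guarantee as above.
        if unsafe { _mm_movemask_epi8(v) } != 0 {
            return false;
        }
        i += 16;
    }
    // Handle the tail (and inputs shorter than 16 bytes) byte by byte.
    bytes[i..].iter().all(u8::is_ascii)
}
```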
cc @rust-lang/wg-const-eval
For inputs smaller than 32 bytes, use usize-at-a-time processing instead of calling the SSE2 function. This avoids function call overhead from `#[target_feature(enable = "sse2")]`, which prevents inlining. Also moves CHUNK_SIZE to module level so it can be shared between is_ascii and is_ascii_sse2.
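As a sketch of what usize-at-a-time (SWAR) processing means here (illustrative, not the exact library code; `is_ascii_swar` is a made-up name):

```rust
/// A byte is ASCII iff its top bit is clear, so OR whole words together and
/// test the 0x80 bit of every byte lane once at the end.
fn is_ascii_swar(bytes: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let (chunks, rest) = bytes.as_chunks::<WORD>();
    let mut acc = 0usize;
    for chunk in chunks {
        acc |= usize::from_ne_bytes(*chunk);
    }
    // No word had a high bit set, and the sub-word tail is ASCII too.
    (acc & HIGH_BITS) == 0 && rest.iter().all(u8::is_ascii)
}
```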
Are there any guidelines about CPU detection at runtime?
You cannot perform runtime feature detection in `core`. Does a custom avx-512 path perform much better than the sse2 one compiled with `-C target-cpu=native`?
Is there an LLVM issue for this?
I tried to reduce the input a bit and filed llvm/llvm-project#176906.
@folkertdev the avx512 path is 1.76x faster for big inputs. For the current PR my intention was just to fix the avx512 regression until LLVM fixes it. The current regression of 24x slower is critical imho.
Hmm, that's unfortunate.
I think it is, also because a runtime feature check is not free. So you'd have to determine some input size for which checking for avx-512 support would be worth it, and it would still slow down every non-avx-512 machine, which is still the vast majority.
Agreed
The runtime feature check is cached, so after the first call it is just a few instructions each time. I imagine that with small-input handling like that present in the PR, but for size 64, the runtime detection would be completely amortized.
I updated the description with links to the issue that you opened. An avx512 path gives a 1.7x performance boost for large inputs, but I will save that discussion for another PR since we are missing components to even implement it. (I would like to look at it, though.)
tests/codegen-llvm/slice-is-ascii.rs
Outdated
//@ only-x86_64
//@ compile-flags: -C opt-level=3 -C target-cpu=x86-64

//@ only-loongarch64
//@ compile-flags: -C opt-level=3
The two test files can be merged using revisions, which would be nice if we want the same thing for other arches:
//@ revisions: X86_64 LA64
//@ compile-flags: -C opt-level=3
//
//@[X86_64] only-x86_64
//@[X86_64] compile-flags: -C target-cpu=x86-64
//
//@[LA64] only-loongarch64
...Then use CHECK as the prefix that applies to all, and X86_64 / LA64 as the specific prefix.
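For illustration, the merged test could look roughly like this; the function name, the `znver4` CPU choice, and the exact check lines are assumptions here, not the actual test:

```rust
//@ revisions: X86_64 LA64
//@ assembly-output: emit-asm
//@ compile-flags: -C opt-level=3
//@[X86_64] only-x86_64
//@[X86_64] compile-flags: -C target-cpu=znver4
//@[LA64] only-loongarch64

// CHECK lines apply to both revisions; X86_64/LA64 lines only to their target.
// CHECK-LABEL: slice_is_ascii
// X86_64-NOT: kshiftrd
// X86_64-NOT: kshiftrq
// LA64: vmskltz.b
#[no_mangle]
pub fn slice_is_ascii(s: &[u8]) -> bool {
    s.is_ascii()
}
```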
library/core/src/slice/ascii.rs
Outdated
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
#[target_feature(enable = "sse2")]
unsafe fn is_ascii_sse2(bytes: &[u8]) -> bool {
This shouldn't need `target_feature(enable ...)` since it's already gated behind `sse2`, right? That would allow the function to be safe.
The unsafe is not because of the target feature. In fact I suspect you can remove the unsafe from the function. The only pre-condition is sse2 support, and with the target_feature annotation that will get enforced anyway.
done
Since you've already been looking, r? folkertdev
Combine the x86_64 and loongarch64 is_ascii tests into a single file using compiletest revisions. Both now test assembly output:

- X86_64: Verifies no broken kshiftrd/kshiftrq instructions (AVX-512 fix)
- LA64: Verifies vmskltz.b instruction is used (auto-vectorization)
Remove the `#[target_feature(enable = "sse2")]` attribute and make the function safe to call. The SSE2 requirement is already enforced by the `#[cfg(target_feature = "sse2")]` predicate. Individual unsafe blocks are used for intrinsic calls with appropriate SAFETY comments. Also adds a FIXME reference to llvm#176906 for tracking when this workaround can be removed.
///
/// FIXME(llvm#176906): Remove this workaround once LLVM generates efficient code.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
Did you consider writing the loop using `as_chunks`?
https://rust.godbolt.org/z/vfxK51fq8
Somehow LLVM seems to understand this less well, but it does remove a bunch of the unsafety from the loop.
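The godbolt link is the authoritative version of the suggestion; roughly, `as_chunks` replaces the manual index arithmetic from the earlier sketch, so only the intrinsic calls themselves need `unsafe`:

```rust
/// Sketch of the `as_chunks`-based loop (paraphrased from the suggestion).
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let (chunks, rest) = bytes.as_chunks::<16>();
    for chunk in chunks {
        // SAFETY: a `&[u8; 16]` is always valid for an unaligned 16-byte
        // load, and SSE2 is statically guaranteed by the `cfg` above.
        let v = unsafe { _mm_loadu_si128(chunk.as_ptr().cast::<__m128i>()) };
        // SAFETY: same SSE2 guarantee as above.
        if unsafe { _mm_movemask_epi8(v) } != 0 {
            return false;
        }
    }
    rest.iter().all(u8::is_ascii)
}
```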
Pretty sure I tested it earlier but that benchmark showed a regression.
Will check again when I get home 👍
I tested with `as_chunks`. I'm seeing roughly a 10% slowdown for larger inputs. The results are consistent over several runs.
Wild. Could you open a separate issue for that performance issue? I'd love to see what our LLVM and mir-opt people think about that. It just seems like it should be equivalent, so this might be one of those cases where LLVM overfits on what clang will give it (and we should fix that).
I checked everything again.
I had `#[target_feature(enable = "sse2")]` enabled for the benchmark, which actually produces different asm code.
Your godbolt link is good and produces better performance.
I will do another PR for this.
@bors r+
…ertdev Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native

## Summary

This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`. On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.

This change is intended as a **temporary workaround** for poor upstream LLVM codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.

## Problem

When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb` instruction. This causes a **~22x performance regression**.

Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.

## Root cause (upstream)

The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns. An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906

Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.

## Solution

Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.

## Godbolt Links (Rust 1.92)

| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |

## Benchmark Results

**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)

### Default Target (SSE2) — Mixed

| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |

### Native Target (AVX-512) — Up to 24x Faster

| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |

### Summary

- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)

Note: this is the pure ascii path, but the story is similar for the others. See linked bench project.

## Test Plan

- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

## Related Issues

- Issue opened by @folkertdev: llvm/llvm-project#176906
- Regression introduced in rust-lang#130733
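The linked repository contains the actual fuzz harness; as a minimal illustration of the same differential idea (not the PR's libfuzzer setup):

```rust
/// Differential check: any vectorized is_ascii must agree with the scalar
/// definition for every input.
fn check(input: &[u8]) {
    let reference = input.iter().all(|b| *b < 0x80);
    assert_eq!(input.is_ascii(), reference);
}

fn main() {
    // Cover lengths around the 16/32/64-byte chunk boundaries, placing a
    // non-ASCII byte at every position in turn.
    for len in 0..=128 {
        let mut buf = vec![b'a'; len];
        check(&buf);
        for i in 0..len {
            buf[i] = 0x80;
            check(&buf);
            buf[i] = b'a';
        }
    }
}
```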
…uwer Rollup of 7 pull requests

Successful merges:

- #149848 (Use allocator_shim_contents in allocator_shim_symbols)
- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151505 (Various refactors to the proc_macro bridge)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)
- #151540 (Tweak bounds check in `DepNodeColorMap.get`)

r? @ghost
Rollup of 6 pull requests

Successful merges:

- #149848 (Use allocator_shim_contents in allocator_shim_symbols)
- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151505 (Various refactors to the proc_macro bridge)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)

r? @ghost
…uwer Rollup of 8 pull requests

Successful merges:

- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151500 (hexagon: Add HVX target features)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151489 (constify boolean methods)
- #151551 (Don't use default build-script fingerprinting in `test`)
- #151555 (Fix compilation of std/src/sys/pal/uefi/tests.rs)

r? @ghost
Rollup merge of #151259 - bonega:fix-is-ascii-avx512, r=folkertdev
Use explicit SSE2 intrinsics with `_mm_movemask_epi8` instead of relying on LLVM auto-vectorization. This generates efficient `pmovmskb` code on all x86_64 targets and avoids LLVM's broken AVX-512 auto-vectorization, which produces ~31 `kshiftrd` instructions.

The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

Performance impact (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

Improves on rust-lang#151259. See: llvm/llvm-project#176906
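A simplified sketch of that shape (illustrative; the merged code's small-input dispatch and tail handling differ):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    let (chunks, rest) = bytes.as_chunks::<64>();
    for chunk in chunks {
        // SAFETY: a `&[u8; 64]` provides four 16-byte lanes for unaligned
        // loads, and SSE2 is statically guaranteed by the `cfg` above.
        let mask = unsafe {
            let p = chunk.as_ptr().cast::<__m128i>();
            // OR the four lanes together so a single `pmovmskb` covers
            // all 64 bytes.
            let ab = _mm_or_si128(_mm_loadu_si128(p), _mm_loadu_si128(p.add(1)));
            let cd = _mm_or_si128(_mm_loadu_si128(p.add(2)), _mm_loadu_si128(p.add(3)));
            _mm_movemask_epi8(_mm_or_si128(ab, cd))
        };
        if mask != 0 {
            return false;
        }
    }
    // Tails and inputs < 64 bytes take the usize-at-a-time SWAR path in the
    // real code (shown byte-by-byte here for brevity).
    rest.iter().all(u8::is_ascii)
}
```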
Use explicit SSE2 intrinsics to avoid LLVM's broken AVX-512 auto-vectorization, which generates ~31 kshiftrd instructions.

Performance (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Improves on rust-lang#151259.
Thanks a lot for the help ❤️
…erformance, r=folkertdev Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics

# Summary

Improves `slice::is_ascii` performance for the SSE2 target roughly 1.5-2x on larger inputs. AVX-512 keeps similar performance characteristics.

This is building on the work already merged in rust-lang#151259. In particular this PR improves the default SSE2 performance; I don't consider this a temporary fix anymore.

Thanks to @folkertdev for pointing me to consider `as_chunks` again.

# The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

# Performance impact (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

<details>
<summary>Benchmark Results (click to expand)</summary>

Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest). Tops out at 139 GB/s for large inputs.

### early_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | 1.01 | **1.00** | 13.45 | 1.13 |
| 1024 | 1.01 | **1.00** | 13.53 | 1.14 |
| 65536 | 1.01 | **1.00** | 13.99 | 1.12 |
| 1048576 | 1.02 | **1.00** | 13.29 | 1.12 |

### late_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | **1.00** | 1.01 | 13.37 | 1.13 |
| 1024 | 1.10 | **1.00** | 42.42 | 1.95 |
| 65536 | **1.00** | 1.06 | 42.22 | 1.73 |
| 1048576 | **1.00** | 1.03 | 34.73 | 1.46 |

### pure_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 4 | 1.03 | **1.00** | 1.75 | 1.32 |
| 8 | **1.00** | 1.14 | 3.89 | 2.06 |
| 16 | **1.00** | 1.04 | 1.13 | 1.62 |
| 32 | 1.07 | 1.19 | 5.11 | **1.00** |
| 64 | **1.00** | 1.13 | 13.32 | 1.57 |
| 128 | **1.00** | 1.01 | 19.97 | 1.55 |
| 256 | **1.00** | 1.02 | 27.77 | 1.61 |
| 1024 | **1.00** | 1.02 | 41.34 | 1.84 |
| 4096 | 1.02 | **1.00** | 45.61 | 1.98 |
| 16384 | 1.01 | **1.00** | 48.67 | 2.04 |
| 65536 | **1.00** | 1.03 | 43.86 | 1.77 |
| 262144 | **1.00** | 1.06 | 41.44 | 1.79 |
| 1048576 | 1.02 | **1.00** | 35.36 | 1.44 |

</details>

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

Relates to: llvm/llvm-project#176906
Rollup merge of #151611 - bonega:improve-is-slice-is-ascii-performance, r=folkertdev