Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native #151259
Conversation
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings. The previous implementation used a counting loop that LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled, LLVM uses k-registers and extracts bits individually with ~31 `kshiftrd` instructions.

This fix replaces the counting loop with explicit SSE2 intrinsics (`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64. `_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen regardless of CPU features.

Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):

- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)

The loongarch64 implementation retains the original counting loop since it doesn't have this issue.

Regression from: rust-lang#130733
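For illustration, a minimal stand-alone sketch of that approach (simplified: the merged change also ORs several 16-byte loads together with `_mm_or_si128` before a single `pmovmskb`, and handles small inputs separately):

```rust
/// Illustrative sketch, not the exact library code.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let mut i = 0;
    while i + 16 <= bytes.len() {
        // SAFETY: SSE2 is statically guaranteed by the `cfg` above, and the
        // loop condition guarantees 16 readable bytes starting at `bytes[i]`.
        let v = unsafe { _mm_loadu_si128(bytes.as_ptr().add(i).cast::<__m128i>()) };
        // `_mm_movemask_epi8` compiles to `pmovmskb`: it gathers the high bit
        // of each byte, so a non-zero mask means a non-ASCII byte.
        // SAFETY: same SSE2 guarantee as above.
        if unsafe { _mm_movemask_epi8(v) } != 0 {
            return false;
        }
        i += 16;
    }
    // Handle the tail (and inputs shorter than 16 bytes) byte by byte.
    bytes[i..].iter().all(u8::is_ascii)
}
```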
cc @rust-lang/wg-const-eval
For inputs smaller than 32 bytes, use usize-at-a-time processing instead of calling the SSE2 function. This avoids function call overhead from `#[target_feature(enable = "sse2")]`, which prevents inlining. Also moves CHUNK_SIZE to module level so it can be shared between is_ascii and is_ascii_sse2.
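As a sketch of what usize-at-a-time (SWAR) processing means here (illustrative, not the exact library code; `is_ascii_swar` is a made-up name):

```rust
/// A byte is ASCII iff its top bit is clear, so OR whole words together and
/// test the 0x80 bit of every byte lane once at the end.
fn is_ascii_swar(bytes: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let (chunks, rest) = bytes.as_chunks::<WORD>();
    let mut acc = 0usize;
    for chunk in chunks {
        acc |= usize::from_ne_bytes(*chunk);
    }
    // No word had a high bit set, and the sub-word tail is ASCII too.
    (acc & HIGH_BITS) == 0 && rest.iter().all(u8::is_ascii)
}
```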
Are there any guidelines about CPU detection at runtime?
You cannot perform runtime feature detection in `core`. Does a custom avx-512 path perform much better than the sse2 one compiled with `-C target-cpu=native`?
Is there an LLVM issue for this?
I tried to reduce the input a bit and filed llvm/llvm-project#176906.
@folkertdev the avx512 path is 1.76x faster for big inputs. For the current PR my intention was just to fix the avx512 regression until LLVM fixes it. The current regression of 24x slower is critical imho.
Hmm, that's unfortunate.
I think it is, also because a runtime feature check is not free. So you'd have to determine some input size for which checking for avx-512 support would be worth it, and it would still slow down every non-avx-512 machine, which is still the vast majority.
Agreed
The runtime feature check is cached, so after the first call it is just a few instructions each time. I imagine that with small-input handling like that present in the PR, but for size 64, the runtime detection would be completely amortized.
I updated the description with links to the issue that you opened. An avx512 path gives a 1.7x performance boost for large inputs, but I will save that discussion for another PR since we are missing components to even implement it. (I would like to look at it, though.)
tests/codegen-llvm/slice-is-ascii.rs
Outdated
//@ only-x86_64
//@ compile-flags: -C opt-level=3 -C target-cpu=x86-64

//@ only-loongarch64
//@ compile-flags: -C opt-level=3
The two test files can be merged using revisions, which would be nice if we want the same thing for other arches:
//@ revisions: X86_64 LA64
//@ compile-flags: -C opt-level=3
//
//@[X86_64] only-x86_64
//@[X86_64] compile-flags: -C target-cpu=x86-64
//
//@[LA64] only-loongarch64
...Then use CHECK as the prefix that applies to all, and X86_64 / LA64 as the specific prefix.
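For illustration, the merged test could look roughly like this; the function name, the `znver4` CPU choice, and the exact check lines are assumptions here, not the actual test:

```rust
//@ revisions: X86_64 LA64
//@ assembly-output: emit-asm
//@ compile-flags: -C opt-level=3
//@[X86_64] only-x86_64
//@[X86_64] compile-flags: -C target-cpu=znver4
//@[LA64] only-loongarch64

// CHECK lines apply to both revisions; X86_64/LA64 lines only to their target.
// CHECK-LABEL: slice_is_ascii
// X86_64-NOT: kshiftrd
// X86_64-NOT: kshiftrq
// LA64: vmskltz.b
#[no_mangle]
pub fn slice_is_ascii(s: &[u8]) -> bool {
    s.is_ascii()
}
```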
library/core/src/slice/ascii.rs
Outdated
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
#[target_feature(enable = "sse2")]
unsafe fn is_ascii_sse2(bytes: &[u8]) -> bool {
This shouldn't need `target_feature(enable ...)` since it's already gated behind `sse2`, right? That would allow the function to be safe.
The unsafe is not because of the target feature. In fact I suspect you can remove the unsafe from the function. The only pre-condition is sse2 support, and with the target_feature annotation that will get enforced anyway.
done
Since you've already been looking, r? folkertdev
Combine the x86_64 and loongarch64 is_ascii tests into a single file using compiletest revisions. Both now test assembly output:

- X86_64: Verifies no broken kshiftrd/kshiftrq instructions (AVX-512 fix)
- LA64: Verifies vmskltz.b instruction is used (auto-vectorization)
Remove the `#[target_feature(enable = "sse2")]` attribute and make the function safe to call. The SSE2 requirement is already enforced by the `#[cfg(target_feature = "sse2")]` predicate. Individual unsafe blocks are used for intrinsic calls with appropriate SAFETY comments. Also adds a FIXME reference to llvm#176906 for tracking when this workaround can be removed.
///
/// FIXME(llvm#176906): Remove this workaround once LLVM generates efficient code.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
Did you consider writing the loop using `as_chunks`?
https://rust.godbolt.org/z/vfxK51fq8
Somehow LLVM seems to understand this less well, but it does remove a bunch of the unsafety from the loop.
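The godbolt link is the authoritative version of the suggestion; roughly, `as_chunks` replaces the manual index arithmetic from the earlier sketch, so only the intrinsic calls themselves need `unsafe`:

```rust
/// Sketch of the `as_chunks`-based loop (paraphrased from the suggestion).
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

    let (chunks, rest) = bytes.as_chunks::<16>();
    for chunk in chunks {
        // SAFETY: a `&[u8; 16]` is always valid for an unaligned 16-byte
        // load, and SSE2 is statically guaranteed by the `cfg` above.
        let v = unsafe { _mm_loadu_si128(chunk.as_ptr().cast::<__m128i>()) };
        // SAFETY: same SSE2 guarantee as above.
        if unsafe { _mm_movemask_epi8(v) } != 0 {
            return false;
        }
    }
    rest.iter().all(u8::is_ascii)
}
```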
Pretty sure I tested it earlier but that benchmark showed a regression.
Will check again when I get home 👍
I tested with `as_chunks`. I'm seeing roughly a 10% slowdown for larger inputs. The results are consistent over several runs.
Wild. Could you open a separate issue for that performance issue? I'd love to see what our LLVM and mir-opt people think about that. It just seems like it should be equivalent, so this might be one of those cases where LLVM overfits on what clang will give it (and we should fix that).
I checked everything again.
I had `#[target_feature(enable = "sse2")]` enabled for the benchmark, which actually produces different asm code.
Your godbolt link is good and produces better performance.
I will do another PR for this.
@bors r+
…ertdev Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native

## Summary

This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`. On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.

This change is intended as a **temporary workaround** for poor upstream LLVM codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.

## Problem

When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb` instruction. This causes a **~22x performance regression**.

Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.

## Root cause (upstream)

The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns. An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906

Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.

## Solution

Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.

## Godbolt Links (Rust 1.92)

| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |

## Benchmark Results

**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)

### Default Target (SSE2) — Mixed

| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |

### Native Target (AVX-512) — Up to 24x Faster

| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |

### Summary

- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)

Note: this is the pure ascii path, but the story is similar for the others. See linked bench project.

## Test Plan

- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

## Related Issues

- Issue opened by @folkertdev: llvm/llvm-project#176906
- Regression introduced in rust-lang#130733
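The linked repository contains the actual fuzz harness; as a minimal illustration of the same differential idea (not the PR's libfuzzer setup):

```rust
/// Differential check: any vectorized is_ascii must agree with the scalar
/// definition for every input.
fn check(input: &[u8]) {
    let reference = input.iter().all(|b| *b < 0x80);
    assert_eq!(input.is_ascii(), reference);
}

fn main() {
    // Cover lengths around the 16/32/64-byte chunk boundaries, placing a
    // non-ASCII byte at every position in turn.
    for len in 0..=128 {
        let mut buf = vec![b'a'; len];
        check(&buf);
        for i in 0..len {
            buf[i] = 0x80;
            check(&buf);
            buf[i] = b'a';
        }
    }
}
```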
…uwer Rollup of 7 pull requests

Successful merges:

- #149848 (Use allocator_shim_contents in allocator_shim_symbols)
- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151505 (Various refactors to the proc_macro bridge)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)
- #151540 (Tweak bounds check in `DepNodeColorMap.get`)

r? @ghost
Rollup of 6 pull requests

Successful merges:

- #149848 (Use allocator_shim_contents in allocator_shim_symbols)
- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151505 (Various refactors to the proc_macro bridge)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)

r? @ghost
…uwer Rollup of 8 pull requests

Successful merges:

- #150556 (Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R)
- #151259 (Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native)
- #151500 (hexagon: Add HVX target features)
- #151517 (Enable reproducible binary builds with debuginfo on Linux)
- #151482 (Add "Skip to main content" link for keyboard navigation in rustdoc)
- #151489 (constify boolean methods)
- #151551 (Don't use default build-script fingerprinting in `test`)
- #151555 (Fix compilation of std/src/sys/pal/uefi/tests.rs)

r? @ghost
Rollup merge of #151259 - bonega:fix-is-ascii-avx512, r=folkertdev
Use explicit SSE2 intrinsics with `_mm_movemask_epi8` instead of relying on LLVM auto-vectorization. This generates efficient `pmovmskb` code on all x86_64 targets and avoids LLVM's broken AVX-512 auto-vectorization, which produces ~31 `kshiftrd` instructions.

The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

Performance impact (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

Improves on rust-lang#151259. See: llvm/llvm-project#176906
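A simplified sketch of that shape (illustrative; the merged code's small-input dispatch and tail handling differ):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    let (chunks, rest) = bytes.as_chunks::<64>();
    for chunk in chunks {
        // SAFETY: a `&[u8; 64]` provides four 16-byte lanes for unaligned
        // loads, and SSE2 is statically guaranteed by the `cfg` above.
        let mask = unsafe {
            let p = chunk.as_ptr().cast::<__m128i>();
            // OR the four lanes together so a single `pmovmskb` covers
            // all 64 bytes.
            let ab = _mm_or_si128(_mm_loadu_si128(p), _mm_loadu_si128(p.add(1)));
            let cd = _mm_or_si128(_mm_loadu_si128(p.add(2)), _mm_loadu_si128(p.add(3)));
            _mm_movemask_epi8(_mm_or_si128(ab, cd))
        };
        if mask != 0 {
            return false;
        }
    }
    // Tails and inputs < 64 bytes take the usize-at-a-time SWAR path in the
    // real code (shown byte-by-byte here for brevity).
    rest.iter().all(u8::is_ascii)
}
```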
Use explicit SSE2 intrinsics to avoid LLVM's broken AVX-512 auto-vectorization, which generates ~31 kshiftrd instructions.

Performance (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

Improves on rust-lang#151259.
Thanks a lot for the help ❤️
…erformance, r=folkertdev Improve is_ascii performance on x86_64 with explicit SSE2 intrinsics

# Summary

Improves `slice::is_ascii` performance for the SSE2 target roughly 1.5-2x on larger inputs. AVX-512 keeps similar performance characteristics.

This is building on the work already merged in rust-lang#151259. In particular this PR improves the default SSE2 performance; I don't consider this a temporary fix anymore.

Thanks to @folkertdev for pointing me to consider `as_chunks` again.

# The implementation:

- Uses 64-byte chunks with 4x 16-byte SSE2 loads OR'd together
- Extracts the MSB mask with a single `pmovmskb` instruction
- Falls back to usize-at-a-time SWAR for inputs < 64 bytes

# Performance impact (vs before rust-lang#151259):

- AVX-512: 34-48x faster
- SSE2: 1.5-2x faster

<details>
<summary>Benchmark Results (click to expand)</summary>

Benchmarked on AMD Ryzen 9 9950X (AVX-512 capable). Values show relative performance (1.00 = fastest). Tops out at 139 GB/s for large inputs.

### early_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | 1.01 | **1.00** | 13.45 | 1.13 |
| 1024 | 1.01 | **1.00** | 13.53 | 1.14 |
| 65536 | 1.01 | **1.00** | 13.99 | 1.12 |
| 1048576 | 1.02 | **1.00** | 13.29 | 1.12 |

### late_non_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 64 | **1.00** | 1.01 | 13.37 | 1.13 |
| 1024 | 1.10 | **1.00** | 42.42 | 1.95 |
| 65536 | **1.00** | 1.06 | 42.22 | 1.73 |
| 1048576 | **1.00** | 1.03 | 34.73 | 1.46 |

### pure_ascii

| Input Size | new_avx512 | new_sse2 | old_avx512 | old_sse2 |
|------------|------------|----------|------------|----------|
| 4 | 1.03 | **1.00** | 1.75 | 1.32 |
| 8 | **1.00** | 1.14 | 3.89 | 2.06 |
| 16 | **1.00** | 1.04 | 1.13 | 1.62 |
| 32 | 1.07 | 1.19 | 5.11 | **1.00** |
| 64 | **1.00** | 1.13 | 13.32 | 1.57 |
| 128 | **1.00** | 1.01 | 19.97 | 1.55 |
| 256 | **1.00** | 1.02 | 27.77 | 1.61 |
| 1024 | **1.00** | 1.02 | 41.34 | 1.84 |
| 4096 | 1.02 | **1.00** | 45.61 | 1.98 |
| 16384 | 1.01 | **1.00** | 48.67 | 2.04 |
| 65536 | **1.00** | 1.03 | 43.86 | 1.77 |
| 262144 | **1.00** | 1.06 | 41.44 | 1.79 |
| 1048576 | 1.02 | **1.00** | 35.36 | 1.44 |

</details>

Adds assembly test to verify:

- `kshiftrd`/`kshiftrq` are NOT generated
- `pmovmskb`/`vpor` ARE generated

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

Relates to: llvm/llvm-project#176906
Rollup merge of #151611 - bonega:improve-is-slice-is-ascii-performance, r=folkertdev