Rollup of 8 pull requests #151575
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings. The previous implementation used a counting loop that LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled, LLVM uses k-registers and extracts bits individually with ~31 `kshiftrd` instructions.

This fix replaces the counting loop with explicit SSE2 intrinsics (`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64. `_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen regardless of CPU features.

Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):
- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With `-C target-cpu=native`: ~3 GB/s → ~67 GB/s (22x improvement)

The loongarch64 implementation retains the original counting loop since it doesn't have this issue.

Regression from: rust-lang#130733
For inputs smaller than 32 bytes, use usize-at-a-time processing instead of calling the SSE2 function. This avoids function call overhead from #[target_feature(enable = "sse2")] which prevents inlining. Also moves CHUNK_SIZE to module level so it can be shared between is_ascii and is_ascii_sse2.
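The usize-at-a-time idea for short inputs can be sketched as follows. This is only an illustration, not the code from the commit: `is_ascii_small_sketch` is a hypothetical name, and the tail handling is an assumption. A byte is non-ASCII exactly when its high bit is set, so one mask test covers a whole machine word.

```rust
/// Hypothetical helper (not the function from the commit): checks a short
/// slice one machine word at a time instead of calling into SIMD code.
fn is_ascii_small_sketch(bytes: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    // 0x80 repeated in every byte of a word: the high bit of each byte.
    const HIGH_BITS: usize = usize::from_ne_bytes([0x80; WORD]);

    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        // Copy the chunk into a fixed-size buffer so it can be read as a word.
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(chunk);
        // Any set high bit means at least one byte is >= 0x80.
        if usize::from_ne_bytes(buf) & HIGH_BITS != 0 {
            return false;
        }
    }
    // Bytes that do not fill a whole word are checked individually.
    chunks.remainder().iter().all(|b| b.is_ascii())
}
```

Because the whole function is ordinary safe code with no `#[target_feature]` attribute, the compiler is free to inline it into callers.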
This will be used in order to emit HVX intrinsics
Combine the x86_64 and loongarch64 is_ascii tests into a single file using compiletest revisions. Both now test assembly output:
- X86_64: Verifies no broken kshiftrd/kshiftrq instructions (AVX-512 fix)
- LA64: Verifies vmskltz.b instruction is used (auto-vectorization)
Remove the `#[target_feature(enable = "sse2")]` attribute and make the function safe to call. The SSE2 requirement is already enforced by the `#[cfg(target_feature = "sse2")]` predicate. Individual unsafe blocks are used for intrinsic calls with appropriate SAFETY comments. Also adds FIXME reference to llvm#176906 for tracking when this workaround can be removed.
Removed the comment about reproducibility failures with crate type `bin` and `-Cdebuginfo=2` on non-Windows machines (issue rust-lang#89911).
Implements WCAG 2.4.1 (Level A) - Bypass Blocks accessibility feature.

Changes:
- Add skip-main-content link in page.html with tabindex=-1 on main-content
- Add CSS styling per reviewer feedback (outline border, themed colors)
- Add GOML test for skip navigation functionality

Fixes rust-lang#151420
This changes the `test` build script so that it does not use the default fingerprinting mechanism in cargo which causes a full scan of the package every time it runs. This build script does not depend on any of the files in the package. This is the recommended approach for writing build scripts.
Dropped the `align` test since the `POOL_ALIGNMENT` and `align_size` items it uses do not exist. The other changes are straightforward fixes for places where the test code drifted from the current API, since the tests are not yet built in CI for the UEFI target.
…umbv8r, r=petrochenkov
Add Tier 3 Thumb-mode targets for Armv7-A, Armv7-R and Armv8-R

We currently have targets for bare-metal Armv7-R, Armv7-A and Armv8-R, but only in Arm mode. This PR adds five new targets enabling bare-metal support on these architectures in Thumb mode.

This has been tested using https://github.com/rust-embedded/aarch32/compare/main...thejpster:aarch32:support-thumb-mode-v7-v8?expand=1 and they all seem to work as expected.

However, I wasn't sure what to do with the maintainer lists as these are five new targets, but they share the docs page with the existing Arm versions. I can ask the Embedded Devices WG Arm Team about taking on these ones too, but whether Arm themselves want to take them on I guess is a bigger question.
…ertdev
Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native

## Summary

This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`.

On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.

This change is intended as a **temporary workaround** for upstream LLVM poor codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.

## Problem

When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb` instruction. This causes a **~22x performance regression**.

Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.

## Root cause (upstream)

The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns.

An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906

Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.

## Solution

Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.
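A minimal sketch of the intrinsic-based approach, for readers who want to see its shape. This is not the implementation from the PR: the function name `is_ascii_sse2_sketch`, the 32-byte chunking, and the scalar tail loop are assumptions made for illustration. Only the three intrinsics named in the commit message are used.

```rust
/// Hypothetical sketch, not the PR's code: ORs two 16-byte vectors together
/// and uses `_mm_movemask_epi8` to collect the high bit of every byte.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn is_ascii_sse2_sketch(bytes: &[u8]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128};

    const LANE: usize = 16; // bytes per SSE2 register
    const CHUNK: usize = 2 * LANE; // assumed chunk size for this sketch

    let mut chunks = bytes.chunks_exact(CHUNK);
    for chunk in &mut chunks {
        let ptr = chunk.as_ptr();
        // SAFETY: `chunk` is exactly 32 bytes long, so both unaligned 16-byte
        // loads stay in bounds. SSE2 is statically enabled by the `cfg` above.
        let mask = unsafe {
            let lo = _mm_loadu_si128(ptr.cast::<__m128i>());
            let hi = _mm_loadu_si128(ptr.add(LANE).cast::<__m128i>());
            // One bit per byte lane: set if either vector has the high bit set.
            _mm_movemask_epi8(_mm_or_si128(lo, hi))
        };
        if mask != 0 {
            return false;
        }
    }
    // Fewer than 32 bytes remain; check them one by one.
    chunks.remainder().iter().all(|b| b.is_ascii())
}

/// Fallback so the sketch also builds on targets without SSE2.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
fn is_ascii_sse2_sketch(bytes: &[u8]) -> bool {
    bytes.is_ascii()
}
```

Since the mask is produced by an explicit intrinsic rather than by auto-vectorization, the compiler has no counting-loop pattern left to lower through k-registers.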
## Godbolt Links (Rust 1.92)

| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |

## Benchmark Results

**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)

### Default Target (SSE2) - Mixed

| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |

### Native Target (AVX-512) - Up to 24x Faster

| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |

### Summary

- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)

Note: this is the pure ascii path, but the story is similar for the others. See linked bench project.
## Test Plan

- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

## Related Issues

- issue opened by @folkertdev llvm/llvm-project#176906
- Regression introduced in rust-lang#130733
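The old/new comparison from the test plan can be approximated without libfuzzer by a small differential check. The sketch below rests on several assumptions: `is_ascii_counting` is a guess at the general shape of the old counting loop (not copied from the standard library), the reference is whatever `[u8]::is_ascii` the installed toolchain provides, and the random inputs come from a hand-rolled xorshift generator rather than a fuzzer.

```rust
/// Assumed shape of a counting-loop check: count the bytes that are ASCII
/// and compare against the length. Not the actual library code.
fn is_ascii_counting(bytes: &[u8]) -> bool {
    let mut count = 0usize;
    for &b in bytes {
        count += (b <= 127) as usize;
    }
    count == bytes.len()
}

/// xorshift64 step; a tiny deterministic PRNG used only to generate inputs.
fn next(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Returns true if both implementations agree on every generated input.
fn differential_check(iterations: usize) -> bool {
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // any non-zero seed works
    for _ in 0..iterations {
        let len = (next(&mut state) % 128) as usize;
        // Start from a pure-ASCII buffer of random length.
        let mut buf: Vec<u8> = (0..len).map(|_| (next(&mut state) & 0x7f) as u8).collect();
        // In roughly half of the cases, set the high bit of one random byte.
        if len > 0 && next(&mut state) & 1 == 1 {
            let idx = (next(&mut state) % len as u64) as usize;
            buf[idx] |= 0x80;
        }
        if is_ascii_counting(&buf) != buf.is_ascii() {
            return false;
        }
    }
    true
}
```

A coverage-guided fuzzer explores inputs far more effectively than this loop; the sketch only shows the comparison being made.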
…r=folkertdev
hexagon: Add HVX target features

This will be used in order to emit HVX intrinsics
…sts-linux, r=Kobzol
Enable reproducible binary builds with debuginfo on Linux

Fixes rust-lang#89911

This PR enables `-Cdebuginfo=2` for binary crate types in the `reproducible-build` run-make test on Linux platforms.

- Removed the `!matches!(crate_type, CrateType::Bin)` check in `diff_dir_test()`
- SHA256 hashes match: `932be0d950f4ffae62451f7b4c8391eb458a68583feb11193dd501551b6201d4`

This scenario was previously disabled due to rust-lang#89911. I have verified locally on Linux (WSL) with LLVM 21 that the regression reported in that issue appears to be resolved, and the tests now pass with debug info enabled.
…r=GuillaumeGomez
Add "Skip to main content" link for keyboard navigation in rustdoc

## Summary

This PR adds a "Skip to main content" link for keyboard navigation in rustdoc, improving accessibility by allowing users to bypass the sidebar and navigate directly to the main content area.

## Changes

- **`src/librustdoc/html/templates/page.html`**: Added a skip link (`<a class="skip-main-content">`) immediately after the `<body>` tag that links to `#main-content`
- **`src/librustdoc/html/static/css/rustdoc.css`**: Added CSS styles for the skip link:
  - Visually hidden by default (`position: absolute; top: -100%`)
  - Becomes visible when focused via Tab key (`top: 0` on `:focus`)
  - Styled consistently with rustdoc theme using existing CSS variables
- **`tests/rustdoc-gui/skip-navigation.goml`**: Added GUI test to verify the skip link functionality

## WCAG Compliance

This addresses **WCAG Success Criterion 2.4.1 (Level A)** - Bypass Blocks:

> A mechanism is available to bypass blocks of content that are repeated on multiple web pages.

## Demo

When pressing Tab on a rustdoc page, the first focusable element is now the "Skip to main content" link, allowing keyboard users to jump directly to the main content without tabbing through the entire sidebar.

## Future Improvements

Based on the discussion in rust-lang#151420, additional skip links could be added between the page summary and module contents sections. This PR provides the foundation, and we can iterate on adding more skip links based on feedback.

Fixes rust-lang#151420

r? @JayanAXHF
…s-under-feature-gate-const-bool, r=jhpratt
constify boolean methods
```rs
// core::bool
impl bool {
pub const fn then_some<T: [const] Destruct>(self, t: T) -> Option<T>;
pub const fn then<T, F: [const] FnOnce() -> T + [const] Destruct>(self, f: F) -> Option<T>;
pub const fn ok_or<E: [const] Destruct>(self, err: E) -> Result<(), E>;
pub const fn ok_or_else<E, F: [const] FnOnce() -> E + [const] Destruct>(self, f: F) -> Result<(), E>;
}
```
I will open a tracking issue if this PR is well received.
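For context, here is how two of these methods behave at runtime on stable Rust. The sketch deliberately avoids `const` contexts, since the const forms above sit behind an unstable feature gate, and it leaves out `ok_or` and `ok_or_else`, which are not assumed to be stable. The helper names `label_if` and `lazy_len` are hypothetical.

```rust
/// `then_some` evaluates its argument eagerly and wraps it when `flag` is true.
fn label_if(flag: bool) -> Option<&'static str> {
    flag.then_some("enabled")
}

/// `then` only runs the closure when `flag` is true.
fn lazy_len(flag: bool, s: &str) -> Option<usize> {
    flag.then(|| s.len())
}
```

Constifying these methods would allow the same expressions inside `const fn` bodies and `const` items, subject to the `[const]` bounds shown in the signatures.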
Don't use default build-script fingerprinting in `test`

This changes the `test` build script so that it does not use the default fingerprinting mechanism in cargo which causes a full scan of the package every time it runs. This build script does not depend on any of the files in the package.

This is the recommended approach for writing build scripts.
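A minimal sketch of how a build script opts out of the default fingerprinting, assuming the usual Cargo convention: once a script prints any `rerun-if-changed` directive, Cargo stops scanning the whole package and tracks only the listed paths. The helper name `emit_fingerprint_directives` is hypothetical, and this is not necessarily the exact line added in the PR.

```rust
use std::io::{self, Write};

/// Hypothetical helper for a `build.rs`: tells Cargo to rerun the script only
/// when `build.rs` itself changes, instead of rescanning the whole package.
fn emit_fingerprint_directives(out: &mut impl Write) -> io::Result<()> {
    writeln!(out, "cargo:rerun-if-changed=build.rs")
}

// In an actual build script, `main` would forward this to stdout, where
// Cargo reads the directives:
//
//     fn main() {
//         emit_fingerprint_directives(&mut std::io::stdout()).unwrap();
//     }
```

Writing to a generic `Write` keeps the sketch checkable without running Cargo; a real build script would normally just call `println!` directly.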
…i-test, r=Ayush1325,tgross35
Fix compilation of std/src/sys/pal/uefi/tests.rs

Dropped the `align` test since the `POOL_ALIGNMENT` and `align_size` items it uses do not exist.

The other changes are straightforward fixes for places where the test code drifted from the current API, since the tests are not yet built in CI for the UEFI target.

CC @Ayush1325
@bors r+ rollup=never p=5
📌 Perf builds for each rolled up PR:
previous master: 87b2721871

In the case of a perf regression, run the following command for each PR you suspect might be the cause:
What is this?
This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 87b2721 (parent) -> a18e6d9 (this PR)

Test differences
Show 79 test diffs (Stage 0, Stage 1, Stage 2)

Additionally, 50 doctest diffs were found. These are ignored, as they are noisy.

Job group index

Test dashboard
Run `cargo run --manifest-path src/ci/citool/Cargo.toml -- test-dashboard a18e6d9d1473d9b25581dd04bef6c7577999631c --output-dir test-dashboard` and then open the dashboard generated in the `test-dashboard` directory.

Job duration changes
How to interpret the job duration changes?
Job durations can vary a lot, based on the actual runner instance
Finished benchmarking commit (a18e6d9): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

@rustbot label: -perf-regression

Instruction count
Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

Max RSS (memory usage)
Results (primary 3.1%, secondary -0.8%)
A less reliable metric. May be of interest, but not used to determine the overall result above.

Cycles
Results (secondary -1.2%)
A less reliable metric. May be of interest, but not used to determine the overall result above.

Binary size
Results (primary 0.1%)
A less reliable metric. May be of interest, but not used to determine the overall result above.

Bootstrap: 471.473s -> 469.207s (-0.48%)
Successful merges:
- #151551 (Don't use default build-script fingerprinting in `test`)

r? @ghost