Releases: huggingface/kernel-builder
v0.8.0
New features
Support Metal 4 on macOS
As of this release, kernel-builder builds Metal kernels with Metal 4 support. The minimum required SDK and macOS versions are both 26. For more information on how to set up a development environment, see our Metal docs.
Experimental support for Python dependencies
This version adds support for kernel Python dependencies. So far, we mostly considered kernels to be either pure PyTorch + Triton or compiled CUDA/ROCm/XPU with a small Torch wrapper. This assumption made kernels easy to deploy everywhere, since they do not have external dependencies. However, DSLs for writing kernels, such as the CUTLASS DSL, are becoming increasingly popular.
To accommodate such DSLs without bringing back the issues that dependencies have, we allow a small, curated set of dependencies. Currently the only allowed dependencies are einops and nvidia-cutlass-dsl. Dependencies can be added using the new python-depends option of the general section in build.toml:
[general]
name = "my-kernel"
# ...
python-depends = ["nvidia-cutlass-dsl"]
The dependencies are also validated by kernels when a kernel that uses dependencies is downloaded.
build-and-upload
A new build-and-upload command is added that builds and uploads a kernel in one go. If the kernel is not in kernels-community, you can specify the upload location in general.hub:
[general.hub]
repo-id = "my-org/my-kernel"
Flattened build directories
Thus far, kernels were stored in build/<variant>/<module_name>. This version of kernel-builder changes this to build/<variant>. This solves the issue where a kernel cannot be loaded when module_name does not match the repository name (e.g. after a rename). For the next few releases, kernel-builder will put a compatibility module at build/<variant>/<module_name> to make sure that a kernel can be loaded with an older version of kernels.
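The difference between the two layouts can be sketched with plain path arithmetic. The helper names and the module name my_kernel are hypothetical; the variant name is one that appears elsewhere in these notes. This only illustrates the directory shapes, not how kernels resolves them.

```python
from pathlib import PurePosixPath


def old_build_dir(variant: str, module_name: str) -> PurePosixPath:
    """Layout before this release: build/<variant>/<module_name>."""
    return PurePosixPath("build") / variant / module_name


def new_build_dir(variant: str) -> PurePosixPath:
    """Layout from this release on: build/<variant>."""
    return PurePosixPath("build") / variant


variant = "torch28-cxx11-rocm64-x86_64-linux"
module_name = "my_kernel"  # hypothetical module name

print(old_build_dir(variant, module_name))
print(new_build_dir(variant))
```

The new path does not depend on module_name at all, which is why a renamed repository no longer affects where the kernel is found.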
What's Changed
- misc(builder): enable detection of ARM64 arch on Windows and turn on correct VS / CMake environments by @mfuntowicz in #272
- fix(windows): always define _WIN32 preprocessor macro to prevent PyTorch compiling unsupported code by @mfuntowicz in #275
- bug(windows): fix invalid generated build name by @mfuntowicz in #274
- feat(windows): allow detecting Python executable by @mfuntowicz in #276
- Missing Windows knobs to make it compatible with kernels by @mfuntowicz in #277
- do not use ONEDNN_XPU_INCLUDE_DIR since it's only needed for torch2_7. by @sywangyi in #273
- Add build-and-upload command by @danieldk in #278
- Remove examples/activation by @danieldk in #261
- fix(build2cmake): ignore untracked files when looking for modified files to suffix with _dirty by @mfuntowicz in #280
- feat(windows): do not include cxx11 ABI flag when generating names by @mfuntowicz in #281
- Remove duplicate build variant name code by @danieldk in #285
- Add support for building CPU-only kernels by @danieldk in #284
- Remove Python bytecode after checks by @danieldk in #286
- Use correct Python interpreter for metallib_to_header by @danieldk in #288
- Fix metal kernels support by @MekkCyber in #287
- Switch to binary Torch wheels by @danieldk in #289
- Update to macOS SDK 26 and Metal 4 by @danieldk in #290
- Add doc on the required environment for Metal by @danieldk in #292
- Also remove bytecode from universal builds by @danieldk in #294
- Include CPU kernels in CI builds by @danieldk in #296
- Flatten build variants to build/<variant> by @danieldk in #293
- Allow dashes in kernel names by @danieldk in #297
- extensionName -> moduleName by @danieldk in #298
- feat: support metal cpp by @drbh in #295
- Add support for (limited) Python dependencies: nvidia-cutlass-dsl and einops by @danieldk in #302
- Copy over Torch from hf-nix and fix the AArch64 build by @danieldk in #304
- Remove dependency on hf-nix by @danieldk in #305
- fix(windows): force USE_CUDA/USE_ROCM definitions to ensure PyTorch guards are not bypassed by @mfuntowicz in #303
- Fix typos by @omahs in #309
- Extend cutlass to bmg by @sywangyi in #307
- Update tracing-subscriber to solve dependabot issue by @danieldk in #310
- gen-flake-outputs: add backendBundle output by @danieldk in #312
- Add a Discord link by @danieldk in #313
- Set version to 0.8.0-dev0 by @danieldk in #315
New Contributors
Full Changelog: v0.7.0...v0.8.0
v0.7.0
New features
PyTorch 2.9.0 support
kernel-builder now builds kernels for PyTorch 2.8.0 and 2.9.0 by default. Support for PyTorch 2.7.0 was removed, conforming to our policy to support the latest two releases.
Windows builder
This release contains experimental support for building Windows kernels. Since Nix is not supported on Windows, the separate PowerShell script scripts/windows/builder.ps1 is provided to build kernels on Windows.
Binary Torch wheels
kernel-builder now supports building against binary Torch wheels. This speeds up roll-out of support for new Torch versions or vendor-specific Torch builds. Build variants that do not have sourceBuild = true set will use a Torch binary wheel. We will soon switch over to using binary wheels as a default.
What's Changed
- Add cutlass_4_0 as a dependency by @danieldk in #229
- Add onednn support for XPU by @sywangyi in #227
- README: remove Torch 2.7 note by @danieldk in #231
- Append sycl flags for cutlass in 2025.2.x by @sywangyi in #233
- ROCm: add rocwmma-devel by @danieldk in #236
- misc(cmake): rely on the more robust Python3 module to handle Python detection by @mfuntowicz in #238
- Add non-bundle torch28-cxx11-rocm64-x86_64-linux variant by @danieldk in #240
- Add Torch 2.9 build variants by @danieldk in #242
- fix: update docker files and readme command by @drbh in #239
- ROCm: apply specified archs to sources by @danieldk in #237
- Add target tiers by @drbh in #241
- Update to Torch 2.9.0-rc6 by @danieldk in #245
- Make separate forCache and forCacheNonBundle packages by @danieldk in #244
- Update hf-nix input by @danieldk in #248
- Select default shells from available variants by @danieldk in #247
- Add kernels package/command to generated flake output by @danieldk in #251
- Remove kernel-compliance-check by @danieldk in #253
- Add ci output for generated flakes by @danieldk in #254
- do not use ENV(DPCPP_VERSION) which is not friendly to local build(ne… by @sywangyi in #255
- Restructure extension to cache extra ROCm and XPU dependencies by @danieldk in #257
- Bump hf-nix for kernels 0.10.3 by @danieldk in #259
- Fix Nix invocations in the README and Dockerfiles by @danieldk in #260
- Use binary wheels for Torch by @danieldk in #252
- Small comment/doc fixes by @danieldk in #263
- Introducing a way to builder kernels for Windows platforms by @mfuntowicz in #250
- Update hf-nix for kernels 0.10.4 by @danieldk in #265
- Check that cuda-flags/sycl-flags are properly passed by @danieldk in #264
- feat: Add support for ROCm/HIP flags by @shadeMe in #262
- CI: sandboxed builds and enable Cachix pushes by @danieldk in #266
- Hotfix CUDA devshell by switching back to source build by @danieldk in #268
- Hotfix ROCm and XPU devshells by @danieldk in #269
- Update to final Torch 2.9.0 by @danieldk in #270
- Set version to 0.7.0-dev0 by @danieldk in #271
New Contributors
- @mfuntowicz made their first contribution in #238
- @shadeMe made their first contribution in #262
Full Changelog: v0.6.2...v0.7.0
v0.6.2
New Features
Intel XPU support
This release of kernel-builder adds XPU support. Many thanks to @sywangyi for implementing this! You can use the xpu backend type in build.toml for XPU kernels. For example:
[kernel.activation_xpu]
backend = "xpu"
depends = ["torch"]
src = ["relu_xpu/relu.cpp"]
The ReLU example kernel shows how you can make a kernel that supports the CUDA, ROCm, and XPU backends.
kernel-abi-check Python binding
kernel-abi-check now also has a Python binding. This will be used by the upcoming kernels check subcommand.
API changes
Prior to this version, a kernel would have to provide the Git revision to genFlakeOutputs in its flake.nix. For example:
kernel-builder.lib.genFlakeOutputs {
path = ./.;
rev = self.shortRev or self.dirtyShortRev or self.lastModifiedDate;
};
Starting with version 0.6.2, kernel-builder determines the revision. Instead, a kernel has to pass through the flake itself (self):
kernel-builder.lib.genFlakeOutputs {
inherit self;
path = ./.;
};
The old invocation of genFlakeOutputs still works with a warning, but will be deprecated in the future.
What's Changed
- Add XPU support by @danieldk in #210
- Add xpu to the docs by @danieldk in #211
- Cache build2cmake and kernel-abi-check by @danieldk in #213
- Improve cutlass-sycl support by @danieldk in #214
- Fix a regression in test shells by @danieldk in #217
- build-and-copy: correctly get variant by @danieldk in #218
- build-and-copy: copy build from the bundle output, not result/ by @danieldk in #219
- Add a license by @danieldk in #220
- add dnnl to the link library, some kernels need onednn by @sywangyi in #224
- Add initial kernel building security guidelines by @danieldk in #216
- Let kernel-builder determine the kernel revision by @danieldk in #221
- hotfix: add onednn-xpu to the build inputs by @danieldk in #226
- Add a Python binding for kernel-abi-check by @danieldk in #225
- Add kernel-abi-check-python release workflow by @danieldk in #228
Full Changelog: v0.6.1...v0.6.2
v0.6.1
New Features
build-and-copy command
Before this release one had to build a kernel with nix build first and then copy the build variants from result to build. This can now be done in a single step with build-and-copy:
$ nix run .#build-and-copy -L
Automatic virtual environment for nix develop
Running nix develop in a kernel will now automatically create a virtual environment in .venv (if it does not exist) and activate it.
Docs
examples/relu-backprop-compile provides an example on how to make a kernel with backprop and torch.compile support.
What's Changed
- Disable cachix pushes, sandboxing is not enabled by @danieldk in #198
- [XPU]Add support for cutlass-sycl by @danieldk in #200
- Fix handling of the 9.0a and 12.0a capabilities by @danieldk in #202
- Update hf-nix and remove sanitiseHeaderPathsHook workaround by @danieldk in #201
- kernel devshell: automatically create venv by @danieldk in #205
- Move kernel flake outputs generation to a separate file by @danieldk in #207
- Add a full ReLU example with backprop and torch.compile support by @danieldk in #206
- Add build-and-copy package by @danieldk in #208
Full Changelog: v0.6.0...v0.6.1
v0.6.0
New features
PyTorch 2.8 support
kernel-builder now supports PyTorch 2.8 in the following (upstream) build configurations:
- CUDA 12.6, 12.8, and 12.9 on aarch64-linux and x86_64-linux.
- ROCm 6.3 and 6.4 on x86_64-linux.
- Metal on aarch64-darwin (macOS).
Following the kernel-builder support policy, support for Torch 2.6 is removed.
Additional compliance testing
Besides the ABI checks (manylinux and abi3 compliance), kernel-builder now also checks whether the kernel can be loaded by the kernels package. This ensures, among other things, that imports are relative. This check can be an issue with some Triton kernels that use the autotune decorator, since the build sandbox does not have access to GPUs. In this case the check can be disabled by passing doGetKernelCheck = false to genFlakeOutputs.
Support for generating PTX
When defining CUDA capabilities, it is now possible to add the +PTX suffix to generate PTX code. For example:
cuda-capabilities = [ "7.0", "8.0+PTX"]
When no CUDA capabilities are specified for a kernel, PTX is generated for capability 9.0 (and 12.0 on CUDA >= 12.8).
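To make the suffix convention concrete, the following sketch splits a capability string into its version and a flag saying whether PTX was requested. The function name parse_capability is hypothetical and this is not how build2cmake parses the option; it only mirrors the syntax shown above.

```python
PTX_SUFFIX = "+PTX"


def parse_capability(spec: str) -> tuple[str, bool]:
    """Split a capability like "8.0+PTX" into ("8.0", True).

    Hypothetical helper: capabilities without the suffix map to
    (capability, False).
    """
    spec = spec.strip()
    if spec.endswith(PTX_SUFFIX):
        return spec[: -len(PTX_SUFFIX)], True
    return spec, False


cuda_capabilities = ["7.0", "8.0+PTX"]
parsed = [parse_capability(spec) for spec in cuda_capabilities]

for capability, with_ptx in parsed:
    print(f"{capability}: PTX requested = {with_ptx}")
```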
What's Changed
- feat: include cachix instructions in readme by @drbh in #178
- Add check that imports the kernel with kernels by @danieldk in #179
- Make get-kernel-check work on macOS by @danieldk in #180
- Embed compiled metal kernels into binary by @EricLBuehler in #181
- Make hipify_sources_target work with multiple include dirs by @danieldk in #183
- Add doGetKernelCheck option to genFlakeOutputs by @danieldk in #188
- Set HOME in get-kernel-check-hook by @danieldk in #189
- CI: try to build a macOS kernel by @danieldk in #184
- kernel-abi-check: improve description by @danieldk in #190
- add xpu build support by @sywangyi in #185
- feat: Docker build with buildx for cross compilation by @drbh in #191
- Pass default versions set torchVersions by @danieldk in #193
- Support +PTX in cuda capabilities by @danieldk in #196
- Add Torch 2.8, remove Torch 2.6 by @danieldk in #182
New Contributors
Full Changelog: v0.5.2...v0.6.0
v0.5.2
This release contains changes for handling more complex kernels:
- Support minimum CUDA versions for 'subkernels' (kernel.<name>) for when a kernel has specializations for e.g. Blackwell.
- Support passing custom flags to the C++ compiler.
- Add support for building kernels for a subset of CUDA versions. Using this leads to non-compliant kernels, so it should only be used as a last resort.
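The effect of a minimum and maximum CUDA version can be pictured as a filter over the available CUDA versions. The sketch below is an assumption about the semantics, not kernel-builder's implementation: the function names are hypothetical, both bounds are treated as inclusive, and the version list is only an example.

```python
from typing import Optional


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "12.8" into (12, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def select_cuda_versions(
    available: list[str],
    minver: Optional[str] = None,
    maxver: Optional[str] = None,
) -> list[str]:
    """Keep versions within [minver, maxver]; bounds assumed inclusive."""
    selected = []
    for version in available:
        parsed = parse_version(version)
        if minver is not None and parsed < parse_version(minver):
            continue
        if maxver is not None and parsed > parse_version(maxver):
            continue
        selected.append(version)
    return selected


available = ["12.6", "12.8", "12.9"]  # example versions only
print(select_cuda_versions(available, minver="12.8"))
print(select_cuda_versions(available, maxver="12.8"))
```

Comparing parsed tuples rather than strings avoids ordering mistakes such as "12.10" sorting before "12.9".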
What's Changed
- Add cuda-maxver option to the general section by @danieldk in #170
- build2cmake: cxx-flags option for C++ compile flags for kernels by @danieldk in #171
- hotfix: cuda-maxver by @danieldk in #172
- hotfix: cuda-maxver nit in Nix by @danieldk in #173
- Add support for building for a custom set of Torch versions by @danieldk in #174
- Add cuda-minver option for CUDA kernels by @danieldk in #176
- Set build2cmake and kernel-abi-check to 0.5.2 for release prep by @danieldk in #177
Full Changelog: v0.5.1...v0.5.2
v0.5.1
This release contains various bugfixes.
What's Changed
- Add cuda-minver option to the general section by @danieldk in #165
- Darwin: rewrite Nix store paths by @danieldk in #167
- build2cmake: remove metallib install by @EricLBuehler in #168
- Dockerfile improve local path by @drbh in #166
- Set build2cmake and kernel-abi-check to 0.5.1 for release prep by @danieldk in #169
Full Changelog: v0.5.0...v0.5.1
v0.5.0
This release adds support for building Metal kernels for Apple Silicon Macs. To accommodate non-CUDA/ROCm kernels, the build.toml format has been updated. You can update an existing build.toml using build2cmake:
$ build2cmake update-build /path/to/build.toml
You can also directly run this command with Nix:
$ nix run github:huggingface/kernel-builder/v0.5.0#update-build /path/to/build.toml
What's Changed
- Update the build.toml format in preparation for Metal by @danieldk in #144
- Provide better errors when deserializing build.toml by @danieldk in #145
- feat: built root and user docker image variants by @drbh in #139
- Add basic support for building Metal 🤘 kernels by @danieldk in #146
- Add support for building macOS Metal kernels by @danieldk in #147
- fix: adjust the update build command in the container by @drbh in #149
- Enable Metal as part of bundle builds by @danieldk in #151
- Propagate ABI check errors and fix on macOS by @danieldk in #154
- feat: allow precompilation for metal kernels by @EricLBuehler in #152
- build2cmake: add clean subcommand by @EricLBuehler in #156
- kernel-abi-check: check macOS minimum version by @danieldk in #157
- Add cutlass 3.9 as a dependency by @danieldk in #159
- build2cmake: cuda_flags option for compile flags for CUDA kernels by @danieldk in #160
- Update build.toml docs by @danieldk in #161
- Accept checkInputs and nativeCheckInputs in genFlakeOutputs by @danieldk in #155
- Small Nix documentation improvements by @danieldk in #162
- Hotfix: append CUDA flags by @danieldk in #163
- Set build2cmake and kernel-abi-check to 0.5.0 for release prep by @danieldk in #164
New Contributors
- @EricLBuehler made their first contribution in #152
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- Add a CUDA 12.9 build variant for Torch 2.7 by @danieldk in #136
- feat: update docker for remote build and push by @drbh in #115
- build2cmake: attempt to get shorthash-based ops id using git by @danieldk in #137
- Standardizing torch_binding.cpp and torch_binding.h in the doc by @MekkCyber in #138
- Bump nixpkgs to version with cuDNN sbsa by @danieldk in #140
- Switch to the to-be hf-nix repo by @danieldk in #141
Full Changelog: v0.3.0...v0.4.0