@ipanfilo commented Jan 13, 2026

Description

IFU (inter-fork update): TE 2.8.0.dev0, commit 7f77127 from 2025-09-18

Fixes #14813

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

KshitijLakhani and others added 30 commits July 21, 2025 01:04
Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Remove GH pinned deps

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Pin onnxscript

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Reset FP8 weight workspace if usages are invalid

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Fix the condition error of the FP8 attention in `get_attention_backend` (#1965)

Update utils.py

Fix the condition error of the FP8 attention in `get_attention_backend`

Signed-off-by: yuzhongw-nvidia <yuzhongw@nvidia.com>
Co-authored-by: Xiaowei Ren <103958965+xrennvidia@users.noreply.github.com>
* exclude 9.10.0/.1 for certain configs

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix kv_channels

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add get_backend to tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add init files

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix numerics and cuda graph tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix jax tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor changes after renaming

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix import structure and rename get_attention_backends

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix docs and benchmarks

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix get backend calls

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix get backend calls"

This reverts commit 653cbb51c697bc2f975416bb3aac1d85f76c36dc.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "fix docs and benchmarks"

This reverts commit 98cd52e04ff7c53e26b412195f5744e39f7ed0e9.
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix docs, benchmarks and pre-commit ci

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dpa/mha flash attn selection

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix rng states

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix backend selection on Ampere

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix issues from last merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/utils.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove initialization of rng_states to None

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* redefine ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix ModelConfig

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix seed for CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Update tests/pytorch/test_sanity.py

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move fixture from utils to individual tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Debug linear layer when saving original input and using debug quantizer (#1963)

* Debug linear layer when saving original input and using debug quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused imports

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid unnecessary row-wise data

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Workaround bugs with quantizing with only column-wise usage

FP8 does not support transpose-only cast (one plausible workaround is sketched after this commit's sign-offs).

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
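
A minimal sketch of one plausible reading of the workaround above, in plain PyTorch rather than TE's internal quantizer API (the helper name and the use of `torch.float8_e4m3fn` are assumptions): since the FP8 kernels cannot cast straight into a transposed, column-wise-only layout, quantize row-wise first and then materialize the transpose of the quantized data.

```python
import torch

def quantize_columnwise_via_rowwise(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Row-wise FP8 cast is supported; a transpose-only cast is not,
    # so cast first and then transpose the already-quantized result.
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8.t().contiguous()
```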
* Fixed conflicts

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor code refactoring to avoid unnecessary checks

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed dBias accumulation error due to initialization. Minor code refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test case to reproduce the init error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed rowwise dbias error

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed ptx API

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added a struct for two packed FP8 values

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Rolled back to scalar code for columnwise scaling due to its better performance

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor corrections

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rebased on main

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes per code review

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed constexpr in C++ test suite to build faster

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Computed activations are now numerically truncated to InputType before scaling. Improved test suite.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Minor refactoring

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Modified mismatches checks of MXFP8 to address FP8 numerics

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Implemented Jeremy's fixes to JAX test suite with an intermediate downcast

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reduced the dims of the test tensors to improve CI runtime

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* Fixed memory alignment issue. Compute dbias without downcast.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed misaligned memory issue also in gated kernels. Reduced size of MXFP8 gated tests

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Refactor _OperationFuserAutogradFunction.forward to use fewer parameters

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit f8f59b1bb184e89468058521df4cfff029ad909c)

* Rename `BackwardBiasActivation` to `BackwardActivationBias`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 397c58fc296f801fe4ad600aadc2daff3b78be45)

* Use forward operation order in backward fused operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2d37a9385069b066e6cdeff3eb9173c2079cb791)

* Rename `prev_op_grad_input_quantizer` to `prev_op_grad_output_quantizer`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit d7ab5dfb23e216866f7f4fc4d7a99f625d329f1e)

* Make OperationFuser persistent

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 77984d9715d31e87519dc6ea1e02c483a81355a7)

* Distribute extra inputs to and collect extra outputs from multiple module groups in Sequential

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 0716aaad542e59f2c1ac4620167965a0334bbf71)

* Take requires_grad into account when fusing operations

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Change get_quantizer to return None if no quantization recipe is used

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Refactor pre_first_forward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix for failing `test_make_graphed_callables[fp8_recipe0-*-True-*-linear_op]`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix linting errors

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Apply suggestions from code review

Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fix fp8 meta tensors in CUDA Graph capture

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix failing distributed userbuffers tests

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…1979)

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
Fix bug where TE ops were not updating fp8_meta dicts when amax history length changes (#1985)

* Fix bug where TE ops were not updating fp8_meta dicts

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename reset_recipe_state function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update error message when initializing meta device quantized weight without recipe

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix current device for cuDNN/cuBLAS handles

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use weight device and improve tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix current scaling test_helper.py and enable test_helper.py in L0 (#1990)

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Disable GEMM primitives for non-MXFP8 recipes. (#1962)

* add manage_primitives() helper

* disable GEMM primitives for non-MXFP8 recipes

* implement the NVTE_JAX_CUSTOM_CALLS + deprecate NVTE_JAX_CUSTOM_CALLS_RE

* replace NVTE_JAX_CUSTOM_CALLS_RE with NVTE_JAX_CUSTOM_CALLS in TE tests and examples

* fix use_jax_gemm contextmanager

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Increase intermediate precision and reuse tensors from fwd

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* JIT warmup only when required

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Recompute only rsqrt_norm

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fix cuDNN lib runtime loading and simplify

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Mark output tensors as not deletable in backward

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Add `in_place` kwarg to `MakeExtraOutput`

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Rename `AddInPlace` to `AddExtraInput` and add an `in_place` kwarg

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Fix cuDNN versioning support in PyTorch DPA and fused attention

Signed-off-by: Kshitij Janardan Lakhani <klakhani@nvidia.com>
* Fixed integer overflow when computing offsets (the failure class is sketched after this commit's sign-offs)

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
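
An illustrative sketch of the failure class behind this fix (the real change is in the CUDA/C++ code, where 32-bit arithmetic overflows once an offset product exceeds 2^31 - 1; the helper below is hypothetical):

```python
import torch

def chunk_offsets(num_chunks: int, chunk_elems: int) -> torch.Tensor:
    # Accumulate offsets in 64-bit: a 32-bit product silently wraps
    # once num_chunks * chunk_elems exceeds 2**31 - 1 elements.
    return torch.arange(num_chunks, dtype=torch.int64) * int(chunk_elems)
```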
…elism correctly for sequence-parallel inputs (#1980)

* updated GemmPrimitive partitioning rules to explicitly control all-reduce vs. reduce-scatter for sequence-parallelism

Signed-off-by: Alp Dener <adener@nvidia.com>

* corrected handling of FSDP sharding for the RHS operand

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use correct logical axes variable to identify sequence-parallel dim in LayerNormDenseGeneral

Signed-off-by: Alp Dener <adener@nvidia.com>

* fixed linting issues

Signed-off-by: Alp Dener <adener@nvidia.com>

* added assert on sequence-parallel options when GemmPrimitive is disabled

Signed-off-by: Alp Dener <adener@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* optimize static grad outputs

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Support RMSNorm for QK

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rms -> RMSNorm, l2 -> L2Normalization (align with current pattern)

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Support LayerNorm + init refactor

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* Before/after RoPE

Signed-off-by: Evgeny <etsykunov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pylint

Signed-off-by: Evgeny <etsykunov@nvidia.com>

---------

Signed-off-by: Evgeny <etsykunov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…#1994)

* Remove deprecated device arg

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Remove test

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixed double buffering issue for asymmetric layers

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add ops for dropout and constant scale

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add verbosity only for failing tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune some tests and preinit recipe

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Prune further tests

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* fix multitensor

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Minor fixes

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix a100

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* remove reciprocal op

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Refactor Quantizer::create_tensor function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug when constructing FP8 tensor

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add quantize function to C++ quantizers

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Prototype function to coerce Python quantized tensors to match quantizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use quantizer class in tex.quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add FP8 current scaling support for activation backward

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable quantized GEMM output with FP8 current scaling

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add coerce_tensor functions for MXFP8 and DSv3

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Avoid quantizing empty tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use consistent shapes for FP8 transposes

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* In attention impl, construct FP8 tensors with pre-initialized scale-invs

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Initialize MXFP8 scales to zero

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Store copy of quantizer when creating quantized tensors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Make sure quantized tensors have private quantizer

Avoid problems with in-place ops after quantizer usages are changed externally.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Rename "coerce_tensor" to "convert_and_update_tensor"

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make sure CUDA context is available when launching NVRTC kernel

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expose CUDA context creation function externally

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Compute amax in activation kernels when the output pointer is provided, even for non-FP8 outputs (reference semantics are sketched after this commit's sign-offs)

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 9f13fe2fefc58cae93bc467d87d01ecf792a0381)

* Initialize metatensor values

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* Fuse computation of amax into the activation kernel for fp8 current scaling

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 2b54327ac9c931a5340983a79e99de5caa0399dd)
Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Zero out amax in `create_hp_tensor_with_amax` instead of relying on `Float8CurrentScalingQuantizer.__init__` to zero-initialize it

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
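
A hedged, high-level reference for what the fused kernel computes (plain PyTorch; `gelu_with_amax` and its signature are illustrative, not the actual extension API):

```python
import torch

def gelu_with_amax(x: torch.Tensor, amax_out: torch.Tensor | None = None) -> torch.Tensor:
    # Apply the activation; when an amax buffer is provided, also record
    # max(|y|), even though y itself stays in high precision (non-FP8).
    y = torch.nn.functional.gelu(x)
    if amax_out is not None:
        amax_out.copy_(y.abs().amax())
    return y
```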
Fix merge conflict bug with clearing op outputs

Signed-off-by: Tim Moon <tmoon@nvidia.com>
djns99 and others added 12 commits September 16, 2025 11:32
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
* Test working as I think it should work

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* revert accidental change

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Restrict the number of cases for unfused quantization; some FP8->FP8 cases are handled by cuBLAS

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

fix merge conflict

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

bug: missed a } in the code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Add cuBLASMp-backed GEMM-like API to TE common (#1824)

* Pick up cuBLASMp during build

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Change lib order to fix link error

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Context creation, incomplete...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Test fixure

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* A sanity AgGemm test, failing...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix axes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Take care of uneven distribution

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use MPI to get position of local matrices

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Refactor

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Refactor & fixes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Gemm-RS

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Gemm-AR, not working...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fixes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Setting all-reduce epilogue for gemm-ar

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use supported shapes for GEMM-AR

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak tolerance

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* First shot at fp8

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use TensorHolder in tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More test configs

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Support comm_sm_count

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Parametrize dtypes for A, B and D separately

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak scaling

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Amax ptr

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Flags parity with cublas_gemm, saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Cleanup

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Bias tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix bias test

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Aux, saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* aux_ld

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* A fix

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use test::Tensor

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Set scale inv

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove unsupported test configs

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Replace libcal with NCCL

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Add NVTX markers to API functions

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak GemmAr tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More test config

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix merge fallout

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove MPI dependency, comment API, add algo parameter

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix nvshmem dependency

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix nvshmem build

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Exclude CommGemm tests from L0_cppunittest

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Add cpp_distributed sh file for CI

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Adapt to TensorAllocator

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skip GemmAr test on unsupported HW

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Oversubscription is needed on some clusters

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix incomplete libcal removal

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Move CI tests to L1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Rename context to include NVTE prefix

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove leftover code

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* NVTE_WITH_CUBLASMP off by default

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More detailed NVTE_CHECK diag

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Comment API

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Include stdbool header for legacy C compilers

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove now unused argument

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Abstract away cuBLASMp algo behind our own enum

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More detailed shape diag messages

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/include/transformer_engine/comm_gemm.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>

* Add license

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

---------

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086)

* FP8 AllGather in FP8 GroupedGEMM

1. Support current scaling FP8 quantization with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM (sketched after this commit's sign-offs).

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Slightly refactor

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding documents of new args.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding unit-tests.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding license.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move unit-tests to L1.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move quantizer store/reset into FP8 only.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding all layout support for Blackwell+

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adopt the feedback from code-review.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed the wrong stream used by d2d in groupedGEMM FFI.

Signed-off-by: Ming Huang <mingh@nvidia.com>

---------

Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
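
A hedged sketch of that workflow using plain torch.distributed rather than the TE/JAX implementation (the function name, the e4m3 max magnitude of 448, and the uint8 view for the collective are assumptions):

```python
import torch
import torch.distributed as dist

def fp8_allgather_sketch(x: torch.Tensor, group) -> tuple[list[torch.Tensor], torch.Tensor]:
    # 1) AR-max: all-reduce the local amax so every rank agrees on one scale.
    amax = x.abs().amax().float().reshape(1)
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=group)
    # 2) FP8 quant with the agreed current scale (e4m3 max magnitude ~448).
    scale = 448.0 / torch.clamp(amax, min=1e-12)
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    # 3) FP8 AG: gather the FP8 payload as bytes (half the traffic of bf16).
    payload = x_fp8.view(torch.uint8)
    gathered = [torch.empty_like(payload) for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, payload, group=group)
    # 4) The FP8 GroupedGEMM would consume `gathered` plus `scale` here.
    return [t.view(torch.float8_e4m3fn) for t in gathered], scale
```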

[JAX] Delay MeshResource validation until first usage (#2124)

Delay MeshResource validation until first usage

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[JAX] Decouple Recipe and ScalingMode (#1728)

* Decouple recipe and scaling mode

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128)

* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118)

* add amax input to DBiasQuantizePrimitive and FFI

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make sure amax is init with zero

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix sharding rule

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121)

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Temporarily remove comm_gemm tests (#2133)

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[PyTorch] Disable determinism for sm100 (#2130)

* disable determinism for sm100+ and cudnn<9.14

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix remaining CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert some changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert more changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove sm100 from determinism table

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[PyTorch] ONNX export of FP8 Current Scaling (#2068)

* Compute amax in normalization forward in current scaling in untuned kernels

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* Apply Tim's suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134)

use torch empty for empty shape instead of from_blob

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
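
A Python-level illustration of this fix (the actual change lives in the C++ extension; the helper name is hypothetical): wrapping a null data pointer with `from_blob` for an expert that received zero tokens is unsafe, so an empty tensor is allocated for the degenerate shape instead.

```python
import torch

def expert_buffer(num_tokens: int, hidden: int, device: str = "cuda") -> torch.Tensor:
    if num_tokens == 0:
        # No valid data pointer exists for zero tokens; allocate instead of wrapping.
        return torch.empty((0, hidden), device=device)
    return torch.empty((num_tokens, hidden), device=device)
```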

build: pull cached wheels (#2127)

* build: pull cached wheels

Signed-off-by: oliver könig <okoenig@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update setup.py

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

feat: Add support for multiple quantization modes in the UB communicators (#2043)

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[Common] Add checks to CUDA kernel launch and CUDA API calls (#2074)

* add checks to cuda kernel launch and cuda API calls

Signed-off-by: Xin Yao <xiny@nvidia.com>

* Remove exceptions from destructors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix weird dispatch in ln/rmsnorm

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[PyTorch] Support bf16+fp8 cudagraph (#2098)

* support bf16+fp8 model

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Dropout with 8-bit RNG (#2014)

* Add dropout kernel with 8-bit RNG

Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix license

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid ambiguous types

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Do not enforce dropout prob is representable in 8 bits

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expand error message

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix small statistical bug from using less-equal instead of less-than (see the sketch after this commit's sign-offs)

Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warning

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary helper function in PyTorch extensions

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
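
A sketch of the off-by-one being fixed, in illustrative Python rather than the CUDA kernel: with threshold `t = round(p * 256)` and uniform bytes `r` in `[0, 256)`, the strict comparison `r < t` drops with probability exactly `t/256`, while `<=` would bias it to `(t+1)/256`.

```python
import torch

def dropout_keep_mask_8bit(shape, p_drop: float, device: str = "cuda") -> torch.Tensor:
    t = int(round(p_drop * 256))
    r = torch.randint(0, 256, shape, dtype=torch.uint8, device=device)
    # Drop where r < t (strict); keep the complement. Using <= here was the bug.
    return r >= t
```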

Create GPU reload buffers on main stream (#2131)

* Create GPU reload buffers on main stream (sketched below, after this commit's sign-offs)

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
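
A hedged sketch of the idea in #2131, assuming PyTorch's stream-ordered caching allocator (the helper is hypothetical): allocate the reload buffer while the default stream is current, so its memory is not tied to a transient side stream; the copy itself can still be enqueued on a side stream afterwards.

```python
import torch

def make_reload_buffer(param: torch.Tensor) -> torch.Tensor:
    # Allocate on the main (default) stream so the caching allocator does not
    # associate the buffer's lifetime with a short-lived side stream.
    with torch.cuda.stream(torch.cuda.default_stream()):
        return torch.empty_like(param)

# Usage: enqueue the actual copy on a side stream, e.g.
# buf.copy_(src, non_blocking=True) under that stream, then call
# buf.record_stream(side_stream) so the allocator tracks the usage.
```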

mxfp8 unfused quant support, refined unit test, remove unnecessary quantization code

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

missed a quant code removal

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

minor bug fix

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Add cuBLASMp-backed GEMM-like API to TE common (#1824)

* Pick up cuBLASMp during build

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Change lib order to fix link error

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Context creation, incomplete...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Test fixure

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* A sanity AgGemm test, failing...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix axes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Take care of uneven distribution

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use MPI to get position of local matrices

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Refactor

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Refactor & fixes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Gemm-RS

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Gemm-AR, not working...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fixes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Setting all-reduce epilogue for gemm-ar

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use supported shapes for GEMM-AR

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak tolerance

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* First shot at fp8

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use TensorHolder in tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More test configs

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Support comm_sm_count

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Parametrize dtypes for A, B and D separately

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak scaling

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Amax ptr

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Flags parity with cublas_gemm, saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Cleanup

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Bias tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix bias test

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Aux, saving...

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* aux_ld

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* A fix

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Use test::Tensor

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Set scale inv

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove unsupported test configs

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Replace libcal with NCCL

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Add NVTX markers to API functions

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Tweak GemmAr tests

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More test config

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix merge fallout

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove MPI dependency, comment API, add algo parameter

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix nvshmem dependency

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix nvshmem build

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Excluse CommGemm tests from L0_cppunittest

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Add cpp_distributed sh file for CI

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Adapt tp TensorAllocator

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skip GemmAr test on unsupported HW

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Oversibscribe is needed on some clusters

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Fix incomplete libcal removal

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Move CI tests to L1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Rename context to include NVTE prefix

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove leftover code

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* NVTE_WITH_CUBLASMP off by default

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More detailed NVTE_CHECK diag

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Comment API

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Include stdbool header for legacy C compilers

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Remove now unused argument

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* Abstract away cuBLASMp algo behind our own enum

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* More detailed shape diag messages

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/include/transformer_engine/comm_gemm.h

Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>

* Add license

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

---------

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Signed-off-by: Vladimir Cherepanov <56651474+mk-61@users.noreply.github.com>
Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <ptrendx@gmail.com>

FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (#2086)

* FP8 AllGather in FP8 GroupedGEMM

1. Support current scaling FP8 quantation with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Slightly refactor

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding documents of new args.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding unit-tests.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding license.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move unit-tests to L1.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Move quantizaer store/reset into FP8 only.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding all layout support for Blackwell+

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adopt the feedback from code-review.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fixed the wrong stream used by d2d in groupedGEMM FFI.

Signed-off-by: Ming Huang <mingh@nvidia.com>

---------

Signed-off-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

[JAX] Delay MeshResource validation until first usage (#2124)

Delay MeshResource validation until first usage

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

[JAX] Decouple Recipe and ScalingMode (#1728)

* Decouple recipe and scaling mode

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Expose global QuantizeConfig instance as a getter

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Format and lint

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Merge branch 'main' into dev/jberchtold/jax-scaling-mode-and-recipe-decoupling

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Rename UsageType to TensorSource

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

* Update test_layer.py

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>

---------

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>

[JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128)

* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118)

* add amax input to DBiasQuantizePrimitive and FFI

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make sure amax is init with zero

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* fix sharding rule

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (#2121)

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Temporarily remove comm_gemm tests (#2133)

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[PyTorch] Disable determinism for sm100 (#2130)

* disable determinism for sm100+ and cudnn<9.14

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix remaining CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert some changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert more changes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove sm100 from determinism table

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[PyTorch] ONNX export of FP8 Current Scaling (#2068)

* Compute amax in normalization forward in current scaling in untuned kernels

Signed-off-by: Jan Bielak <jbielak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* code drop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* apply tims suggestions

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Jan Bielak <jbielak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (#2134)

use torch empty for empty shape instead of from_blob

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

build: pull cached wheels (#2127)

* build: pull cached wheels

Signed-off-by: oliver könig <okoenig@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update setup.py

Signed-off-by: oliver könig <okoenig@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

feat: Add support for multiple quantization modes in the UB communicators (#2043)

[Common] Add checks to CUDA kernel launch and CUDA API calls (#2074)

* add checks to cuda kernel launch and cuda API calls

Signed-off-by: Xin Yao <xiny@nvidia.com>

* Remove exceptions from destructors

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix weird dispatch in ln/rmsnorm

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

[PyTorch] Support bf16+fp8 cudagraph (#2098)

* support bf16+fp8 model

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <robinz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <robinz@nvidia.com>

---------

Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

Dropout with 8-bit RNG (#2014)

* Add dropout kernel with 8-bit RNG

Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix license

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid ambiguous types

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Do not enforce dropout prob is representable in 8 bits

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Expand error message

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix small statistical bug from using less-equal instead of less-than

Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warning

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary helper function in PyTorch extensions

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
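
To make the "less-equal instead of less-than" fix above concrete, here is a hedged Python model of dropout driven by an 8-bit RNG; it captures only the statistics, not TE's kernel:

    import torch

    # Each random byte r is uniform over 0..255, so dropping iff r < t gives an
    # exact drop probability of t/256. Dropping iff r <= t would admit one extra
    # byte value, giving (t + 1)/256, the small statistical bias fixed above.
    def dropout_8bit(x: torch.Tensor, p: float) -> torch.Tensor:
        t = int(round(p * 256))
        assert 0 <= t < 256, "p must leave a nonzero keep probability"
        r = torch.randint(0, 256, x.shape, dtype=torch.uint8, device=x.device)
        keep = r >= t                        # drop iff r < t (strict)
        return x * keep / (1.0 - t / 256.0)  # rescale by the exact keep probability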

Create GPU reload buffers on main stream (#2131)

* Create GPU reload buffers on main stream

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Signed-off-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-ptyche01.ptyche.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-preos01.a51.clusters.nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>

minor code cleanup

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

minor cosmetics

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Address review comment

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

minor comment update

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

Fix CI failures for UB overlap changes (#2149)

Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>

minor bug fix: quantizer should not be None for unfused quantization

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135)

* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the skip message

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Assert in fused attn bwd pass for sm100

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

Add check for sm100

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support to get all devs in the process for jax

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code clean up

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* Represent attn bias using enum instead of string

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix linting error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (#2119)

* add noop to comp amax

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix for fp8 blockwise recipe

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* resolve comments

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* Update test_multi_process_distributed_grouped_gemm.py

revert a change accidentally added while merging

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* Update dense.py

revert a change accidentally added while merging

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address review comments

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Bug fixed: delayed scaling quantization with MXFP8 inputs didn't work

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the unit test error

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* just to trigger ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

* fix merge conflict

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

address review comments: quantization inside gemm and outside both should exactly match for fp32 accumulation

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add tutorial files and other local changes

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove extraneous code for easier debugging

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* make cuda graphs work with non-paged and paged attention

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* perf improvements for kv cache ops

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add code for calibration

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* optimize kv_cache reindex and copy kernels

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* changes to make quantizers work with fp8_calibration

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* avoid reindexing from python side

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename variable from previous commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* use quantizer only if needed

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* functionality of the tutorial tested and perf checked

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove files and update headers/licenses

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* update header/license

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tutorial for review

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make weights downloadable on the fly; remove extra print statements

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint and update comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add comma back, typo

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* sequence_start_positions should be None for training

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add paged attention numbers and update the requirements.txt file

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* more fixes

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make tutorial work on blackwell

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* remove gemma FT tutorial for now

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixing the headings placement and rewording attention -> kv caching

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fixes from comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the images

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* misc fixes

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add more comments to te_gemma.py and cleanup utils.py

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more information about the hierarchy of the classes used in the tutorial

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add better cuda graphs picture

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add updated cuda graphs pictures

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add illustrated cuda graphs

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* small fixes in documentation

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add torch.no_grad() to force reduced memory usage

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* some fixes from recent comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* more fixes from remaining comments

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* add te_rope_emb to class desc

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

* fix tutorial wording; add calibration fix to grouped_linear.py

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

---------

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
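
Since the tutorial commits above lean on CUDA Graphs for generation (and on torch.no_grad() to cut memory), here is a generic capture-and-replay decode sketch in plain PyTorch. It is illustrative only, a stand-in model rather than the tutorial's te_gemma.py:

    import torch

    # Fixed-shape static buffers are mandatory for CUDA Graphs.
    dec = torch.nn.Linear(64, 64).cuda()
    static_in = torch.zeros(1, 64, device="cuda")
    static_out = torch.zeros(1, 64, device="cuda")

    # Warm up on a side stream before capture, as CUDA Graphs requires.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            with torch.no_grad():
                static_out.copy_(dec(static_in))
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        with torch.no_grad():
            static_out.copy_(dec(static_in))

    for _ in range(8):                 # one replay per generated token
        static_in.copy_(static_out)    # feed the previous step back in
        g.replay()
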
…allel (#2125)

* fix memory overhead of all gather from sequence parallel

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/tensor/_internal/float8_blockwise_tensor_base.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* quick fix for the errors with UB buffers

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/pytorch/module/linear.py

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Avoid deallocating FP8 scale-invs since they are reused

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Yuzhong Wang <yuzhongw@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
… (#2045)

* feat: add cutlass group gemm support

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor multi tensor gemm interface

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: refactor nvte_multi_stream_cublas_gemm func and add license info

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add unit test for cutlass group gemm

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add type protection for CUTLASS support

Signed-off-by: Min Yang <min.yang@shopee.com>

* add tests and fix lint

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: fix unit test errors

Signed-off-by: Min Yang <min.yang@shopee.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: refactor host workspace malloc

Signed-off-by: Min Yang <min.yang@shopee.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* update cutlass

Signed-off-by: Xin Yao <xiny@nvidia.com>

* further relax threshold and add an env var to warn on fallback

Signed-off-by: Xin Yao <xiny@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Min Yang <min.yang@shopee.com>
Signed-off-by: Xin Yao <xiny@nvidia.com>
Signed-off-by: alan yang <89962857+cassiewilliam@users.noreply.github.com>
Co-authored-by: Min Yang <min.yang@shopee.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
feature(FA3,MLA,CP):
1. Update FA3 to commit-id 3ba6f82 (tag 2.8.0.post2 with the compile error fixed); PR-1604 adds support for hdimQK != hdimV in backward
2. Update the get_attention_backend method, since FA3 now supports MLA
3. Add CP MLA support for FA3
4. Add unit tests for FA3 MLA with CP
5. Update the attention doc

Signed-off-by: zhujian <zhujian.whu.cs@gmail.com>
…#2185)

* Fix cuDNN version checks for kv cache on sm89. Add a cuDNN version check in preparation for cuDNN 9.14 when getting the backend

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor fix for cuDNN version condition check

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
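
A tiny sketch of the version-gating pattern these attention-backend commits apply; the exact architecture and version numbers here are assumptions for illustration, not TE's actual conditions:

    # Hypothetical gate (assumed numbers/names): filter out a fused-attention
    # backend for device/cuDNN combinations known to be problematic.
    def fused_attn_supported(cudnn_version: tuple, sm_arch: int) -> bool:
        if sm_arch == 89 and cudnn_version < (9, 14, 0):
            return False  # wait for the cuDNN 9.14 fix on sm89
        return True
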
Fix build and UT issues
Upcoming ROCm and JAX 0.8 support - cherry-pick: 8e25035 03525d3 (#403)
          act_x = ActOP(x, {});
          dact_x = DActOP(x, {});
        }
        after_dact_reg[stage] = dact_x * grad_elt * gate_elt;

Collaborator:

What's the difference before and after your code change in this section?

-        after_dact_reg[stage] = dact_x * grad_elt * gate_elt;
-        after_dgate_reg[stage] = act_x * grad_elt;
+        after_act_elt = dact_x * grad_elt * gate_elt;
+        after_gate_elt = act_x * grad_elt;
+        after_dact_reg[stage] = after_act_elt;
+        after_dgate_reg[stage] = after_gate_elt;

It looks like all data are defined to be float32

Collaborator Author:

The changes introduce after_act_elt and after_gate_elt, which are used below.

        after_dgate_reg[stage] = after_gate_elt;
      } else {
-        after_dact_reg[stage] = ActOP(act_elt, {}) * gate_elt;
+        after_act_elt = ActOP(act_elt, {}) * gate_elt;

Collaborator:

Same here, not seeing any logical difference before and after

Collaborator Author:

The same reason as above

      if constexpr (IS_DGATED) {
        // dgate
-        float amax = fabsf(after_dgate_reg[stage]);
+        float amax = fabsf(after_gate_elt);

Collaborator:

Is after_gate_elt same as after_dgate_reg[stage]?

Collaborator Author:

No, it is not the same. Numerical truncation is performed above

      }
    }
-    float amax = fabsf(after_dact_reg[stage]);
+    float amax = fabsf(after_act_elt);

Collaborator:

Hmm, you could apply the downcast to after_dact_reg[stage] directly without adding after_act_elt. That would minimize the code change and also reduce the possibility of register spills.

Collaborator Author:

The truncated after_act_elt and after_dact_elt are used more than once, interleaved with the original after_dact_reg[stage], so without these variables the truncation would have to be done twice: once for ROWWISE and once for COLWISE.
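
A small Python model of that point, with bfloat16 standing in for the kernel's FP8 output type (illustrative, not the CUDA kernel):

    import torch

    # Truncate the fp32 product once, then reuse the truncated value everywhere
    # it is needed (amax, ROWWISE output, COLWISE output) instead of truncating
    # separately in each path.
    def downcast_once(dact_x, grad_elt, gate_elt, out_dtype=torch.bfloat16):
        full = dact_x * grad_elt * gate_elt          # fp32 value, cf. after_dact_reg
        after_act_elt = full.to(out_dtype).float()   # single truncation
        amax = after_act_elt.abs().max()             # amax of the truncated value
        rowwise = after_act_elt.to(out_dtype)        # both layouts quantize the
        colwise = after_act_elt.to(out_dtype)        # same truncated value
        return rowwise, colwise, amax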

Collaborator:

But both the NV kernel (https://github.com/ROCm/TransformerEngine/blob/a406914a9d17889fcc64d38e2cfed563a2b1a7b5/transformer_engine/common/util/cast_gated_kernels.cuh#L515C9-L515C26) and the gtest reference (cache_buffer_act[cached_idx] = after_dsilu;) downcast the value, save it to the cache, and use the cached values afterwards. In other words,

Collaborator Author (@ipanfilo) left a comment:

Dummy comment for dummy GitHub letting me submit my comments

@ipanfilo ipanfilo requested a review from wangye805 January 15, 2026 22:42

Collaborator:

In gtest reference, the downcasted cache is used:

    const size_t cached_idx = (i - i_min) * tile_size_X + (j - j_min);
    const float after_act = cache_buffer_act[cached_idx] * scale_reciprocal_act;
    if (IS_DGATED) {
      const float after_gate = cache_buffer_gate[cached_idx] * scale_reciprocal_gate;
      output_colwise[i * stride + j] = static_cast<OType>(after_act);
      output_colwise[i * stride + cols + j] = static_cast<OType>(after_gate);
    } else {
      output_colwise[i * cols + j] = static_cast<OType>(after_act);
    }

Same behavior in the NV kernel:

    const float scaled_out_act = block_scale_inverse_act * after_act_colwise[i];
    out_act_colwise_sh[shmem_offset_elt] = static_cast<OType>(scaled_out_act);

@ipanfilo ipanfilo requested a review from wangye805 January 17, 2026 02:47

Collaborator (@wenchenvincent) left a comment:

LGTM.
