
Conversation

@sudhu2k (Contributor) commented on Jan 15, 2026

Description

This PR enhances the GroupedLinear module with Triton kernel support for grouped GEMM operations. The implementation includes a complete Triton-based grouped matrix multiplication (GMM) backend that can be enabled via an environment variable, along with pre-tuned configurations for optimal performance.

  • Added support for using Triton kernels in GroupedLinear, enabled via the NVTE_USE_GROUPED_GEMM_TRITON environment variable (see the usage sketch after this list).
  • Updated the setup.py to include JSON configuration files for Triton kernels in the package data.
  • Added a new test case for grouped GEMM functionality in the CI pipeline.
  • Refactored the handling of input tensors and gradients to accommodate the new Triton kernel logic.
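
A minimal usage sketch of the new path. The environment variable is from this PR; the GroupedLinear constructor and forward arguments shown here follow the existing TE API as I understand it, so exact argument names may differ:

import os
import torch
import transformer_engine.pytorch as te

# Opt in to the Triton grouped GEMM backend added in this PR
# (the existing backend is used when the variable is unset).
os.environ["NVTE_USE_GROUPED_GEMM_TRITON"] = "1"

# GroupedLinear fuses num_gemms independent GEMMs that share one stacked input.
layer = te.GroupedLinear(num_gemms=4, in_features=1024, out_features=1024)

m_splits = [128, 256, 64, 512]  # rows of the stacked input belonging to each group
inp = torch.randn(sum(m_splits), 1024, device="cuda")
out = layer(inp, m_splits)      # output shape: (sum(m_splits), 1024)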

Benchmark results:

https://github.com/ROCm/frameworks-internal/issues/13792#issuecomment-3739558113
https://github.com/ROCm/frameworks-internal/issues/13792#issuecomment-3746418683

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added Triton kernel support for GroupedLinear: Implemented a complete Triton-based grouped GEMM backend with dynamic backend selection based on the NVTE_USE_GROUPED_GEMM_TRITON environment variable (a reference for what grouped GEMM computes is sketched after this list)
  • Added an optional m_splits_tensor parameter to keep the split sizes on the GPU and avoid redundant CPU-GPU data transfers for improved performance
  • New GMM (Grouped Matrix Multiplication) module from AITER: Added comprehensive Triton kernel implementation in transformer_engine/pytorch/triton_kernels/gmm/ from AITER including:
    • gmm_common.py: Common utilities and helper functions
    • gmm_kernels.py: Core Triton kernel implementations for grouped GEMM operations
    • gmm_wrapper.py: High-level wrapper functions from AITER
    • pid_preprocessing.py: Process ID preprocessing for efficient kernel scheduling
  • Pre-tuned configurations: Added JSON configuration files for AMD GPU architectures:
    • gfx942-GMM.json: pre-tuned configs for gfx942 arch
    • gfx950-GMM.json: pre-tuned configs for gfx950 arch
  • Updated setup.py: Modified package data to include JSON configuration files for Triton kernels
  • Enhanced GroupedLinear module: Refactored grouped_linear.py to support the Triton kernel path with proper tensor handling
  • Added grouped_gemm.py wrapper: Created a high-level interface in TE for grouped GEMM operations
  • Extended common utilities: Added Triton kernel support flags in triton_kernels/common.py
  • New test suite: Added comprehensive test cases (from AITER) in tests/pytorch/triton_kernels/test_grouped_gemm.py (516 lines)
  • CI integration: Updated ci/pytorch.sh to include grouped GEMM tests in the CI pipeline
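
For reference on what the new backend computes: grouped GEMM over m_splits is conceptually equivalent to splitting the stacked input row-wise and multiplying each chunk by its own weight matrix, as the existing torch.split-based path does. A minimal PyTorch sketch of that semantics (illustration only, not the Triton implementation; the Triton kernel performs the same computation without the Python-level loop over groups):

import torch

def grouped_gemm_reference(inp, weights, m_splits):
    # inp: (sum(m_splits), K) stacked input; weights: list of (N, K) matrices, one per group.
    chunks = torch.split(inp, m_splits)                      # one row-chunk per group
    outs = [chunk @ w.t() for chunk, w in zip(chunks, weights)]
    return torch.cat(outs, dim=0)                            # (sum(m_splits), N)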

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

sugovind added 2 commits January 15, 2026 16:23
…for JSON configs

@ipanfilo (Collaborator):

Update copyright date of modified files

run_default_fa 1 triton_kernels/test_cast_mxfp8.py
run_default_fa 1 triton_kernels/test_norm_common.py
run_default_fa 1 triton_kernels/test_norms.py
run_default_fa 1 triton_kernels/test_grouped_gemm.py

Collaborator:

Please move it two lines higher; alphabetical sort helps to find tests.
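
For clarity, the alphabetically sorted block would presumably read:

run_default_fa 1 triton_kernels/test_cast_mxfp8.py
run_default_fa 1 triton_kernels/test_grouped_gemm.py
run_default_fa 1 triton_kernels/test_norm_common.py
run_default_fa 1 triton_kernels/test_norms.py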

delay_wgrad_compute,
parallel_mode=None,
):
os.environ["NVTE_USE_GROUPED_GEMM_TRITON"] = "1"

Collaborator:

This env var won't be cleared if the test is skipped or fails.
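
One way to address this (a sketch; the fixture and test names below are placeholders, not code from this PR): pytest's monkeypatch fixture restores the environment at teardown even when the test is skipped or fails.

import pytest

@pytest.fixture
def grouped_gemm_triton_env(monkeypatch):
    # monkeypatch undoes the change after the test, including on skip or failure.
    monkeypatch.setenv("NVTE_USE_GROUPED_GEMM_TRITON", "1")

def test_grouped_linear_triton(grouped_gemm_triton_env):
    ...  # test body elided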

else:
inputmats = torch.split(cast_if_needed(inp_view, activation_dtype), m_splits)

if not use_grouped_gemm_triton:

Collaborator:

make it elif

group_sizes_list=kwargs.get("m_splits_list", []),
)

grad_biases = [None] * len(m_splits) if bias is None else bias

Collaborator:

m_splits.shape[0] or len(m_splits_list)?
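
A sketch of one way to make the count robust to either representation (assuming m_splits may arrive as a Python list on the default path or as a device tensor on the Triton path; torch is already imported in grouped_linear.py):

num_gemms = m_splits.shape[0] if torch.is_tensor(m_splits) else len(m_splits)
grad_biases = [None] * num_gemms if bias is None else bias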

package_data = {"": ["VERSION.txt"]}
package_data = {
"": ["VERSION.txt"],
"transformer_engine.pytorch.triton_kernels.gmm": ["configs/*.json"],

Collaborator:

They should be part of the PyTorch extension installation, not TE core.

_ = general_grouped_gemm(
general_grouped_gemm_func = general_grouped_gemm_triton if use_grouped_gemm_triton else general_grouped_gemm
# Prepare m_splits for each backend
m_splits_for_kernel = m_splits

Collaborator:

It may be more straightforward to keep m_splits as-is and add a mandatory parameter m_splits_tensor (or m_splits_for_kernel) to general_grouped_gemm_triton(), instead of swapping them here.
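
A sketch of the suggested call sites (the positional arguments are placeholders, not the actual general_grouped_gemm signature):

if use_grouped_gemm_triton:
    out = general_grouped_gemm_triton(
        weights, inputmats, outputs,        # placeholder arguments
        m_splits=m_splits,                  # unchanged Python list of group sizes
        m_splits_tensor=m_splits_tensor,    # device tensor required only by the Triton path
    )
else:
    out = general_grouped_gemm(
        weights, inputmats, outputs,        # placeholder arguments
        m_splits=m_splits,
    )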
