
Conversation

@sarthak-amd (Collaborator)

Kernel fixes:

  • Compute valid token count on host: denom = max(1, (target != ignore_idx).sum())
  • Scale loss and gradients by denom instead of n_rows
  • Add grad_output_stride parameter, compute dynamically: 1 if grad_output.numel() > 1 else 0
  • Add is_cg_capturable flag for CUDA graph compatibility
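
The denominator and stride rules above can be sketched in pure Python (the real kernel operates on torch tensors; the function names here are hypothetical, but the logic mirrors the PR description):

```python
# Pure-Python sketch of the host-side bookkeeping described above.
# In the actual kernel these operate on torch tensors; `target`,
# `ignore_idx`, and the stride rule mirror the PR description.

def valid_token_denominator(target, ignore_idx=-100):
    """Count tokens that contribute to the loss, clamped to 1 so the
    all-ignored case never divides by zero."""
    return max(1, sum(1 for t in target if t != ignore_idx))

def compute_grad_output_stride(grad_output_numel):
    """Stride 1 for an elementwise grad_output, 0 for a scalar that is
    broadcast across rows (matches `1 if grad_output.numel() > 1 else 0`)."""
    return 1 if grad_output_numel > 1 else 0
```

Scaling by the valid-token count rather than `n_rows` keeps the mean loss correct when some targets equal `ignore_idx`.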

Test improvements:

  • Add explicit loss value assertions vs torch.nn.CrossEntropyLoss
  • Use torch.square(loss) for non-trivial backward
  • Apply dtype-aware tolerances
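
The dtype-aware tolerance idea might look like the following sketch (the table values are illustrative assumptions, not the ones actually used in the test suite):

```python
# Hypothetical tolerance table: lower-precision dtypes get looser
# comparison bounds when checking against torch.nn.CrossEntropyLoss.
DTYPE_TOLERANCES = {
    "float32":  {"rtol": 1e-5, "atol": 1e-6},
    "float16":  {"rtol": 1e-3, "atol": 1e-4},
    "bfloat16": {"rtol": 1e-2, "atol": 1e-3},
}

def tolerances_for(dtype_name):
    """Return (rtol, atol) for a dtype, defaulting to the loosest bounds."""
    t = DTYPE_TOLERANCES.get(dtype_name, DTYPE_TOLERANCES["bfloat16"])
    return t["rtol"], t["atol"]
```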

A collaborator commented:

What is the purpose of this file change?

A collaborator replied:

It modifies the copyright date. The file is also maintained upstream, so unnecessary reformatting is avoided.

  rank (int): The rank of this device in the TP group.
  world_size (int): The size of world involved in this distributed loss calculation.
- ignore_idx (int): Tokens to be ignored for loss and gradient calculation.
+ ignore_idx (int): Tokens to be ignored for loss and gradient calculation. (default -100)
A collaborator commented:

There is no default here.


- def cross_entropy_backward(_input: torch.Tensor, grad_output: torch.Tensor):
+ def cross_entropy_backward(
+     _input: torch.Tensor, grad_output: torch.Tensor, is_cg_capturable: bool = False
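
A minimal stub of what the new flag controls (the body is hypothetical; the point is that CUDA graph capture forbids host-side synchronization, such as reading the valid-token count back with `.item()`):

```python
def cross_entropy_backward(_input, grad_output, is_cg_capturable=False):
    """Stub illustrating the two code paths implied by the flag."""
    if is_cg_capturable:
        # Under CUDA graph capture, the loss denominator must stay on
        # device; no host-side reads of device tensors are allowed.
        return "scale on device"
    # Outside capture, the denominator may be read back to the host.
    return "scale on host or device"
```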
A collaborator commented:

This code conflicts with the upcoming IFU 2.8.

@wenchenvincent left a comment:

@sarthak-amd As mentioned in the previous PR, could you refactor the PR as 3 commits:

@ipanfilo (Collaborator) replied:

@sarthak-amd As mentioned in the previous PR, could you refactor the PR as 3 commits:

* 2 commits would be cherrypicking from the upstream PRs. ([NVIDIA/TransformerEngine#1879](https://github.com/NVIDIA/TransformerEngine/pull/1879), [NVIDIA/TransformerEngine#2139](https://github.com/NVIDIA/TransformerEngine/pull/2139))

* 1 commit for the ignore_idx with a test to cover it.
  This way the PR would be very clear and easy to understand.

One of the aforementioned PRs is part of IFU 2.6, i.e. already in ROCm TE; the other is part of IFU 2.8.
