
Conversation

@sarthak-amd (Collaborator)

Kernel fixes:

  • Compute valid token count on host: denom = max(1, (target != ignore_idx).sum())
  • Scale loss and gradients by denom instead of n_rows
  • Add grad_output_stride parameter, compute dynamically: 1 if grad_output.numel() > 1 else 0
  • Add is_cg_capturable flag for CUDA graph compatibility
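
The denominator and stride rules above can be sketched in pure Python (the real kernel operates on torch tensors; the function names here are hypothetical, but the logic mirrors the PR description):

```python
# Pure-Python sketch of the host-side bookkeeping described above.
# In the actual kernel these operate on torch tensors; `target`,
# `ignore_idx`, and the stride rule mirror the PR description.

def valid_token_denominator(target, ignore_idx=-100):
    """Count tokens that contribute to the loss, clamped to 1 so the
    all-ignored case never divides by zero."""
    return max(1, sum(1 for t in target if t != ignore_idx))

def compute_grad_output_stride(grad_output_numel):
    """Stride 1 for an elementwise grad_output, 0 for a scalar that is
    broadcast across rows (matches `1 if grad_output.numel() > 1 else 0`)."""
    return 1 if grad_output_numel > 1 else 0
```

Scaling by the valid-token count rather than `n_rows` keeps the mean loss correct when some targets equal `ignore_idx`.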

Test improvements:

  • Add explicit loss value assertions vs torch.nn.CrossEntropyLoss
  • Use torch.square(loss) for non-trivial backward
  • Apply dtype-aware tolerances
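
The dtype-aware tolerance idea might look like the following sketch (the table values are illustrative assumptions, not the ones actually used in the test suite):

```python
# Hypothetical tolerance table: lower-precision dtypes get looser
# comparison bounds when checking against torch.nn.CrossEntropyLoss.
DTYPE_TOLERANCES = {
    "float32":  {"rtol": 1e-5, "atol": 1e-6},
    "float16":  {"rtol": 1e-3, "atol": 1e-4},
    "bfloat16": {"rtol": 1e-2, "atol": 1e-3},
}

def tolerances_for(dtype_name):
    """Return (rtol, atol) for a dtype, defaulting to the loosest bounds."""
    t = DTYPE_TOLERANCES.get(dtype_name, DTYPE_TOLERANCES["bfloat16"])
    return t["rtol"], t["atol"]
```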

A collaborator commented:

What is the purpose of this file change?

A collaborator replied:

It modifies the copyright date. The file is also maintained upstream, so unnecessary reformatting is avoided.

  rank (int): The rank of this device in the TP group.
  world_size (int): The size of world involved in this distributed loss calculation.
- ignore_idx (int): Tokens to be ignored for loss and gradient calculation.
+ ignore_idx (int): Tokens to be ignored for loss and gradient calculation. (default -100)
A collaborator commented:

There is no default here.


- def cross_entropy_backward(_input: torch.Tensor, grad_output: torch.Tensor):
+ def cross_entropy_backward(
+     _input: torch.Tensor, grad_output: torch.Tensor, is_cg_capturable: bool = False
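
A minimal stub of what the new flag controls (the body is hypothetical; the point is that CUDA graph capture forbids host-side synchronization, such as reading the valid-token count back with `.item()`):

```python
def cross_entropy_backward(_input, grad_output, is_cg_capturable=False):
    """Stub illustrating the two code paths implied by the flag."""
    if is_cg_capturable:
        # Under CUDA graph capture, the loss denominator must stay on
        # device; no host-side reads of device tensors are allowed.
        return "scale on device"
    # Outside capture, the denominator may be read back to the host.
    return "scale on host or device"
```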
A collaborator commented:

This code conflicts with the upcoming IFU 2.8.

@wenchenvincent left a comment:

@sarthak-amd As mentioned in the previous PR, could you refactor the PR as 3 commits:

@ipanfilo (Collaborator) replied:

@sarthak-amd As mentioned in the previous PR, could you refactor the PR as 3 commits:

* 2 commits would be cherrypicking from the upstream PRs. ([NVIDIA/TransformerEngine#1879](https://github.com/NVIDIA/TransformerEngine/pull/1879), [NVIDIA/TransformerEngine#2139](https://github.com/NVIDIA/TransformerEngine/pull/2139))

* 1 commit for the ignore_idx with a test to cover it.
  This way the PR would be very clear and easy to understand.

One of the aforementioned PRs is part of IFU 2.6, i.e. already in ROCm TE; the other is part of IFU 2.8.
