
Using cuda_graph, memory increases a lot and OOMs at torch.autograd.grad #2543

@chengmengli06

Description


Using an H200 with 140 GB of memory: when CUDA graphs are not used, GPU memory occupancy is only about 40%, but with CUDA graphs enabled training always OOMs inside make_graphed_callables (full traceback below).

transformer_engine 2.8.0.dev0+c47f329
torch 2.8.0a0+5228986c39.nv25.6
Megatron-LM: commit=4193f3aa9b7d8932d286a6753f0e62467404fbb8
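
For reference, the failing call boils down to the pattern below: TE's make_graphed_callables replays warmup forward/backward passes (the torch.autograd.grad call in the traceback) for every callable before recording the CUDA graphs, so the warmup activations of all graphed layers are resident at the same time. This is only a minimal sketch, with a made-up te.Linear layer and made-up shapes standing in for the Megatron transformer layers:

import torch
import transformer_engine.pytorch as te

# Hypothetical stand-in for one Megatron transformer layer (made-up sizes).
layer = te.Linear(1024, 1024).cuda()

# Static sample input for one microbatch; requires_grad so the warmup
# backward pass (torch.autograd.grad inside _make_graphed_callables) runs.
sample_args = (torch.randn(8, 1024, device="cuda", requires_grad=True),)

# Capture: warmup fwd/bwd passes run first, then the CUDA graph is recorded.
# This capture step is where the OOM in the traceback below occurs.
graphed_layer = te.make_graphed_callables(layer, sample_args)

# After capture, the graphed module is used like the original one.
out = graphed_layer(*sample_args)
out.sum().backward()

In Megatron, cuda_graph_helper.create_cudagraphs() makes this call once over all flattened layer callables, as the traceback shows.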

[rank9]: Traceback (most recent call last):
[rank9]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank9]: pretrain(
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank9]: iteration, num_floating_point_operations_so_far = train(
[rank9]: ^^^^^^
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank9]: cuda_graph_helper.create_cudagraphs()
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank9]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank9]: graphed_callables = _make_graphed_callables(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank9]: grad_inputs = torch.autograd.grad(
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank9]: result = _engine_run_backward(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank9]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank9]: return user_fn(self, *args)
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank9]: dq, dk, dv, *rest = fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank9]: output_tensors = tex.fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^^^^^
[rank9]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB. GPU 1 has a total capacity of 139.81 GiB of which 153.19 MiB is free. Process 4788 has 139.64 GiB memory in use. Of the allocated memory 131.05 GiB is allocated by PyTorch, and 2.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank10]: Traceback (most recent call last):
[rank10]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank10]: pretrain(
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank10]: iteration, num_floating_point_operations_so_far = train(
[rank10]: ^^^^^^
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank10]: cuda_graph_helper.create_cudagraphs()
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank10]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank10]: graphed_callables = _make_graphed_callables(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank10]: grad_inputs = torch.autograd.grad(
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank10]: result = _engine_run_backward(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank10]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank10]: return user_fn(self, *args)
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank10]: dq, dk, dv, *rest = fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank10]: output_tensors = tex.fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 139.81 GiB of which 11.19 MiB is free. Process 4789 has 139.78 GiB memory in use. Of the allocated memory 131.07 GiB is allocated by PyTorch, and 2.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1225 16:13:33.027000 28 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 93 closing signal SIGTERM
Stack (most recent call first):
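
As a side note, the allocator message above suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A sketch of how that could be set (assuming it is exported before the first CUDA allocation, e.g. in the launch script or at the very top of pretrain_gpt.py), though with ~131 GiB already allocated by PyTorch it would at best reduce fragmentation:

import os

# Allocator setting suggested by the OOM message above; it must be in the
# environment before the first CUDA allocation, so set it in the launch
# script or before torch is imported -- not mid-run.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place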
