
Using cuda_graph, memory increases a lot and OOMs at torch.autograd.grad #2543

@chengmengli06

Description


Using an H200 with 140 GB of memory: when CUDA graphs are not used, GPU memory occupancy is only about 40%, but with CUDA graphs enabled training always OOMs inside make_graphed_callables (full traceback below).

transformer_engine 2.8.0.dev0+c47f329
torch 2.8.0a0+5228986c39.nv25.6
Megatron-LM: commit=4193f3aa9b7d8932d286a6753f0e62467404fbb8
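
For reference, the failing call boils down to the pattern below: TE's make_graphed_callables replays warmup forward/backward passes (the torch.autograd.grad call in the traceback) for every callable before recording the CUDA graphs, so the warmup activations of all graphed layers are resident at the same time. This is only a minimal sketch, with a made-up te.Linear layer and made-up shapes standing in for the Megatron transformer layers:

import torch
import transformer_engine.pytorch as te

# Hypothetical stand-in for one Megatron transformer layer (made-up sizes).
layer = te.Linear(1024, 1024).cuda()

# Static sample input for one microbatch; requires_grad so the warmup
# backward pass (torch.autograd.grad inside _make_graphed_callables) runs.
sample_args = (torch.randn(8, 1024, device="cuda", requires_grad=True),)

# Capture: warmup fwd/bwd passes run first, then the CUDA graph is recorded.
# This capture step is where the OOM in the traceback below occurs.
graphed_layer = te.make_graphed_callables(layer, sample_args)

# After capture, the graphed module is used like the original one.
out = graphed_layer(*sample_args)
out.sum().backward()

In Megatron, cuda_graph_helper.create_cudagraphs() makes this call once over all flattened layer callables, as the traceback shows.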

[rank9]: Traceback (most recent call last):
[rank9]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank9]: pretrain(
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank9]: iteration, num_floating_point_operations_so_far = train(
[rank9]: ^^^^^^
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank9]: cuda_graph_helper.create_cudagraphs()
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank9]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank9]: graphed_callables = _make_graphed_callables(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank9]: grad_inputs = torch.autograd.grad(
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank9]: result = _engine_run_backward(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank9]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank9]: return user_fn(self, *args)
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank9]: dq, dk, dv, *rest = fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank9]: output_tensors = tex.fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^^^^^
[rank9]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB. GPU 1 has a total capacity of 139.81 GiB of which 153.19 MiB is free. Process 4788 has 139.64 GiB memory in use. Of the allocated memory 131.05 GiB is allocated by PyTorch, and 2.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank10]: Traceback (most recent call last):
[rank10]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank10]: pretrain(
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank10]: iteration, num_floating_point_operations_so_far = train(
[rank10]: ^^^^^^
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank10]: cuda_graph_helper.create_cudagraphs()
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank10]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank10]: graphed_callables = _make_graphed_callables(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank10]: grad_inputs = torch.autograd.grad(
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank10]: result = _engine_run_backward(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank10]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank10]: return user_fn(self, *args)
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank10]: dq, dk, dv, *rest = fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank10]: output_tensors = tex.fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 139.81 GiB of which 11.19 MiB is free. Process 4789 has 139.78 GiB memory in use. Of the allocated memory 131.07 GiB is allocated by PyTorch, and 2.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1225 16:13:33.027000 28 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 93 closing signal SIGTERM
Stack (most recent call first):
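
As a side note, the allocator message above suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A sketch of how that could be set (assuming it is exported before the first CUDA allocation, e.g. in the launch script or at the very top of pretrain_gpt.py), though with ~131 GiB already allocated by PyTorch it would at best reduce fragmentation:

import os

# Allocator setting suggested by the OOM message above; it must be in the
# environment before the first CUDA allocation, so set it in the launch
# script or before torch is imported -- not mid-run.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place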
