Description
Using H200 GPUs (140 GB memory each): without CUDA graphs, GPU memory occupancy is only about 40%, but with CUDA graphs enabled, training always OOMs inside make_graphed_callables:
transformer_engine 2.8.0.dev0+c47f329
torch 2.8.0a0+5228986c39.nv25.6
Megatron-LM: commit=4193f3aa9b7d8932d286a6753f0e62467404fbb8
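(For reference, the occupancy numbers above can be checked with a small helper like the one below; the helper name and the call site are illustrative, not part of Megatron-LM.)

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print allocated/reserved/used memory for the current CUDA device."""
    device = torch.cuda.current_device()
    free, total = torch.cuda.mem_get_info(device)
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    print(
        f"[{tag}] allocated={allocated / 2**30:.1f} GiB, "
        f"reserved={reserved / 2**30:.1f} GiB, "
        f"device used={(total - free) / 2**30:.1f} / {total / 2**30:.1f} GiB"
    )

# Example call site: just before cuda_graph_helper.create_cudagraphs() in
# megatron/training/training.py, and again after it if capture succeeds.
log_gpu_memory("before graph capture")
```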
[rank9]: Traceback (most recent call last):
[rank9]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank9]: pretrain(
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank9]: iteration, num_floating_point_operations_so_far = train(
[rank9]: ^^^^^^
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank9]: cuda_graph_helper.create_cudagraphs()
[rank9]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank9]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank9]: graphed_callables = _make_graphed_callables(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank9]: grad_inputs = torch.autograd.grad(
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank9]: result = _engine_run_backward(
[rank9]: ^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank9]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank9]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank9]: return user_fn(self, *args)
[rank9]: ^^^^^^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank9]: dq, dk, dv, *rest = fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^
[rank9]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank9]: output_tensors = tex.fused_attn_bwd(
[rank9]: ^^^^^^^^^^^^^^^^^^^
[rank9]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 194.00 MiB. GPU 1 has a total capacity of 139.81 GiB of which 153.19 MiB is free. Process 4788 has 139.64 GiB memory in use. Of the allocated memory 131.05 GiB is allocated by PyTorch, and 2.49 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank10]: Traceback (most recent call last):
[rank10]: File "/mnt/workspace/Megatron-LM/pretrain_gpt.py", line 260, in
[rank10]: pretrain(
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 787, in pretrain
[rank10]: iteration, num_floating_point_operations_so_far = train(
[rank10]: ^^^^^^
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/training/training.py", line 2422, in train
[rank10]: cuda_graph_helper.create_cudagraphs()
[rank10]: File "/mnt/workspace/Megatron-LM/megatron/core/transformer/cuda_graphs.py", line 1839, in create_cudagraphs
[rank10]: graphs = make_graphed_callables(tuple(self.flattened_callables), sample_args, **kwargs)
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 990, in make_graphed_callables
[rank10]: graphed_callables = _make_graphed_callables(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/graph.py", line 399, in _make_graphed_callables
[rank10]: grad_inputs = torch.autograd.grad(
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/init.py", line 502, in grad
[rank10]: result = _engine_run_backward(
[rank10]: ^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank10]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank10]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 307, in apply
[rank10]: return user_fn(self, *args)
[rank10]: ^^^^^^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 1265, in backward
[rank10]: dq, dk, dv, *rest = fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^
[rank10]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/fused_attn.py", line 448, in fused_attn_bwd
[rank10]: output_tensors = tex.fused_attn_bwd(
[rank10]: ^^^^^^^^^^^^^^^^^^^
[rank10]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 2 has a total capacity of 139.81 GiB of which 11.19 MiB is free. Process 4789 has 139.78 GiB memory in use. Of the allocated memory 131.07 GiB is allocated by PyTorch, and 2.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1225 16:13:33.027000 28 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 93 closing signal SIGTERM
Stack (most recent call first):
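The error message itself suggests trying PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of that experiment, assuming the variable is set before anything initializes the CUDA allocator (exporting it in the launch script works the same way):

```python
import os

# Must be in the environment before the CUDA caching allocator makes its first
# allocation, i.e. before anything touches the GPU; alternatively export
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the launch script.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (deliberately imported after setting the env var)
```

That said, only ~2.5 GiB is reserved-but-unallocated in the reports above, so this looks less like fragmentation than like graph capture genuinely needing more memory than is available.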