
@jingxu9x
Contributor

At the end of the run, rank 0 saves the checkpoint while the other ranks call `destroy_process_group` immediately. This causes rank 0 to raise an exception like:

[rank6]:[E722 07:46:47.681877747 ProcessGroupNCCL.cpp:542] [Rank 6] Collective WorkNCCL(SeqNum=1184, OpType=ALLREDUCE, NumelIn=18816, NumelOut=18816, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.23.4
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:

To fix this, all ranks need to call `destroy_process_group` at the same time.
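One common way to line the ranks up is to put a barrier between the checkpoint save and the teardown. The following is only a minimal sketch of that idea, not necessarily the change made in this PR: `save_and_shutdown` and `save_checkpoint` are hypothetical names, and the `dist` parameter exists only so the function can be exercised without a real process group. It assumes the process group has already been initialized.

```python
def save_and_shutdown(save_checkpoint, dist=None):
    """Save on rank 0, then tear down the process group on all ranks together.

    `save_checkpoint` is a hypothetical zero-argument callable supplied by the
    training script. `dist` defaults to `torch.distributed`; passing another
    object with the same three methods is an assumption made for testing.
    """
    if dist is None:
        # Imported lazily so the sketch can be read and tested without PyTorch.
        import torch.distributed as dist

    if dist.get_rank() == 0:
        # Only rank 0 writes the checkpoint.
        save_checkpoint()

    # Every rank waits here until rank 0 has finished saving.
    dist.barrier()

    # All ranks now destroy the process group at roughly the same time.
    dist.destroy_process_group()
```

Each rank would call `save_and_shutdown(...)` as the last step of the training script. The barrier is itself a collective, so if the save can take longer than the process group timeout (600000 ms in the log above), the timeout would likely need to be raised as well.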

@jingxu9x jingxu9x requested review from photoszzt and suyuee July 22, 2025 08:09
@suyuee suyuee merged commit 8894f6c into main Jul 22, 2025
1 check failed
@suyuee suyuee deleted the fix/pytorch_example_raise_system_error branch July 22, 2025 08:12

@suyuee suyuee left a comment


LGTM


3 participants