add collective dependencies as data_deps #191
Conversation
Hi @theodorbadea. Thank you for this PR. I would also like to better understand the redundant check.
Hey @JoongunPark, thanks for your feedback on this PR. That check is redundant because child nodes are only added to the stack if they are not in the visited list:
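For illustration, a minimal sketch of the traversal pattern in question (not the actual converter code; the node fields are assumed). Because children are marked visited at push time, a second membership check after popping can never trigger:

```python
def traverse(root):
    """Iterative DFS that marks nodes visited when they are pushed."""
    visited = {id(root)}
    stack = [root]
    while stack:
        node = stack.pop()
        # process(node) ...
        for child in node.children:
            if id(child) not in visited:  # the only check that is needed
                visited.add(id(child))
                stack.append(child)
```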
Ah, yes. You are right. I agree that the original logic seems a bit redundant and hard to understand.
Thank you for your patience, @theodorbadea! And I am sorry for the delayed testing. Result log from ASTRA-Sim:
After discussing it with the group in our last Chakra meeting, we concluded that this PR needs further discussion.
@theodorbadea @JoongunPark |
I examined a trace collected from data-parallel distributed training and found that when different processes belonging to the same communication group perform an allreduce, the recorded pg_name may not be equal. Therefore, it is not entirely accurate to infer the dependency from pg_name. kineto_trace_zbpp_gpt.0.json
@9LLPPLL6 My understanding is that two collectives from the same PG cannot overlap; in reality, at least in NCCL's case, they do not, because NCCL takes care of scheduling them. We ran into overlapping collectives in emulation, which is why I proposed this change: to capture these dependencies explicitly. There have been opinions that the engine should be responsible for this scheduling. In your traces, did you notice any overlaps, in terms of start time plus duration, between collectives in the same PG or between collectives from different PGs?
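For reference, a rough sketch of how one could scan a Kineto trace for such overlaps (a sketch only; the name filter and the process-group key in args are assumptions and may differ across PyTorch versions):

```python
import json
from itertools import combinations

PG_KEY = "Process Group Name"  # assumed args key; verify against your trace

def find_overlapping_collectives(path):
    """Return pairs of collective events whose [ts, ts + dur) intervals overlap."""
    with open(path) as f:
        events = json.load(f)["traceEvents"]
    colls = [e for e in events if "nccl" in e.get("name", "").lower() and "dur" in e]
    overlaps = []
    for a, b in combinations(colls, 2):  # O(n^2), fine for a quick check
        if a["ts"] < b["ts"] + b["dur"] and b["ts"] < a["ts"] + a["dur"]:
            same_pg = a.get("args", {}).get(PG_KEY) == b.get("args", {}).get(PG_KEY)
            overlaps.append((a["name"], b["name"], same_pg))
    return overlaps
```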
Perhaps there was a gap in my understanding. Thank you for your answer; I understand what you mean.
Force-pushed from fef88bb to 89d8266.
Closing the PR for now. The consensus was to keep the assumption that one process group can issue only one collective at any given time. Let's reopen if we want to revisit.



Currently, there is no explicit dependency between collectives belonging to the same process group. Their sequential execution is inferred only from the duration of their parent compute node and their own duration.
This PR proposes appending the previous collective belonging to the same PG to the data_deps list.
I am attaching an example containing five All_Reduce collectives in the same PG. Before the proposed change, each collective had only one node in its data_deps, i.e., the parent COMP_NODE named "record_param_comms". After the change, the first All_Reduce in the PG remains unchanged, but each subsequent one has two nodes in data_deps: the COMP_NODE "record_param_comms" and the previous collective communication node.
I am also attaching part of a diagram showing the collectives without explicit dependencies (they are all leaf nodes) and with the newly added dependencies. The cropped-out nodes above the collectives are the parent COMP_NODE.
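For illustration, a minimal sketch of the idea (assumed node shape with `id`, `pg_name`, and `data_deps` attributes; not the actual converter code):

```python
def link_collectives_by_pg(collective_nodes):
    """Append the previous collective of the same process group to each node's data_deps."""
    last_per_pg = {}
    for node in collective_nodes:  # assumed to be in issue order for one rank
        prev = last_per_pg.get(node.pg_name)
        if prev is not None and prev not in node.data_deps:
            node.data_deps.append(prev)
        last_per_pg[node.pg_name] = node.id
```

The first collective in each PG keeps its original data_deps; every later one gains one extra dependency, matching the attached new_chakra_jsonized.json example.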


kineto.json
new_chakra_jsonized.json
original_chakra_jsonized.json
p_trace_rank=0_.json