Fix Chakra Errors and Update ETFeeder #2

JoongunPark · 2025-01-20T19:23:32Z

Summary

This PR addresses multiple issues in the Chakra converter:

1. Improper Handling of NCCL All-to-All Communication

Chakra incorrectly distinguishes between point-to-point and collective communication. In NCCL, all-to-all is implemented as point-to-point communication, but Chakra's current logic treats these as distinct, leading to an incorrect type for PyTorchNode. More details on NCCL point-to-point can be found here.

2. Logging Inconsistency

Logging levels were mismatched: sync dependencies were logged via logging.info, while other dependencies used logging.debug. This PR resolves the inconsistency by logging sync dependencies via logging.debug as well.
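The effect of putting every dependency message at the same level can be sketched as below. This is an illustration only: the logger name, the `ListHandler` class, and `log_dependencies` are hypothetical and are not part of Chakra.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("chakra_logging_sketch")


class ListHandler(logging.Handler):
    """Collects (level name, message) pairs so the output can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


def log_dependencies(log, sync_deps, other_deps):
    """Log sync and non-sync dependencies at the same DEBUG level."""
    for dep in sync_deps:
        log.debug("sync dependency: %s", dep)
    for dep in other_deps:
        log.debug("dependency: %s", dep)
```

With the logger set to INFO, none of these messages are emitted; with DEBUG, all of them are, so sync dependencies no longer stand out from the rest.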

3. False Positive Dependencies from HTA

HTA returns false positives for sync dependencies, producing invalid dependencies that point from a later op to an earlier op. These cause the Chakra converter to fail on certain traces in two ways:

- Cycle dependencies
- Stack overflows (due to call stacks exceeding 1000 levels)

This PR eliminates the false positive sync dependencies.
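A minimal sketch of the two safeguards this item implies, not the code in this PR. The function names, the `start_ts` mapping, and the `(src, dst)` edge convention (dst waits for src) are all assumptions made for illustration. The first function drops sync dependencies that point backward in time; the second checks for cycles with an explicit stack, so deep graphs do not hit CPython's default recursion limit of 1000 frames.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (src, dst): dst must wait for src (assumed convention)


def drop_backward_sync_deps(start_ts: Dict[int, int], sync_deps: List[Edge]) -> List[Edge]:
    """Keep only dependencies whose source op starts before its target op."""
    return [(src, dst) for src, dst in sync_deps if start_ts[src] < start_ts[dst]]


def has_cycle(edges: List[Edge]) -> bool:
    """Iterative three-color DFS; returns True if the edges contain a cycle."""
    graph: Dict[int, List[int]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph[root]))]
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:
                # All children visited: this node is finished.
                color[node] = BLACK
                stack.pop()
            elif color[child] == GRAY:
                # Back edge to a node still on the stack: cycle found.
                return True
            elif color[child] == WHITE:
                color[child] = GRAY
                stack.append((child, iter(graph[child])))
    return False
```

Filtering first and then checking for cycles keeps the check cheap, and the explicit stack means a dependency chain thousands of ops long is handled without raising RecursionError.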

4. Update trace_linker to use external_id for finding GPU op's parent CPU op

During trace linking, many GPU operations were matched with the wrong parent CPU op.
This PR fixes the problem by matching on external_id instead of ev_idx.
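As a rough illustration of matching by external_id, the sketch below indexes CPU ops by that ID and looks up each GPU op directly. The dictionaries and the `name` field are simplified, hypothetical stand-ins for trace events; they are not the trace_linker's actual data structures.

```python
from typing import Dict, List, Optional


def link_gpu_ops_to_cpu_parents(
    cpu_ops: List[dict], gpu_ops: List[dict]
) -> Dict[str, Optional[str]]:
    """Map each GPU op name to the name of the CPU op sharing its external_id."""
    # Index CPU ops once so each GPU op lookup is a constant-time dict access.
    cpu_by_external_id = {op["external_id"]: op for op in cpu_ops}

    parents: Dict[str, Optional[str]] = {}
    for gpu_op in gpu_ops:
        parent = cpu_by_external_id.get(gpu_op["external_id"])
        # A GPU op with no matching CPU op is recorded as unlinked (None).
        parents[gpu_op["name"]] = parent["name"] if parent is not None else None
    return parents
```

Because the lookup is keyed on an ID carried by both events, it does not depend on the order in which events appear in the trace.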

5. Handling HTA Errors in Chakra

The trace linker was terminating unexpectedly due to errors in HTA. Although this may stem from trace inconsistencies, the issue does not occur when HTA is excluded.
Chakra now handles these errors by raising exceptions instead of terminating the trace linker.
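One way to read "raising exceptions instead of terminating" is sketched below. This is an assumption about the general pattern, not the PR's implementation: `SyncDependencyError`, `load_sync_dependencies`, `link_sync_dependencies`, and the `analyze` callable are hypothetical names, and `analyze` stands in for whatever HTA call produces the sync dependencies.

```python
import logging
from typing import Callable, Dict, List

SyncDeps = Dict[int, List[int]]


class SyncDependencyError(RuntimeError):
    """Raised when the (hypothetical) HTA analysis step fails."""


def load_sync_dependencies(analyze: Callable[[], SyncDeps]) -> SyncDeps:
    """Run the analysis and convert any failure into a catchable exception."""
    try:
        return analyze()
    except Exception as exc:
        raise SyncDependencyError(f"HTA analysis failed: {exc}") from exc


def link_sync_dependencies(analyze: Callable[[], SyncDeps]) -> SyncDeps:
    """Continue linking without sync dependencies if the analysis fails."""
    try:
        return load_sync_dependencies(analyze)
    except SyncDependencyError as err:
        logging.warning("Skipping sync dependencies: %s", err)
        return {}
```

Raising a dedicated exception leaves the decision to the caller, which can fall back to an empty result as here or surface the error to the user.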

6. Proper Encoding of pg_name in Collective Operations

Following updates on the PyTorch side, SendRecv, Reduce-Scatter, and All-Gather operations did not encode pg_name correctly.
Chakra now encodes pg_name properly for these operations.

7. Getter in ETFeeder

ETFeeder now provides getter functions for a node's I/O attributes.
The I/O attributes include the value, shape, and type of the node's inputs and outputs.

Note that this feature is also required in other Feeder code (json_node.cpp, json_node.h, wrapper_node.cpp, wrapper_node.h), which can be done after we decide the details of the JSON format.

Test Plan

I tested the fixes using Mixtral 8x3B traces collected with the NeMo framework (NVIDIA).
traces_device_0.zip

#!/bin/bash
# Set the result path. Use a name other than PATH so the shell can still find
# the chakra_* commands, and $HOME because "~" does not expand inside quotes.
RESULT_DIR="$HOME/scratch/results/mixtral_8x3b/results"

# Loop through trace ranks
for i in 0
do
    echo "Start linking trace: $i"
    chakra_trace_link \
        --chakra-host-trace "$RESULT_DIR/host_$i.json" \
        --chakra-device-trace "$RESULT_DIR/device_$i.json" \
        --rank "$i" \
        --output-file "$RESULT_DIR/rank_$i.json"

    echo "Start converting trace: $i"
    chakra_converter PyTorch \
        --input "$RESULT_DIR/rank_$i.json" \
        --output "$RESULT_DIR/rank_$i.et"
done

rvinaybharadwaj and others added 30 commits October 7, 2024 17:44

Add JSON support + wrapper

1559bcf

Rebasing with main

649c37b

adding json test data

8e311bc

adding wrapper tests

6054b0b

code cleanup: make class members private

f93765f

Updating tests

8351938

changing datatypes to match protobuf

9796253

adding missed datatype changes

0ce0ae2

updating WrapperTests datatypes

67d29cd

removing involved dims and minor bug fix

045bd6f

fixing include path in et_feeder

e5a4627

add missing break statement

46d42d6

minor bug fixes

69dffb1

fix include path

3b7a0ad

fix lint errors

6cbbc4f

merging et_feeder_node

124de08

adding install setuptools to github workflows

274826f

updating workflows

21417d8

updating workflows

ad998fc

updating cpp_lint.yml

bf61b64

updating clang-format version

2474103

updating clang-format version

3a7c2c9

fix lint errors

7756ec2

Fix rebase error

f439bf4

addressing reviewer comments

58fda3e

fix lint errors

61c74d8

fix lint errors

f803f33

Specify the kineto filepath explicitly when running HTA analysis

6ed6e66

Without specifying the kineto filepath explicitly, HTA may pick arbitrary files from the `trace_dir` and either provide incorrect analysis results, or fail in some weird ways.

Merge pull request mlcommons#167 from flexaihq/alexdenisov/specify-kineto-file-explicitly

b915ab8

Specify the kineto filepath explicitly when running HTA analysis

Merge pull request mlcommons#145 from rvinaybharadwaj/jsonify

9247489

JSON format for Chakra ET

willjwon and others added 11 commits January 20, 2025 14:07

Update is_cpu_op to default to false

4fb397e

Fix mishandling All-to-All communication

40ce3be

Update logging.info to logging.debug to make it consistent

470bde1

Eliminate false positive sync dependency

ed7e286

PyTorch nightly needs to support 1.1.1-chakra.0.0.4.

b3dca0b

Get pg_name from record_param_comms for collectives

cc660d0

Update trace_linker to use external_id for finding GPU op's parent CPU op

c7c7c05

Handling HTA Errors in Chakra

6d8dea8

Fix error encoding METADATA node

f51050d

Implement getter functions for nodes' inputs/outputs

810ff88

Merge branch 'main' into develop

7883df1
