
NCCL:Broadcast collectives are missing from the converted trace but present in the trace_link #161

@alexseceks

Describe the Bug

After running a ResNet50 or TinyLlama2 workload on 4 ranks, I see at least one nccl:broadcast collective in the Kineto trace. The same collective appears in the trace_link file, but it is no longer present in the converted trace. Is this expected behavior, or is it an issue on the Chakra converter side?

I looked through the converter implementation but found nothing indicating that broadcast collectives should be dropped. Is there something I missed?

Steps to Reproduce

Using the Chakra version from 6 Sept, after the merge of commit #140.

Expected Behavior

See the nccl:broadcast collective in the converted trace.

Screenshots

This is the trace_link file; the broadcast collective is present.
Screenshot 2024-10-16 at 14 15 15
This is the converted trace in JSON format. No broadcast collective can be found; the search result is at the bottom of the picture.
Screenshot 2024-10-16 at 14 17 39
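
For anyone who wants to repeat the check from the screenshots without a text editor, here is a minimal sketch. It assumes both files are JSON text in which each node carries a "name" field, and it scans the raw text rather than parsing it, so it makes no further assumptions about either schema. The helper names and the file paths in the usage note are hypothetical, not part of Chakra.

```python
import re
from collections import Counter
from pathlib import Path


def count_named_entries(text, needle="nccl:broadcast"):
    """Count "name" fields whose value contains `needle` (case-insensitive).

    Assumption: node names are stored as "name": "<value>" in the JSON text.
    Names containing escaped quotes are not handled by this simple pattern.
    """
    pattern = re.compile(
        r'"name"\s*:\s*"([^"]*' + re.escape(needle) + r'[^"]*)"',
        re.IGNORECASE,
    )
    return Counter(match.group(1) for match in pattern.finditer(text))


def compare_traces(link_path, converted_path, needle="nccl:broadcast"):
    """Return (matches in the trace_link file, matches in the converted trace)."""
    link_counts = count_named_entries(Path(link_path).read_text(), needle)
    converted_counts = count_named_entries(Path(converted_path).read_text(), needle)
    return sum(link_counts.values()), sum(converted_counts.values())
```

Calling `compare_traces("rank0_trace_link.json", "rank0_converted.json")` with your own file names returns the two counts side by side. A nonzero first value with a zero second value corresponds to what the screenshots show.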
