@jjasoncool

Summary

This pull request upgrades the Holistic Trace Analysis (HTA) dependency from commit d731cc2e2249976c97129d409a83bd53d93051f6 to version v0.5.0, which includes native AMD GPU support and resolves critical path analysis issues in ROCm environments.

Problem Addressed: The previous HTA version had a known issue where TraceCounters._get_queue_length_time_series_for_rank() could return None when processing ROCm/HIP traces, causing TypeError: 'NoneType' object is not subscriptable in hta/analyzers/critical_path_analysis.py (facebookresearch/HolisticTraceAnalysis, main branch). This required manual patching to handle the None case.
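For readers unfamiliar with this failure mode, here is a minimal sketch of how subscripting a None return value produces that TypeError, and what a None guard of the kind the manual patch needed might look like. It does not call HTA: get_queue_length_series and both callers are hypothetical stand-ins, and the dictionary layout is an assumption made only for illustration.

```python
from typing import Optional


def get_queue_length_series(rank: int, traces: dict) -> Optional[dict]:
    """Hypothetical stand-in for a per-rank lookup; NOT the real HTA API.

    Returns None when the rank has no queue-length data.
    """
    return traces.get(rank)


def last_queue_length_unguarded(rank: int, traces: dict) -> int:
    # Subscripting the result directly fails when the lookup returns None.
    series = get_queue_length_series(rank, traces)
    return series["queue_length"][-1]


def last_queue_length_guarded(rank: int, traces: dict, default: int = 0) -> int:
    # Assumed shape of a manual workaround: check for None before subscripting.
    series = get_queue_length_series(rank, traces)
    if series is None or not series["queue_length"]:
        return default
    return series["queue_length"][-1]


# Illustrative data only: rank 0 has a series, rank 1 has none.
traces = {0: {"queue_length": [1, 2, 1]}}

print(last_queue_length_guarded(0, traces))
print(last_queue_length_guarded(1, traces))
try:
    last_queue_length_unguarded(1, traces)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

The guarded variant falls back to a default instead of raising, which is the general pattern a None-handling patch follows; the actual patch in fix_hta_critical_path.py may differ.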

Solution: HTA v0.5.0 (released May 29, 2024) officially added "support for AMD GPUs" and includes fixes for queue length processing issues. The official codebase now properly handles queue length data on clipped dataframes by using the full trace (t_full) instead, eliminating the need for manual patches.

Changes Made:

Updated dockerfile to use HTA v0.5.0 instead of the legacy commit
Removed dependency on manual HTA patches (fix_hta_critical_path.py can now be safely deleted)
Updated project documentation to reflect the complete PyTorch → Chakra ET → ASTRA-sim workflow

Benefits:

Native ROCm/AMD GPU compatibility without manual patches
More stable critical path analysis for ROCm traces
Cleaner codebase with fewer workarounds
Better alignment with official HTA development

Test Plan

Testing Environment:

Docker environment with ROCm/PyTorch base image
AMD GPU hardware with ROCm 6.x drivers
Multi-GPU distributed training setup (2 GPUs)
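
One sanity check that could be run inside such a container is confirming that the installed HTA distribution is at least 0.5.0. The sketch below is not part of this pull request. It assumes the package is registered under the distribution name HolisticTraceAnalysis and that its metadata reports a dotted version string, which may not hold for an install from a git checkout.

```python
from importlib import metadata
from typing import Optional, Tuple

REQUIRED = (0, 5, 0)  # minimum HTA version this PR targets


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '0.5.0' or 'v0.5.0' into (0, 5, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.lstrip("v").split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def meets_requirement(installed: str, required: Tuple[int, int, int] = REQUIRED) -> bool:
    """True when the installed version is at least the required one."""
    return parse_version(installed) >= required


def installed_hta_version(dist_name: str = "HolisticTraceAnalysis") -> Optional[str]:
    """Return the installed version string, or None if the distribution is absent.

    The default distribution name is an assumption.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_hta_version()
    if version is None:
        print("HTA distribution not found")
    elif meets_requirement(version):
        print(f"HTA {version} satisfies >= 0.5.0")
    else:
        print(f"HTA {version} is older than 0.5.0")
```

A version check only confirms what is installed. It does not replace running critical path analysis on an actual ROCm trace.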
