@araina-amd
Contributor

  • Multinode scaling projection from baseline to target node count
  • Automatic config reduction for single-node benchmarking (PP and EP rescaling)
  • Integration with pipeline simulation for accurate baseline calculation
  • Per-layer communication estimation (TP AllReduce, MoE All-to-All)
  • Detailed communication breakdown with message sizes
  • Support for overlapped gradient all-reduce (default enabled)
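For illustration only, a per-layer TP AllReduce estimate can be sketched with the textbook ring all-reduce cost model. This is not taken from the PR: the function names, the activation-sized message assumption, and the bandwidth and latency parameters are all hypothetical.

```python
def tp_allreduce_bytes(seq_len, micro_batch, hidden, bytes_per_elem=2):
    """Assumed message size: one activation tensor per TP AllReduce (BF16 by default)."""
    return seq_len * micro_batch * hidden * bytes_per_elem


def ring_allreduce_seconds(message_bytes, world_size, bandwidth_bytes_per_s, latency_s=0.0):
    """Textbook ring all-reduce model: 2*(n-1) steps, each moving message_bytes / n."""
    if world_size <= 1:
        return 0.0  # nothing to reduce across a single rank
    steps = 2 * (world_size - 1)
    transfer = (steps / world_size) * message_bytes / bandwidth_bytes_per_s
    return steps * latency_s + transfer


if __name__ == "__main__":
    size = tp_allreduce_bytes(seq_len=4096, micro_batch=1, hidden=2048)
    # Hypothetical link: 100 GB/s per direction, 5 microseconds per step.
    t = ring_allreduce_seconds(size, world_size=8, bandwidth_bytes_per_s=100e9, latency_s=5e-6)
    print(f"message = {size / 2**20:.1f} MiB, estimated time = {t * 1e3:.3f} ms")
```

A real model would also need to distinguish intra-node from inter-node links, which is where a hardware config comes in.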

@yuankaichen-amd
Contributor

Let's separate the style/formatting changes from the functional changes and split them into two PRs (if the formatting changes are actually necessary).

@yuankaichen-amd
Contributor

As for the functional changes, several key things are missing from the code. Let's discuss them offline.

@araina-amd araina-amd marked this pull request as draft January 15, 2026 19:20
@araina-amd araina-amd changed the title Multinode projection with different parallelization strategies when single node is benchmarked [WIP] Multinode projection with different parallelization strategies when single node is benchmarked Jan 15, 2026
@araina-amd araina-amd force-pushed the dev/araina/multinode_performance_model branch from 53259c7 to 23ea7c6 Compare January 16, 2026 00:46
@yuankaichen-amd
Contributor

LGTM in general. I left some comments in the code, plus the following:

  1. The baseline (time, nodes) CLI input and its related code are not very useful. Since they are only used when printing results, I suggest removing them.

  2. Please expose PROJECTION_NNODES=4 as a CLI flag. If it is not specified, default to baseline_nodes, which should be calculated from pp/tp/ep/... in the config.

  3. Please document an example hardware config at the CLI level and describe what it should include. If the user doesn't provide one, what do we use? Can the collective model select config values based on the GPUs/NICs it detects on the node?
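A minimal sketch of the defaulting behaviour requested in item 2, assuming the flag is spelled `--target-nodes` and assuming the baseline is the smallest node count that fits `tp * pp * ep` GPUs. Both the helper names and that sizing rule are hypothetical, not the PR's actual logic.

```python
import argparse
import math


def baseline_nodes(tp, pp, ep, gpus_per_node=8):
    """Assumed rule: smallest node count that can hold tp * pp * ep GPUs."""
    return math.ceil(tp * pp * ep / gpus_per_node)


def resolve_target_nodes(argv, tp, pp, ep, gpus_per_node=8):
    """Return --target-nodes if given, otherwise fall back to the baseline node count."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--target-nodes", type=int, default=None)
    args = parser.parse_args(argv)
    if args.target_nodes is None:
        return baseline_nodes(tp, pp, ep, gpus_per_node)
    return args.target_nodes


if __name__ == "__main__":
    print(resolve_target_nodes([], tp=1, pp=3, ep=8))                       # 3
    print(resolve_target_nodes(["--target-nodes", "6"], tp=1, pp=3, ep=8))  # 6
```

Keeping the default as `None` rather than a fixed number makes it possible to tell "flag omitted" apart from an explicit user choice.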

@araina-amd araina-amd changed the title [WIP] Multinode projection with different parallelization strategies when single node is benchmarked Multinode projection with different parallelization strategies when single node is benchmarked Jan 24, 2026
@araina-amd araina-amd marked this pull request as ready for review January 24, 2026 01:47
@yuankaichen-amd
Contributor

Thanks, Anshu! I have some minor comments:

(1) Please fix the bot's findings, which are mostly unused variables.
(2) Can we move some of the functions in performance.py into separate files? It would improve readability.
(3) Please also add a README file.

…nce_projection

- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
…y accounted in the pipeline simulation model.
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
…d _run_pipeline_simulation_megatron_zb() to use actual Megatron zero-bubble scheduler (ILP-based) instead of simple heuristic scheduler.

Add custom_hardware_example.yaml for hardware configuration.
Also fix some print statements.
Usage:
	bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6
Projection accuracy for DeepSeek V2 Lite:
	- PP=3, EP=8 (3 nodes): Projected 6628ms vs Measured 6468ms = +2.5% error
	- PP=1, EP=16 (2 nodes): Projected 5337ms vs Measured 5276ms = +1.2% error
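The error figures above are consistent with a signed relative error of the projection against the measurement. A small sketch to reproduce them (the function name is hypothetical, and the formula is inferred from the reported numbers):

```python
def projection_error_pct(projected_ms, measured_ms):
    """Signed relative error of the projection, in percent of the measured time."""
    return (projected_ms - measured_ms) / measured_ms * 100.0


if __name__ == "__main__":
    # (label, projected ms, measured ms) as reported in the commit message
    cases = [
        ("PP=3, EP=8 (3 nodes)", 6628, 6468),
        ("PP=1, EP=16 (2 nodes)", 5337, 5276),
    ]
    for label, projected, measured in cases:
        print(f"{label}: {projection_error_pct(projected, measured):+.1f}%")
    # prints:
    # PP=3, EP=8 (3 nodes): +2.5%
    # PP=1, EP=16 (2 nodes): +1.2%
```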
@araina-amd araina-amd force-pushed the dev/araina/multinode_performance_model branch from 45b1952 to 5d6ac43 Compare January 28, 2026 01:12
- Fix import spacing (add blank lines after imports)
- Fix string quotes (single to double quotes)
- Fix trailing whitespace
- Fix function spacing (add blank lines between functions)
- Format all affected files to pass CI black check
@yuankaichen-amd
Contributor

LGTM. @wenxie-amd, could you please review it?
