Multinode projection with different parallelization strategies when single node is benchmarked #492
base: main
Conversation
araina-amd
commented
Jan 14, 2026
- Multinode scaling projection from baseline to target node count
- Automatic config reduction for single-node benchmarking (PP and EP rescaling)
- Integration with pipeline simulation for accurate baseline calculation
- Per-layer communication estimation (TP AllReduce, MoE All-to-All)
- Detailed communication breakdown with message sizes
- Support for overlapped gradient all-reduce (default enabled)
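The per-layer communication estimation mentioned above (TP AllReduce) can be illustrated with a minimal sketch. This is not the PR's implementation; it assumes a standard ring-AllReduce cost model, and the function name and parameters are hypothetical:

```python
def allreduce_time_ms(message_bytes: int, tp_size: int, bus_gbps: float) -> float:
    """Estimate ring AllReduce time for one tensor-parallel group.

    In a ring AllReduce each rank transfers 2*(N-1)/N of the message
    over the interconnect (reduce-scatter + all-gather phases).
    This is a sketch, not the code from this PR.
    """
    if tp_size <= 1:
        return 0.0  # no TP communication with a single rank
    volume = 2 * (tp_size - 1) / tp_size * message_bytes
    return volume / (bus_gbps * 1e9) * 1e3  # bytes / (B/s) -> s -> ms

# e.g. a 64 MiB activation AllReduce across TP=8 on a 50 GB/s link
t = allreduce_time_ms(64 * 2**20, 8, 50.0)  # ≈ 2.35 ms
```

Summing such per-layer estimates, together with the MoE All-to-All terms, gives the communication component of the multinode projection.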
Let's separate the style/formatting changes from the actual changes and make this into two PRs (if the formatting is actually necessary).
As for the actual changes, there are several key things missing from the code. Let's discuss offline.
Force-pushed from 53259c7 to 23ea7c6
LGTM in general. I left some comments in the code as well as below:
Thanks, Anshu! I have some minor comments: (1) please fix the bot's findings, mostly for unused variables;
…nce_projection
- Delete primus/core/projection/multinode_projection/ directory
- All multinode projection functionality is now in performance_projection/projection.py
- Communication calculation, hardware config loading, and projection logic consolidated
…he benchmarked time.
…y accounted in the pipeline simulation model.
…a_parallel_size to use PROJECTION_NNODES, fixed wgrad double-counting (set to 0.0), removed wgrad additions for IO layers, and added zero-bubble scheduler support with 50/50 B/W split when enable_zero_bubble=True.
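The 50/50 B/W split mentioned above can be sketched as follows. This is a minimal illustration of the heuristic described in the commit message, not the PR's code; the function name and signature are hypothetical:

```python
def split_backward(bwd_ms: float, enable_zero_bubble: bool) -> tuple[float, float]:
    """Split a measured backward time into input-grad (B) and
    weight-grad (W) phases for pipeline simulation.

    With zero-bubble scheduling, the commit assumes an even 50/50
    split so W work can be deferred to fill pipeline bubbles;
    otherwise the full time stays in B and W is zero.
    """
    if enable_zero_bubble:
        return bwd_ms * 0.5, bwd_ms * 0.5
    return bwd_ms, 0.0
```

Decoupling B from W is what lets a zero-bubble scheduler move the weight-gradient work into otherwise idle pipeline slots.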
…d _run_pipeline_simulation_megatron_zb() to use the actual Megatron zero-bubble scheduler (ILP-based) instead of the simple heuristic scheduler. Add custom_hardware_example.yaml for hardware configuration, plus fixing some prints.

Usage:
bash runner/primus-cli direct --script primus/cli/main.py -- projection performance --config examples/megatron/configs/MI300X/deepseek_v2_lite-BF16-pretrain.yaml --target-nodes 6

Projection accuracy for DeepSeek V2 Lite:
- PP=3, EP=8 (3 nodes): Projected 6628ms vs Measured 6468ms = +2.5% error
- PP=1, EP=16 (2 nodes): Projected 5337ms vs Measured 5276ms = +1.2% error
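The error percentages quoted above follow from the usual relative-error formula, (projected - measured) / measured. A quick check (illustrative helper, not code from this PR):

```python
def projection_error_pct(projected_ms: float, measured_ms: float) -> float:
    """Signed relative error of a projection versus measurement, in percent."""
    return (projected_ms - measured_ms) / measured_ms * 100.0

# PP=3, EP=8:  (6628 - 6468) / 6468 -> about +2.5%
# PP=1, EP=16: (5337 - 5276) / 5276 -> about +1.2%
```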
Force-pushed from 45b1952 to 5d6ac43
- Fix import spacing (add blank lines after imports)
- Fix string quotes (single to double quotes)
- Fix trailing whitespace
- Fix function spacing (add blank lines between functions)
- Format all affected files to pass CI black check
LGTM. @wenxie-amd, can you please give it a review?
… and fix total_gpus calculation in collective args.