perf: Add DeepEP support to Megatron Policy #1645
base: main
Conversation
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
📝 Walkthrough
This PR adds support for three Megatron Mixture-of-Experts (MoE) configuration fields: moe_enable_deepep, moe_token_dispatcher_type, and moe_shared_expert_overlap.
Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings), ✅ Passed checks (2 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/configs/grpo_math_1B_megatron.yaml (1 hunks)
- examples/configs/sft_openmathinstruct2_megatron.yaml (1 hunks)
- nemo_rl/models/policy/workers/megatron_policy_worker.py (1 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
!(**/tests/**|**/test_*.py|**/test_*.sh)
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year
Files:
- examples/configs/grpo_math_1B_megatron.yaml
- nemo_rl/models/policy/workers/megatron_policy_worker.py
- examples/configs/sft_openmathinstruct2_megatron.yaml
- pyproject.toml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Conform code to Python 3.12+
Indent code with 4 spaces. Do not use tabs
Use snake_case for file names
Use PascalCase for class names
Use snake_case for function and method names
Use snake_case for local variables
Prefix variable names that start with a number with 'k' (e.g., k_99th_percentile)
Use upper snake_case with 'G' prefix for global variables (e.g., G_MY_GLOBAL)
Use upper snake_case for constants
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
Prefer docstrings over comments for interfaces that may be used outside a file
Reserve comments for code within a function or interfaces that are local to a file
If a piece of code is commented out, include a comment describing its usage and why it's commented out. Remove debug comments before merging
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx
Avoid using reflection when functionality can be easily achieved without reflection
When using try-except blocks, limit the except clause to the smallest set of specific errors possible
When using try-except blocks for duck-typing, keep the body of the try as small as possible and use the else block for logic
YAML is the single source of truth for configuration defaults. Do not set non-None defaults in code for configuration values
For required configuration attributes, access config directly and expect presence (e.g., policy_cfg['precision']) without hidden defaults
Use typing.NotRequired to mark optional attributes in TypedDict for configuration
When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, and recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml
Follow the Google Python Style Guide for Python code
Files:
nemo_rl/models/policy/workers/megatron_policy_worker.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
For any source file under nemo_rl/*.py that defines a class or function decorated with @ray.remote, add a coverage pragma (# pragma: no cover) because these run in separate Ray processes
Files:
nemo_rl/models/policy/workers/megatron_policy_worker.py
**/*.{py,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)
Files:
nemo_rl/models/policy/workers/megatron_policy_worker.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
Applied to files:
- examples/configs/grpo_math_1B_megatron.yaml
- examples/configs/sft_openmathinstruct2_megatron.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (2)
examples/configs/grpo_math_1B_megatron.yaml (1)
97-99: Consistent MoE configuration across example configs. The MoE configuration keys are consistently defined with the same conservative defaults as in sft_openmathinstruct2_megatron.yaml. This ensures backward compatibility and predictable behavior across different algorithm configurations.
pyproject.toml (1)
101-101: DeepEP dependency configuration is valid. The specified commit (e3908bf5bd0cc6265bcb225d15cd8c996d4759ef) exists in the DeepEP repository and is accessible as a stable reference point for production use.
moe_enable_deepep: false
moe_token_dispatcher_type: "allgather"
moe_shared_expert_overlap: false
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for TypedDict definitions and MoE-related config keys
echo "=== Searching for MegatronConfig TypedDict definition ==="
find . -name "*.py" -type f ! -path "./tests/*" ! -path "./.git/*" | xargs grep -l "class MegatronConfig" | head -5
echo -e "\n=== Searching for MoE config keys in Python files ==="
rg -n "(moe_enable_deepep|moe_token_dispatcher_type|moe_shared_expert_overlap)" --type=py | head -20
echo -e "\n=== Searching for MoE-related TypedDict or config definitions ==="
rg -n -B2 -A10 "moe_enable_deepep|moe_token_dispatcher_type" --type=py | head -40
Repository: NVIDIA-NeMo/RL
Length of output: 2695
🏁 Script executed:
#!/bin/bash
# Examine the PolicyConfig TypedDict definition in the policy module
echo "=== Reading PolicyConfig TypedDict definition ==="
cat nemo_rl/models/policy/__init__.py | head -200
echo -e "\n=== Searching for MegatronConfig definition ==="
rg -n -A30 "class MegatronConfig.*TypedDict" --type=py
Repository: NVIDIA-NeMo/RL
Length of output: 9554
🏁 Script executed:
#!/bin/bash
# Check the full MegatronConfig TypedDict to see if we missed the MoE keys
echo "=== Reading full MegatronConfig definition ==="
rg -n "class MegatronConfig" -A 40 --type=py nemo_rl/models/policy/__init__.pyRepository: NVIDIA-NeMo/RL
Length of output: 2137
🏁 Script executed:
#!/bin/bash
# Final verification - check if these keys are defined anywhere in TypedDict context
echo "=== Checking if MoE keys appear in any TypedDict definitions ==="
rg -n "moe_enable_deepep|moe_token_dispatcher_type|moe_shared_expert_overlap" --type=py -B3 -A3
echo -e "\n=== Check the actual YAML file to see the context ==="
head -100 examples/configs/sft_openmathinstruct2_megatron.yaml | tail -20
Repository: NVIDIA-NeMo/RL
Length of output: 2417
Add MoE configuration keys to MegatronConfig TypedDict definition.
The keys moe_enable_deepep, moe_token_dispatcher_type, and moe_shared_expert_overlap are used in the YAML exemplar and referenced in megatron_policy_worker.py (lines 594-601), but they are missing from the MegatronConfig TypedDict definition in nemo_rl/models/policy/__init__.py.
Add these keys to the TypedDict with:
- Type annotations
- A NotRequired marker (if optional)
- Inline documentation explaining their purpose and valid values
For reference: moe_token_dispatcher_type accepts values like "allgather", "alltoall", or "flex"; moe_enable_deepep is a boolean; moe_shared_expert_overlap is a boolean.
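To make this concrete, here is a minimal sketch of what the TypedDict additions could look like, assuming the keys are marked NotRequired; the surrounding keys, placement, and docstring style in nemo_rl/models/policy/__init__.py are not reproduced here, and the inline comments are illustrative only.

from typing import NotRequired, TypedDict

class MegatronConfig(TypedDict):
    # ... existing Megatron keys elided ...

    # Enable DeepEP-backed token dispatch for MoE layers (example YAML default: false)
    moe_enable_deepep: NotRequired[bool]
    # Token dispatcher type: "allgather", "alltoall", or "flex" (example YAML default: "allgather")
    moe_token_dispatcher_type: NotRequired[str]
    # Overlap shared-expert computation with token dispatch (example YAML default: false)
    moe_shared_expert_overlap: NotRequired[bool]

Keeping the keys NotRequired preserves older configs; making them required instead would match the direct self.cfg["megatron_cfg"][...] access in the worker diff further below.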
🤖 Prompt for AI Agents
In nemo_rl/models/policy/__init__.py around the MegatronConfig TypedDict
definition (add near existing config keys), the TypedDict is missing the MoE
keys referenced in examples and megatron_policy_worker.py; add the following
entries to MegatronConfig: moe_enable_deepep: NotRequired[bool] with inline
docstring "Enable DeepEP optimization for MoE (true/false)";
moe_token_dispatcher_type: NotRequired[str] with inline docstring "Token
dispatcher type for MoE; valid values: 'allgather', 'alltoall', or 'flex'";
moe_shared_expert_overlap: NotRequired[bool] with inline docstring "Allow shared
expert overlap across MoE experts (true/false)". Ensure you import NotRequired
from typing_extensions (or typing if available) and keep styling consistent with
neighboring keys.
model_cfg.moe_enable_deepep = self.cfg["megatron_cfg"][
    "moe_enable_deepep"
]
model_cfg.moe_token_dispatcher_type = self.cfg["megatron_cfg"][
    "moe_token_dispatcher_type"
]
model_cfg.moe_shared_expert_overlap = self.cfg["megatron_cfg"][
    "moe_shared_expert_overlap"
]
🧩 Analysis chain
🏁 Script executed:
# Find the megatron_policy_worker.py file and check the context around lines 594-602
find . -name "megatron_policy_worker.py" -type f
# Also search for config-related files and TypedDict definitions
find . -path "*/nemo_rl/*" -name "*config*" -type f | head -20
# Search for example YAML configs
find . -path "*/examples/configs/*" -name "*.yaml" -o -name "*.yml" | head -20Repository: NVIDIA-NeMo/RL
Length of output: 979
🏁 Script executed:
# Search for references to moe_enable_deepep, moe_token_dispatcher_type, moe_shared_expert_overlap
grep -r "moe_enable_deepep" . --include="*.py" --include="*.yaml" --include="*.yml" 2>/dev/null
grep -r "moe_token_dispatcher_type" . --include="*.py" --include="*.yaml" --include="*.yml" 2>/dev/null
grep -r "moe_shared_expert_overlap" . --include="*.py" --include="*.yaml" --include="*.yml" 2>/dev/nullRepository: NVIDIA-NeMo/RL
Length of output: 1246
🏁 Script executed:
# Find MegatronConfig TypedDict definition
find . -path "*/nemo_rl/*" -name "*.py" -type f | xargs grep -l "class MegatronConfig\|MegatronConfig.*TypedDict" | head -5
# Also check the specific file to see surrounding code
head -700 ./nemo_rl/models/policy/workers/megatron_policy_worker.py | tail -150
Repository: NVIDIA-NeMo/RL
Length of output: 7620
🏁 Script executed:
# Find the MegatronConfig TypedDict definition
grep -r "class MegatronConfig" ./nemo_rl --include="*.py" -A 50 | head -100
# Also check the policy __init__.py which might import/define it
grep -r "moe_enable_deepep\|moe_token_dispatcher_type\|moe_shared_expert_overlap" ./nemo_rl/models/policy/__init__.py -B 5 -A 5Repository: NVIDIA-NeMo/RL
Length of output: 4121
🏁 Script executed:
# Search for the moe-related keys in the TypedDict definition
grep -r "moe_enable_deepep\|moe_token_dispatcher_type\|moe_shared_expert_overlap\|moe_router_dtype\|moe_permute_fusion" ./nemo_rl/models/policy/__init__.py -B 2 -A 2
# Get more context from the MegatronConfig class definition
sed -n '/^class MegatronConfig/,/^class [A-Z]/p' ./nemo_rl/models/policy/__init__.py | head -150
Repository: NVIDIA-NeMo/RL
Length of output: 161
🏁 Script executed:
# Get the full MegatronConfig TypedDict definition - reading the file more carefully
cat ./nemo_rl/models/policy/__init__.py | grep -A 200 "^class MegatronConfig(TypedDict):" | grep -B 5 "^class " | head -200
Repository: NVIDIA-NeMo/RL
Length of output: 1041
🏁 Script executed:
# Read the file directly and extract MegatronConfig definition
python3 << 'EOF'
with open('./nemo_rl/models/policy/__init__.py', 'r') as f:
    content = f.read()
# Find MegatronConfig class
import re
match = re.search(r'class MegatronConfig\(TypedDict\):(.*?)(?=\nclass \w+)', content, re.DOTALL)
if match:
    moe_config = match.group(0)
    # Print lines containing moe
    for line in moe_config.split('\n'):
        if 'moe' in line.lower() or line.strip().startswith(('moe_', 'freeze_', 'optimizer:', 'scheduler:', 'distributed')):
            print(line)
EOF
Repository: NVIDIA-NeMo/RL
Length of output: 251
🏁 Script executed:
# Read the entire MegatronConfig class definition with line numbers
python3 << 'EOF'
import re
with open('./nemo_rl/models/policy/__init__.py', 'r') as f:
    lines = f.readlines()
in_megatron_config = False
class_indent = 0
for i, line in enumerate(lines, 1):
    if 'class MegatronConfig(TypedDict):' in line:
        in_megatron_config = True
        class_indent = len(line) - len(line.lstrip())
        print(f"{i}: {line}", end='')
    elif in_megatron_config:
        current_indent = len(line) - len(line.lstrip())
        # Stop when we hit a new class at the same or lower indent level
        if line.strip().startswith('class ') and current_indent <= class_indent:
            break
        print(f"{i}: {line}", end='')
EOF
Repository: NVIDIA-NeMo/RL
Length of output: 1676
🏁 Script executed:
# Search for where these moe keys are defined in TypedDict
python3 << 'EOF'
import re
with open('./nemo_rl/models/policy/__init__.py', 'r') as f:
    content = f.read()
# Look for all TypedDict definitions that might have these keys
for match in re.finditer(r'(class \w+(?:Config)?\(TypedDict\):.*?)(?=\nclass |\Z)', content, re.DOTALL):
    class_def = match.group(0)
    if any(key in class_def for key in ['moe_enable_deepep', 'moe_token_dispatcher_type', 'moe_shared_expert_overlap', 'moe_router_dtype', 'moe_permute_fusion']):
        print(class_def[:500])
        print("\n" + "="*80 + "\n")
EOF
# If not found, let's see if they're defined elsewhere
grep -r "moe_enable_deepep" ./nemo_rl --include="*.py" -B 3 -A 1 | head -30
Repository: NVIDIA-NeMo/RL
Length of output: 637
🏁 Script executed:
# Let's check if these keys exist anywhere in the config definition
grep -r "moe_router_dtype\|moe_permute_fusion\|moe_router_load_balancing_type\|moe_router_bias_update_rate" ./nemo_rl/models/policy/__init__.py
# Let's see the full context around line 146 and beyond
sed -n '120,200p' ./nemo_rl/models/policy/__init__.py
Repository: NVIDIA-NeMo/RL
Length of output: 3372
🏁 Script executed:
# Check if there are any other files that extend or add to MegatronConfig
grep -r "moe_enable_deepep\|moe_token_dispatcher_type\|moe_shared_expert_overlap" ./nemo_rl --include="*.py" | grep -i "typeddict\|typing"
# Check the coding guidelines mentioned in the learning
cat CODING_GUIDELINES.md 2>/dev/null | grep -A 5 "TypedDict\|config key"
# Count all the megatron example configs
find ./examples/configs -name "*megatron*" -type f
# Verify all three keys are in every megatron example config
for file in $(find ./examples/configs -name "*megatron*" -type f); do
echo "=== $file ==="
grep -c "moe_enable_deepep\|moe_token_dispatcher_type\|moe_shared_expert_overlap" "$file" || echo "MISSING KEYS"
done
Repository: NVIDIA-NeMo/RL
Length of output: 5715
Add moe_enable_deepep, moe_token_dispatcher_type, and moe_shared_expert_overlap to MegatronConfig TypedDict and include in all exemplar configs.
These three MoE configuration keys are accessed directly without fallbacks (lines 594-602), but they are missing from the TypedDict definition and absent from 30 of 32 megatron example configs. Per coding guidelines, new config keys must be added to the TypedDict and reflected in exemplar YAMLs under examples/configs/*.yaml. Either add these as required attributes to MegatronConfig and ensure they appear in all exemplar configs (grpo_math_1B_megatron.yaml, sft_*.yaml, vlm_grpo_*.yaml, etc.), or mark them as NotRequired and provide appropriate defaults.
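For illustration, a sketch of the first option (required keys, mirroring the worker diff above) with an added consistency check follows; the helper name apply_moe_dispatch_settings is hypothetical, and the expectation that DeepEP is paired with the "flex" dispatcher is an assumption about Megatron-Core's DeepEP integration rather than something this PR asserts.

def apply_moe_dispatch_settings(model_cfg, megatron_cfg: dict) -> None:
    """Copy the MoE dispatch keys from the policy config onto the Megatron model config.

    Assumes the three keys are required entries in megatron_cfg (first option above).
    """
    model_cfg.moe_enable_deepep = megatron_cfg["moe_enable_deepep"]
    model_cfg.moe_token_dispatcher_type = megatron_cfg["moe_token_dispatcher_type"]
    model_cfg.moe_shared_expert_overlap = megatron_cfg["moe_shared_expert_overlap"]

    # Assumption: DeepEP is expected to run with the "flex" dispatcher; fail early
    # with a clear message instead of deep inside model construction.
    if model_cfg.moe_enable_deepep and model_cfg.moe_token_dispatcher_type != "flex":
        raise ValueError(
            "moe_enable_deepep=true is expected to be used with "
            "moe_token_dispatcher_type='flex'"
        )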
🤖 Prompt for AI Agents
In nemo_rl/models/policy/workers/megatron_policy_worker.py around lines 594-602,
the code reads three MoE keys directly (moe_enable_deepep,
moe_token_dispatcher_type, moe_shared_expert_overlap) but these keys are not
declared in the MegatronConfig TypedDict and are missing from most example
YAMLs; add these keys to the MegatronConfig TypedDict (either as required
attributes or as NotRequired) and update all exemplar configs under
examples/configs/*.yaml so each file includes these keys (if NotRequired, give
sensible defaults when reading the config or in the TypedDict comments; if
required, add explicit values in every exemplar). Ensure the defaults/types
match usage in megatron_policy_worker (boolean for moe_enable_deepep, string for
moe_token_dispatcher_type, boolean for moe_shared_expert_overlap) and
run config validation/tests to confirm no missing-key errors.
guyueh1
left a comment
LGTM
pyproject.toml
Outdated
# https://github.com/NVIDIA/TransformerEngine/blob/v2.3/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L108
# https://github.com/facebookresearch/xformers/blob/8354497deb2c04c67fbb2e2ad911e86530da0e90/xformers/ops/fmha/flash.py#L76
"flash-attn==2.8.1",
"deep_ep @ git+https://github.com/deepseek-ai/DeepEP.git@e3908bf5bd0cc6265bcb225d15cd8c996d4759ef",
for v0.5 we are waiting on https://github.com/NVIDIA-NeMo/RL/pull/1470/changes#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711
so that PR will change to a different DeepEP version. Can you verify that this new one works? Note there are two changes you need to make:
Sounds good. I can test with newer deep_ep and push changes once upstream is updated
@terrykong Upgraded the DeepEP variant but the latest run seems to have removed a bunch of dependencies from uv.lock. Is that expected?
I think it is fine to let uv do whatever with uv.lock
@parthmannan please fix lint and then attach the CI:L2 label again to kick off tests
What does this PR do?
Adds DeepEP support to megatron policy for training MoE workloads. Disabled by default in the example configs so no change to current behavior.
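As an illustration only (not a snippet from this PR), enabling the feature amounts to flipping the new megatron_cfg keys; the "flex" dispatcher pairing below is an assumption about Megatron-Core's DeepEP path.

# Hypothetical override of the new megatron_cfg keys; the example configs ship
# with DeepEP disabled (false / "allgather" / false).
deepep_overrides = {
    "moe_enable_deepep": True,
    "moe_token_dispatcher_type": "flex",  # assumed pairing for DeepEP
    "moe_shared_expert_overlap": False,
}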
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
New Features
Chores