
Conversation

@srivatsankrishnan
Contributor

Summary

A misconfiguration in the Qwen TOML file for the B200 system was found during VER; this PR fixes it.
RM: https://redmine.mellanox.com/issues/4840286

Test Plan

CI/CD
Internal B200 Cluster

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

A configuration parameter in the Megatron bridge test setup for the Qwen 30B model was updated: `gpus_per_node` increased from 4 to 8, while the total `num_gpus` count remains unchanged at 8.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Megatron Bridge GPU Configuration**<br/>`conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml` | Updated `gpus_per_node` from 4 to 8 in the `[cmd_args]` section |
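For context, a minimal sketch of what the corrected `[cmd_args]` section likely looks like; only the two keys named in this PR are shown, and any other keys in the real file are omitted.

```toml
# conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml (excerpt)
[cmd_args]
gpus_per_node = 8   # was 4; B200 nodes have 8 GPUs each
num_gpus = 8        # unchanged: one full node
```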

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

Eight GPUs now dance on each node so bright,
The Qwen takes flight with computational might,
From four to eight, a simple reconfig tweak,
This bunny approves of changes so sleek! 🐰✨

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'B200 M-bridge misconfig' is related to the changeset and refers to fixing a misconfiguration in the B200 Megatron bridge configuration file, which is the primary change. |
| Description check | ✅ Passed | The description is clearly related to the changeset, explaining that it fixes a misconfiguration in the Qwen TOML file for the B200 system found during VER, with a reference to the tracking issue. |




@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review January 16, 2026 20:31
@greptile-apps
Contributor

greptile-apps bot commented Jan 16, 2026

Greptile Summary

Fixed a misconfiguration in the B200 Qwen 30B configuration file where `gpus_per_node` was incorrectly set to 4 instead of 8.

  • Corrected gpus_per_node from 4 to 8 to properly match the B200 system architecture (8 GPUs per node)
  • This aligns with num_gpus = 8 (single node with 8 GPUs)
  • The fix ensures proper GPU resource allocation for Megatron-Bridge workloads on B200 systems

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change corrects a clear misconfiguration by updating a single parameter value to match the expected B200 hardware configuration. The fix aligns gpus_per_node with the system's actual 8 GPUs per node and is consistent with the num_gpus setting.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| `conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml` | Fixed `gpus_per_node` from 4 to 8 to match the B200 system configuration of 8 GPUs per node |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CloudAI
    participant SlurmSystem
    participant MegatronBridge
    participant B200Cluster

    User->>CloudAI: Load config: megatron_bridge_qwen_30b.toml
    CloudAI->>CloudAI: Parse cmd_args
    Note over CloudAI: gpus_per_node = 8 (fixed)<br/>num_gpus = 8<br/>gpu_type = "b200"
    CloudAI->>MegatronBridge: Generate launcher command
    MegatronBridge->>MegatronBridge: Validate GPU configuration
    Note over MegatronBridge: Expects 8 GPUs per node<br/>for B200 systems
    MegatronBridge->>SlurmSystem: Submit batch job (1 node, 8 GPUs)
    SlurmSystem->>B200Cluster: Allocate resources
    Note over B200Cluster: Single node with<br/>8 B200 GPUs
    B200Cluster->>SlurmSystem: Resources allocated
    SlurmSystem->>MegatronBridge: Job ID
    MegatronBridge->>CloudAI: Job submitted successfully
    CloudAI->>User: Training started on B200
```

@alexmanle alexmanle added the bug Something isn't working label Jan 16, 2026
@srivatsankrishnan srivatsankrishnan merged commit 7663ca7 into NVIDIA:main Jan 16, 2026
5 checks passed