
Conversation

@srivatsankrishnan
Contributor

Summary

A misconfiguration in the Qwen TOML file for the B200 system was found during VER; this PR fixes it.
RM: https://redmine.mellanox.com/issues/4840286

Test Plan

CI/CD
Internal B200 Cluster

Additional Notes

@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

A configuration parameter in the Megatron bridge test setup for the Qwen 30B model was updated: `gpus_per_node` increased from 4 to 8, while the total `num_gpus` count remains unchanged at 8.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Megatron Bridge GPU Configuration**<br/>`conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml` | Updated `gpus_per_node` from 4 to 8 in the `[cmd_args]` section |
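For context, a minimal sketch of what the corrected `[cmd_args]` section likely looks like; only the two keys named in this PR are shown, and any other keys in the real file are omitted.

```toml
# conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml (excerpt)
[cmd_args]
gpus_per_node = 8   # was 4; B200 nodes have 8 GPUs each
num_gpus = 8        # unchanged: one full node
```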

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

Eight GPUs now dance on each node so bright,
The Qwen takes flight with computational might,
From four to eight, a simple reconfig tweak,
This bunny approves of changes so sleek! 🐰✨

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'B200 M-bridge misconfig' is related to the changeset and refers to fixing a misconfiguration in the B200 Megatron bridge configuration file, which is the primary change. |
| Description check | ✅ Passed | The description is clearly related to the changeset, explaining that it fixes a misconfiguration in the Qwen TOML file for the B200 system found during VER, with a reference to the tracking issue. |




@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review January 16, 2026 20:31
@greptile-apps
Contributor

greptile-apps bot commented Jan 16, 2026

Greptile Summary

Fixed a misconfiguration in the B200 Qwen 30B configuration file where `gpus_per_node` was incorrectly set to 4 instead of 8.

  • Corrected gpus_per_node from 4 to 8 to properly match the B200 system architecture (8 GPUs per node)
  • This aligns with num_gpus = 8 (single node with 8 GPUs)
  • The fix ensures proper GPU resource allocation for Megatron-Bridge workloads on B200 systems

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The change corrects a clear misconfiguration by updating a single parameter value to match the expected B200 hardware configuration. The fix aligns gpus_per_node with the system's actual 8 GPUs per node and is consistent with the num_gpus setting.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| `conf/experimental/megatron_bridge/test/b200/megatron_bridge_qwen_30b.toml` | Fixed `gpus_per_node` from 4 to 8 to match the B200 system configuration of 8 GPUs per node |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant CloudAI
    participant SlurmSystem
    participant MegatronBridge
    participant B200Cluster

    User->>CloudAI: Load config: megatron_bridge_qwen_30b.toml
    CloudAI->>CloudAI: Parse cmd_args
    Note over CloudAI: gpus_per_node = 8 (fixed)<br/>num_gpus = 8<br/>gpu_type = "b200"
    CloudAI->>MegatronBridge: Generate launcher command
    MegatronBridge->>MegatronBridge: Validate GPU configuration
    Note over MegatronBridge: Expects 8 GPUs per node<br/>for B200 systems
    MegatronBridge->>SlurmSystem: Submit batch job (1 node, 8 GPUs)
    SlurmSystem->>B200Cluster: Allocate resources
    Note over B200Cluster: Single node with<br/>8 B200 GPUs
    B200Cluster->>SlurmSystem: Resources allocated
    SlurmSystem->>MegatronBridge: Job ID
    MegatronBridge->>CloudAI: Job submitted successfully
    CloudAI->>User: Training started on B200
```

@alexmanle alexmanle added the bug Something isn't working label Jan 16, 2026
@srivatsankrishnan srivatsankrishnan merged commit 7663ca7 into NVIDIA:main Jan 16, 2026
5 checks passed