Conversation

@guyueh1
Contributor

@guyueh1 guyueh1 commented Dec 28, 2025

What does this PR do ?

This is a temporary fix for the CPU OOM issue we observe when saving large model checkpoints (for example, DeepSeek-V3). This PR forces all checkpoint saving to use a single thread when the program runs on an Arm platform, so DGX H100 behavior does not change, but on GB200 you may observe a slowdown. A permanent fix that sets the thread count based on a model-size threshold is in progress.
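The heuristic described above can be sketched as follows. This is a hypothetical helper, not the actual code in this PR; the function name and the default thread count are made up for illustration:

```python
import platform

def checkpoint_save_threads(default_threads: int = 8) -> int:
    """Pick the thread count for checkpoint saving.

    Sketch of the temporary fix: force single-threaded saving on Arm
    platforms (e.g. GB200) to avoid CPU OOM when serializing very large
    checkpoints, while keeping the existing multi-threaded behavior on
    x86_64 (e.g. DGX H100).
    """
    # platform.machine() reports "aarch64" on Linux/Arm and "arm64" on macOS.
    if platform.machine().lower() in ("aarch64", "arm64"):
        return 1
    return default_threads
```

The mentioned permanent fix would presumably replace the platform check with a model-size threshold, so large checkpoints are saved single-threaded on any platform.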

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated internal dependency versions.


Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 self-assigned this Dec 28, 2025
@guyueh1 guyueh1 requested a review from a team as a code owner December 28, 2025 18:50
@guyueh1 guyueh1 requested a review from terrykong December 28, 2025 18:50
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 141c6bf (PR #1703 from fix_gb200_dpsk_ckpt)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@coderabbitai
Contributor

coderabbitai bot commented Dec 28, 2025

📝 Walkthrough

The pull request updates the submodule pointer for 3rdparty/Megatron-LM-workspace/Megatron-LM to a newer commit. This is a dependency version bump with no code or functional modifications to the repository itself.

Changes

  • Submodule Update (3rdparty/Megatron-LM-workspace/Megatron-LM): submodule commit pointer updated from 25a62ed to b73ae5c

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Suggested reviewers

  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Test Results For Major Changes (⚠️ Warning): Major repository initialization with 120,414 lines across 654 files lacks documented test results and validation in the PR description. Resolution: complete the pre-check items, document test results showing functionality across platforms, provide performance metrics, and update the PR description to reflect testing completion.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The PR title accurately describes the main change: implementing single-thread checkpoint save on GB200 to address CPU OOM issues. It directly relates to the primary objective mentioned in the PR objectives.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f017fd8 and 141c6bf.

📒 Files selected for processing (1)
  • 3rdparty/Megatron-LM-workspace/Megatron-LM
🧰 Additional context used
📓 Path-based instructions (1)
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • 3rdparty/Megatron-LM-workspace/Megatron-LM
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (1)
3rdparty/Megatron-LM-workspace/Megatron-LM (1)

1-1: Verify the new submodule commit and confirm it aligns with PR objectives.

The submodule pointer was updated, but additional context is needed. The PR objectives describe code changes to fix CPU OOM when saving large model checkpoints on GB200 (single-thread checkpoint save), but the provided files only show this submodule version bump.

Please clarify the following:

  1. Commit verification: Confirm that the new commit b73ae5cdab9d409fcface2b2f3c375710abe6911 exists and contains the GB200 checkpoint save optimization mentioned in the PR objectives.
  2. Scope completeness: Are there other modified files in this repository (not submodule updates) that implement the single-thread checkpoint save fix? If so, they should be included in this review.
  3. Downstream changes: Provide a summary of what changed in the Megatron-LM repository at the new commit hash to confirm it addresses the stated issue.
  4. Testing: Confirm that the fix was tested on GB200/Arm platforms and that checkpoint saves complete without CPU OOM.


@guyueh1 guyueh1 added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Dec 28, 2025
@terrykong previously approved these changes Dec 29, 2025
@terrykong terrykong enabled auto-merge (squash) December 29, 2025 19:26
@ZhiyuLi-Nvidia previously approved these changes Dec 30, 2025
@terrykong
Contributor

@guyueh1 there was an OOM in a unit test

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1
Contributor Author

guyueh1 commented Jan 3, 2026

@terrykong It is odd; I tried running the test locally on H100 and it doesn't fail. It is hard to imagine why my changes to MLM would cause this test to fail, because this test only uses VllmGeneration. Anyway, I reduced the gpu_memory_utilization in this test's config in the hope that it will pass in CI.

@guyueh1 guyueh1 dismissed stale reviews from ZhiyuLi-Nvidia and terrykong via 355d270 January 3, 2026 00:29
@guyueh1 guyueh1 requested a review from a team as a code owner January 3, 2026 00:29
@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Jan 3, 2026
@github-actions

github-actions bot commented Jan 3, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: 355d270 (PR #1703 from fix_gb200_dpsk_ckpt)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨


@terrykong
Contributor

terrykong commented Jan 3, 2026

@guyueh1 It might be the CI infra plus some code change. I'm seeing our nightlies are now failing, so I'm trying to bisect on an A100.

@terrykong
Contributor

@guyueh1 Looks like it was a zombie process. @chtruong814 has fixed it, so let's retry after you revert the gpu_memory_utilization change.

@guyueh1
Contributor Author

guyueh1 commented Jan 5, 2026

> @guyueh1 looks like it was a zombie process. @chtruong814 has fixed so let's try after you reverting the gpu util change

Reverted, try deployment again.

@github-actions

github-actions bot commented Jan 5, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: b9eaa25 (PR #1703 from fix_gb200_dpsk_ckpt)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Jan 5, 2026
@github-actions

github-actions bot commented Jan 5, 2026

✅ Submodule Fast-Forward Check Results

Check based on commit: f37f830 (PR #1703 from fix_gb200_dpsk_ckpt)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨


Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests GB200 r0.5.0
