fix: on GB200 use single-thread checkpoint save to avoid Cpu OOM #1703
base: main
Conversation
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 there was an OOM in the unit test
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@terrykong It is odd; I tried running the test locally on an H100 but it doesn't fail. It is hard to imagine why my changes to MLM would cause this test to fail, because this test only uses VllmGeneration. Anyway, I reduced the gpu_memory_utilization in this test's config in the hope that it will pass in CI.
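For reference, this kind of tweak can be sketched as follows. The key names below are illustrative only; the actual test config lives in the repo's own config files and may be structured differently.

```python
# Sketch of lowering vLLM's GPU memory fraction in a test config so that
# CI GPUs shared with other processes have more headroom. The nesting and
# key names here are hypothetical, not the repo's actual schema; only
# gpu_memory_utilization itself is a real vLLM parameter (a fraction of
# total GPU memory the engine is allowed to reserve).
test_config = {
    "generation": {
        "vllm_cfg": {
            # lowered from a typical 0.9 to leave headroom in CI
            "gpu_memory_utilization": 0.6,
        }
    }
}

print(test_config["generation"]["vllm_cfg"]["gpu_memory_utilization"])
```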
@guyueh1 it might be the CI infra plus some code change. I'm seeing our nightlies are now failing, so I'm trying to bisect on an A100.
@guyueh1 looks like it was a zombie process. @chtruong814 has fixed it, so let's try again after you revert the gpu util change.
Reverted; try the deployment again.
What does this PR do?
This is a temporary fix for the CPU OOM issue we observe when saving large model checkpoints (example: DeepSeek-V3). This PR forces all checkpoint saving to use a single thread when the program runs on an Arm platform, so DGX H100 behavior will not change, but on GB200 you may observe a slowdown. A permanent fix that sets the thread count based on a model-size threshold is in progress.
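The platform gate described above can be sketched as below. This is a minimal illustration, not the repo's actual code: the function name, the default of 8 threads, and the decision logic are all assumptions for this sketch; only the `platform.machine()` call and the Arm machine strings are standard Python behavior.

```python
import platform

def checkpoint_save_num_threads(default: int = 8) -> int:
    """Choose the thread count for checkpoint save (illustrative sketch).

    On Arm hosts (e.g. GB200's Grace CPU, which reports "aarch64"), fall
    back to a single thread to bound peak host memory during the save.
    Keep the multi-threaded default everywhere else (e.g. x86 DGX H100),
    so existing behavior on those platforms is unchanged.
    """
    if platform.machine().lower() in ("aarch64", "arm64"):
        return 1
    return default

print(checkpoint_save_num_threads())
```

On an x86_64 host this returns the multi-threaded default; on an Arm host it returns 1, trading save speed for lower peak CPU memory.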
Issues
List issues that this PR closes:
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information