fix: on GB200 use single-thread checkpoint save to avoid Cpu OOM #1703
base: main
Conversation
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 there was an OOM in the unit test
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@terrykong It is odd; I tried running the test locally on an H100 but it doesn't fail. It is hard to imagine why my changes to MLM would cause this test to fail, because this test only uses VllmGeneration. Anyway, I reduced the gpu_memory_utilization in this test's config in the hope that it will pass in CI.
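For reference, this kind of tweak can be sketched as follows. The key names below are illustrative only; the actual test config lives in the repo's own config files and may be structured differently.

```python
# Sketch of lowering vLLM's GPU memory fraction in a test config so that
# CI GPUs shared with other processes have more headroom. The nesting and
# key names here are hypothetical, not the repo's actual schema; only
# gpu_memory_utilization itself is a real vLLM parameter (a fraction of
# total GPU memory the engine is allowed to reserve).
test_config = {
    "generation": {
        "vllm_cfg": {
            # lowered from a typical 0.9 to leave headroom in CI
            "gpu_memory_utilization": 0.6,
        }
    }
}

print(test_config["generation"]["vllm_cfg"]["gpu_memory_utilization"])
```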
@guyueh1 it might be the CI infra plus some code change. I'm seeing our nightlies are now failing, so I'm trying to bisect on an A100.
@guyueh1 looks like it was a zombie process. @chtruong814 has fixed it, so let's try again after you revert the gpu util change.
Reverted; try the deployment again.
What does this PR do?
This is a temporary fix for the CPU OOM issue we observe when saving large model checkpoints (example: DeepSeek-V3). This PR forces all checkpoint saving to use a single thread when the program runs on an Arm platform, so DGX H100 behavior will not change, but on GB200 you may observe a slowdown. A permanent fix that sets the thread count based on a model-size threshold is in progress.
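The platform gate described above can be sketched as below. This is a minimal illustration, not the repo's actual code: the function name, the default of 8 threads, and the decision logic are all assumptions for this sketch; only the `platform.machine()` call and the Arm machine strings are standard Python behavior.

```python
import platform

def checkpoint_save_num_threads(default: int = 8) -> int:
    """Choose the thread count for checkpoint save (illustrative sketch).

    On Arm hosts (e.g. GB200's Grace CPU, which reports "aarch64"), fall
    back to a single thread to bound peak host memory during the save.
    Keep the multi-threaded default everywhere else (e.g. x86 DGX H100),
    so existing behavior on those platforms is unchanged.
    """
    if platform.machine().lower() in ("aarch64", "arm64"):
        return 1
    return default

print(checkpoint_save_num_threads())
```

On an x86_64 host this returns the multi-threaded default; on an Arm host it returns 1, trading save speed for lower peak CPU memory.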
Issues
List issues that this PR closes:
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information