
Conversation


@hemildesai hemildesai commented Jan 4, 2026

Fixes the end-of-run grad norm check after #1693.

Summary by CodeRabbit

  • Tests
    • Updated metric validation thresholds in model training test suite to reflect revised performance expectations.


Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai hemildesai requested a review from a team as a code owner January 4, 2026 23:36

coderabbitai bot commented Jan 4, 2026

📝 Walkthrough

Test configuration file updated to modify gradient norm metric validation. The check for train/grad_norm at step 50 transitions from a single upper bound (< 2.5) to a bounded range (10.0 ≤ value ≤ 17.5), altering the acceptance criteria for model training validation.

Changes

Cohort / File(s) | Summary
Test metric validation (tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh) | Modified the train/grad_norm check at step 50 from a single upper bound (< 2.5) to a range check (10.0 ≤ value ≤ 17.5), adding a lower bound and raising the upper threshold.
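A range check like the one described above might be sketched as follows; the variable name and the awk-based comparison are illustrative assumptions, not the test suite's actual checker:

```shell
# Hypothetical sketch of the range check on train/grad_norm at step 50.
# The metric value below is a stand-in; the real suite reads it from
# training logs, and its checker may be structured differently.
grad_norm=12.3

# awk handles the floating-point comparison portably in shell
if awk -v v="$grad_norm" 'BEGIN { exit !(v >= 10.0 && v <= 17.5) }'; then
  echo "grad_norm check passed: ${grad_norm} in [10.0, 17.5]"
else
  echo "grad_norm check FAILED: ${grad_norm} outside [10.0, 17.5]" >&2
  exit 1
fi
```

The old single-bound check would have rejected any value at or above 2.5, so a value like 12.3 passes only under the new range.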

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

CI:L1, Run CICD

Suggested reviewers

  • terrykong
  • yuki-97

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'fix: grad norm check for automodel gpt oss nightly' directly and specifically describes the main change, adjusting the grad norm check for the automodel GPT OSS nightly test.
  • Docstring Coverage (✅ Passed): No functions were found in the changed files to evaluate, so the docstring coverage check was skipped.
  • Test Results For Major Changes (✅ Passed): The PR contains minor test metric threshold adjustments for grad_norm validation, directly supporting the prior changes in #1693. No major features or breaking changes are present.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8d6569 and 7795224.

📒 Files selected for processing (1)
  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
🧰 Additional context used
📓 Path-based instructions (4)
**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Use uv run instead of python to execute scripts
Follow the Google Shell Style Guide for shell scripts

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
tests/test_suites/**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

tests/test_suites/**/*.sh: When adding support for a new model, create a corresponding driver shell script under tests/test_suites/ in the matching domain
Driver shell scripts should match the YAML base name with .sh extension and invoke training entrypoint with uv run

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
!(**/tests/**|**/test_*.py|**/test_*.sh)

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Add the NVIDIA copyright header to all Python files and shell scripts (excluding tests). The header should include the current year

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
**/*.{py,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

The NVIDIA copyright header should appear at the top of all Python files and shell scripts (excluding tests)

Files:

  • tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (1)
tests/test_suites/llm/sft-gpt-oss-20b-1n8g-fsdp8ep8-automodel.sh (1)

39-40: Ensure PR description documents empirical justification for gradient norm bounds change.

The gradient norm bounds change (from < 2.5 to [10.0, 17.5]) represents a significant shift in expected training behavior. Per the learnings on convergence-impacting changes, the PR description should include evidence demonstrating that this adjustment is empirically grounded and does not introduce regressions.

Confirm that the PR description includes:

  1. Actual gradient norm values observed during training runs after the fix in #1693 (grad norm calculation for dtensor v2)
  2. Justification for the specific bounds [10.0, 17.5]
  3. Verification that this change does not regress convergence or final metrics
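One simple way to ground such bounds empirically (a sketch over assumed data, not the reviewers' prescribed procedure) is to pad the minimum and maximum grad norms observed across recent runs:

```shell
# Hypothetical sketch: derive check bounds from observed grad norms.
# The observed values are illustrative stand-ins, not real run data.
observed="11.2 13.8 12.5 15.1 14.0"   # grad_norm at step 50 across runs

# Pad the observed min/max by 20% to leave headroom for run-to-run noise
echo "$observed" | awk -v margin=0.2 '
  {
    for (i = 1; i <= NF; i++) {
      if (min == "" || $i < min) min = $i
      if (max == "" || $i > max) max = $i
    }
  }
  END { printf "suggested bounds: [%.2f, %.2f]\n", min * (1 - margin), max * (1 + margin) }
'
# prints: suggested bounds: [8.96, 18.12]
```

Recording the observed values alongside the chosen margin in the PR description would satisfy points 1 and 2 above.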


@hemildesai hemildesai added the CI:L0 Run doctests and unit tests label Jan 4, 2026
@terrykong terrykong enabled auto-merge (squash) January 5, 2026 00:00
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Jan 5, 2026
@terrykong terrykong merged commit 13c3cd6 into main Jan 5, 2026
69 of 74 checks passed
@terrykong terrykong deleted the hemil/fix-gpt-oss-nightly branch January 5, 2026 06:56
chtruong814 pushed a commit that referenced this pull request Jan 5, 2026
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

Labels

CI:L0 Run doctests and unit tests r0.5.0
