Feature/backends/megatron mlflow status #498

mvstrauss · 2026-01-20T07:35:45Z

Summary
Improve MLflow run termination reporting for Megatron training so the MLflow UI reflects the correct final status and a helpful termination reason.

What changed

- Explicitly end MLflow runs with a terminal status (FINISHED / FAILED / KILLED) instead of leaving status implicit.
- Record a termination_reason tag to indicate why the run ended (e.g. clean finish, early exit condition, keyboard interrupt, unknown exception).
- Make termination handling robust across normal completion, early-exit (SystemExit/sys.exit), and exceptions, to avoid runs being left “active” in the MLflow UI.

Notes
MLflow termination updates are emitted only from the designated MLflow rank to avoid multi-rank contention.
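
For readers who want a concrete picture of the helper described above, here is a minimal sketch, not the code from this PR. It assumes the writer object is the `mlflow` module on the designated rank and `None` on every other rank; the real `mlflow` module provides `set_tag(key, value)` and `end_run(status=...)`. The signature, the `RecordingWriter` stand-in, and the reason strings are hypothetical.

```python
from typing import Any, List, Optional, Tuple

# Terminal statuses named in the PR description.
VALID_STATUSES = {"FINISHED", "FAILED", "KILLED"}


def end_mlflow_run(writer: Optional[Any], status: str = "FINISHED",
                   reason: Optional[str] = None) -> bool:
    """Best-effort termination of the active run; True if the run was ended.

    Assumption: `writer` is the `mlflow` module (or any object exposing
    `set_tag` and `end_run`) on the MLflow rank, and None elsewhere.
    """
    if writer is None:
        # Ranks that do not own the MLflow run do nothing.
        return False
    if status not in VALID_STATUSES:
        raise ValueError(f"unexpected MLflow status: {status!r}")
    try:
        if reason is not None:
            writer.set_tag("termination_reason", reason)
        writer.end_run(status=status)
        return True
    except Exception:
        # Teardown is best-effort: a tracking failure should not mask
        # the outcome of training itself.
        return False


class RecordingWriter:
    """Hypothetical stand-in for the mlflow module, for local testing."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def set_tag(self, key: str, value: str) -> None:
        self.calls.append(("set_tag", key, value))

    def end_run(self, status: str = "FINISHED") -> None:
        self.calls.append(("end_run", status))
```

Passing the writer in as an argument keeps the rank check in one place: only the rank that holds a non-`None` writer ever talks to the tracking server.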

Copilot

Pull request overview

This PR improves MLflow run termination handling for Megatron training by ensuring runs are explicitly ended with proper terminal statuses and termination reasons, rather than being left in an active state.

Changes:

- Added end_mlflow_run() helper function to terminate MLflow runs with explicit status (FINISHED/FAILED/KILLED) and termination reason tags
- Wrapped main training logic in try-except-finally blocks to handle normal completion, early exits (SystemExit), keyboard interrupts, and exceptions
- Replaced implicit MLflow termination calls with explicit status-aware termination
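
The try-except-finally structure listed above could look roughly like the following sketch. It is an illustration under assumptions, not the code in `trainer.py`: the function name `run_with_status`, the `report` callback, the reason strings, and the choice to treat a zero or empty exit code as FINISHED are all hypothetical.

```python
from typing import Callable


def run_with_status(train_fn: Callable[[], None],
                    report: Callable[[str, str], None]) -> None:
    """Run train_fn, then report (status, termination_reason) exactly once.

    `report` stands in for a status-aware termination helper; the original
    exception, if any, still propagates to the caller.
    """
    # Default covers any exception not handled explicitly below.
    status, reason = "FAILED", "unknown_exception"
    try:
        train_fn()
        status, reason = "FINISHED", "clean_finish"
    except SystemExit as exc:
        # Assumption: sys.exit() or sys.exit(0) is a deliberate early exit.
        clean = exc.code in (None, 0)
        status = "FINISHED" if clean else "FAILED"
        reason = "early_exit"
        raise
    except KeyboardInterrupt:
        status, reason = "KILLED", "keyboard_interrupt"
        raise
    finally:
        # Runs on every path, so the run is never left active.
        report(status, reason)
```

Putting the report call in `finally` is what guarantees a terminal status on every path, while re-raising keeps the process exit behavior unchanged.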

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| primus/backends/megatron/training/global_vars.py | Added `end_mlflow_run()` helper function to centralize MLflow run termination with explicit status and reason tags |
| primus/modules/trainer/megatron/trainer.py | Added exception handling in `run()` method to capture training outcomes and call `end_mlflow_run()` with appropriate status; updated early-exit path to use new termination helper |

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Mikael Strauss and others added 2 commits December 10, 2025 16:52

mlflow ending status

a0afc82

fix(megatron): ensure mlflow run ends on exit

dc876e4

mvstrauss requested a review from Xiaoming-AMD as a code owner January 20, 2026 07:35

Copilot AI review requested due to automatic review settings January 20, 2026 07:35

mvstrauss requested review from limou102 and wenxie-amd as code owners January 20, 2026 07:35

mvstrauss requested review from lhzhang333 and removed request for Copilot January 20, 2026 07:36

Merge remote-tracking branch 'origin/main' into feature/backends/mega…

b6e178e

…tron-mlflow-status

github-code-quality bot found potential problems Jan 20, 2026

primus/modules/trainer/megatron/trainer.py (fixed)

primus/modules/trainer/megatron/trainer.py (fixed)

github-code-quality bot found potential problems Jan 20, 2026

primus/modules/trainer/megatron/trainer.py (fixed)

primus/modules/trainer/megatron/trainer.py (fixed)

chore(megatron): document best-effort teardown

06045f7

Copilot AI review requested due to automatic review settings January 20, 2026 07:50

Copilot AI reviewed Jan 20, 2026

primus/modules/trainer/megatron/trainer.py (resolved)

primus/modules/trainer/megatron/trainer.py (resolved)

primus/modules/trainer/megatron/trainer.py (resolved)

mvstrauss and others added 2 commits January 20, 2026 08:21

Merge branch 'main' into feature/backends/megatron-mlflow-status

cdb9d94

Merge branch 'main' into feature/backends/megatron-mlflow-status

8609f2d

Copilot AI review requested due to automatic review settings January 29, 2026 08:56

Copilot AI reviewed Jan 29, 2026

primus/modules/trainer/megatron/trainer.py (resolved)

primus/modules/trainer/megatron/trainer.py (resolved)

Merge branch 'main' into feature/backends/megatron-mlflow-status

5a5ea7a
