@mvstrauss

Summary
Improve MLflow run termination reporting for Megatron training so the MLflow UI reflects the correct final status and a helpful termination reason.

What changed

  • Explicitly end MLflow runs with a terminal status (FINISHED / FAILED / KILLED) instead of leaving status implicit.
  • Record a termination_reason tag to indicate why the run ended (e.g. clean finish, early exit condition, keyboard interrupt, unknown exception).
  • Handle termination consistently across normal completion, early exit (SystemExit / sys.exit), and exceptions, so that runs are not left “active” in the MLflow UI.
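
As an illustration only, the outcome handling described in the list above might be sketched as follows. The names `classify_termination` and `run_with_status`, the reason strings, and the choice to treat a zero or missing exit code as FINISHED are assumptions made for this sketch; they are not taken from the PR diff.

```python
from typing import Callable, Optional, Tuple


def classify_termination(exc: Optional[BaseException]) -> Tuple[str, str]:
    """Map how training ended to a (status, termination_reason) pair."""
    if exc is None:
        return "FINISHED", "clean_finish"
    if isinstance(exc, KeyboardInterrupt):
        return "KILLED", "keyboard_interrupt"
    if isinstance(exc, SystemExit):
        # Assumption: sys.exit() or sys.exit(0) is a deliberate early exit,
        # anything else is reported as a failure.
        if exc.code in (0, None):
            return "FINISHED", "early_exit"
        return "FAILED", f"exit_code_{exc.code}"
    return "FAILED", f"exception:{type(exc).__name__}"


def run_with_status(train: Callable[[], None],
                    report: Callable[[str, str], None]) -> None:
    """Run train() and always report a terminal status, then re-raise."""
    caught: Optional[BaseException] = None
    try:
        train()
    except BaseException as exc:  # also catches SystemExit / KeyboardInterrupt
        caught = exc
        raise
    finally:
        # Runs on every path, so the run is never left without a status.
        report(*classify_termination(caught))
```

Catching `BaseException` rather than `Exception` matters here, because `SystemExit` and `KeyboardInterrupt` do not derive from `Exception`; the original exception is re-raised so the process exit behaviour is unchanged.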

Notes
MLflow termination updates are emitted only from the designated MLflow rank to avoid multi-rank contention.
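
A minimal sketch of what a rank-gated termination helper could look like is given here. The signature is hypothetical (the PR's actual `end_mlflow_run()` may differ), and it assumes the writer is `None` on every rank except the designated MLflow rank. The writer only needs `set_tag(key, value)` and `end_run(status=...)`, which is the shape of the `mlflow` module's own functions of those names.

```python
from typing import Any, Optional

TERMINAL_STATUSES = frozenset({"FINISHED", "FAILED", "KILLED"})


def end_mlflow_run(writer: Optional[Any],
                   status: str = "FINISHED",
                   reason: Optional[str] = None) -> bool:
    """End the run with an explicit status; return True if this rank reported.

    Hypothetical signature: `writer` is assumed to be None on ranks that
    must not talk to MLflow, which makes the call a no-op there.
    """
    if status not in TERMINAL_STATUSES:
        raise ValueError(f"unsupported terminal status: {status!r}")
    if writer is None:
        return False  # non-designated rank: nothing to report
    if reason is not None:
        # Tag first, so the reason is recorded before the run is closed.
        writer.set_tag("termination_reason", reason)
    writer.end_run(status=status)
    return True
```

Because every rank can call the helper unconditionally, the call sites stay free of rank checks, and only one process ever writes the final status and tag.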

Copilot AI review requested due to automatic review settings January 20, 2026 07:35
@mvstrauss mvstrauss requested review from lhzhang333 and removed request for Copilot January 20, 2026 07:36
Copilot AI review requested due to automatic review settings January 20, 2026 07:50

Copilot AI left a comment

Pull request overview

This PR improves MLflow run termination handling for Megatron training by ensuring runs are explicitly ended with proper terminal statuses and termination reasons, rather than being left in an active state.

Changes:

  • Added end_mlflow_run() helper function to terminate MLflow runs with explicit status (FINISHED/FAILED/KILLED) and termination reason tags
  • Wrapped main training logic in try-except-finally blocks to handle normal completion, early exits (SystemExit), keyboard interrupts, and exceptions
  • Replaced implicit MLflow termination calls with explicit status-aware termination

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| primus/backends/megatron/training/global_vars.py | Added end_mlflow_run() helper function to centralize MLflow run termination with explicit status and reason tags |
| primus/modules/trainer/megatron/trainer.py | Added exception handling in run() method to capture training outcomes and call end_mlflow_run() with appropriate status; updated early-exit path to use new termination helper |

Copilot AI review requested due to automatic review settings January 29, 2026 08:56

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
