Feature/backends/megatron mlflow status #498
base: main
…tron-mlflow-status
Pull request overview
This PR improves MLflow run termination handling for Megatron training by ensuring runs are explicitly ended with proper terminal statuses and termination reasons, rather than being left in an active state.
Changes:
- Added `end_mlflow_run()` helper function to terminate MLflow runs with explicit status (FINISHED/FAILED/KILLED) and termination reason tags
- Wrapped main training logic in try-except-finally blocks to handle normal completion, early exits (SystemExit), keyboard interrupts, and exceptions
- Replaced implicit MLflow termination calls with explicit status-aware termination
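For readers without the diff open, the pattern described in the list above can be sketched roughly as follows. This is not the PR's code: `end_mlflow_run_sketch`, `run_with_status`, `RecordingWriter`, the `termination_reason` tag key, and the mapping from exit code to status are all assumptions made for illustration. The only MLflow facts relied on are that the `mlflow` module exposes `set_tag(key, value)` and `end_run(status=...)`, and that `FINISHED`, `FAILED`, and `KILLED` are valid terminal run statuses.

```python
from typing import Any, Callable, Dict, Optional

# Terminal statuses named in the PR overview.
TERMINAL_STATUSES = ("FINISHED", "FAILED", "KILLED")


class RecordingWriter:
    """In-memory stand-in for the mlflow module (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.tags: Dict[str, str] = {}
        self.status: Optional[str] = None

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def end_run(self, status: str = "FINISHED") -> None:
        self.status = status


def end_mlflow_run_sketch(writer: Any, status: str, reason: Optional[str] = None) -> None:
    """Hypothetical counterpart of end_mlflow_run(): tag the reason, then end the run."""
    if writer is None:
        # No writer on this process, so there is nothing to terminate.
        return
    if status not in TERMINAL_STATUSES:
        raise ValueError(f"unexpected terminal status: {status}")
    if reason:
        # Tag key is an assumption; the PR may use a different name.
        writer.set_tag("termination_reason", reason)
    writer.end_run(status=status)


def run_with_status(train: Callable[[], None], writer: Any) -> None:
    """Run train() and always close the run with a status matching the outcome."""
    status, reason = "FINISHED", "completed"
    try:
        train()
    except SystemExit as exc:
        # Assumed mapping: a zero or missing exit code counts as a clean early exit.
        if exc.code not in (None, 0):
            status = "FAILED"
        reason = f"early_exit(code={exc.code})"
        raise
    except KeyboardInterrupt:
        status, reason = "KILLED", "keyboard_interrupt"
        raise
    except Exception as exc:
        status, reason = "FAILED", f"exception: {type(exc).__name__}"
        raise
    finally:
        # Runs on every path, so the run is never left in an active state.
        end_mlflow_run_sketch(writer, status, reason)
```

Passing the `mlflow` module itself as `writer` should work with this duck-typed interface, while `RecordingWriter` lets the control flow be exercised without a tracking server. Re-raising in each handler keeps the original exit behavior intact; only the reporting changes.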
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/global_vars.py | Added end_mlflow_run() helper function to centralize MLflow run termination with explicit status and reason tags |
| primus/modules/trainer/megatron/trainer.py | Added exception handling in run() method to capture training outcomes and call end_mlflow_run() with appropriate status; updated early-exit path to use new termination helper |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Summary
Improve MLflow run termination reporting for Megatron training so the MLflow UI reflects the correct final status and a helpful termination reason.
What changed
Notes
MLflow termination updates are emitted only from the designated MLflow rank to avoid multi-rank contention.
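The rank gating mentioned in the note could look something like the sketch below. The function name `end_run_on_designated_rank`, the `designated_rank` parameter, and the `termination_reason` tag key are hypothetical; the PR does not show here how the MLflow rank is chosen, so the example simply takes it as an argument. The `writer` is assumed to expose `set_tag` and `end_run(status=...)`, as the `mlflow` module does.

```python
from typing import Any, Dict, Optional


class RecordingWriter:
    """In-memory stand-in for the mlflow module (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.tags: Dict[str, str] = {}
        self.status: Optional[str] = None

    def set_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

    def end_run(self, status: str = "FINISHED") -> None:
        self.status = status


def end_run_on_designated_rank(
    writer: Any, rank: int, designated_rank: int, status: str, reason: str
) -> bool:
    """End the run only on the designated rank; return True if an update was sent."""
    if rank != designated_rank or writer is None:
        # Every other rank skips the update, so only one process touches the run.
        return False
    writer.set_tag("termination_reason", reason)  # tag key is an assumption
    writer.end_run(status=status)
    return True


# Simulate four ranks sharing one run; only rank 3 is allowed to report.
shared = RecordingWriter()
sent = [end_run_on_designated_rank(shared, r, 3, "FAILED", "exception: RuntimeError")
        for r in range(4)]
```

In this simulation only the last call reports, so the run receives a single terminal update regardless of how many ranks reach the termination path.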