feat: Add MLflow artifact upload for traces and logs #440
base: main
Conversation
- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Pull request overview
This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.
Key changes:
- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Force-pushed from 3c149be to 13dfa81.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
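A minimal, runnable sketch of the failure mode and the fix (the bracketed directory name is illustrative):

```python
import glob
import os
import tempfile

# Reproduce the bug: brackets in the directory name act as a glob character class.
root = tempfile.mkdtemp()
trace_dir = os.path.join(root, "[deepseek_v2_lite-pretrain]-rank[0]")
os.makedirs(trace_dir)
open(os.path.join(trace_dir, "step1.pt.trace.json"), "w").close()

# Unescaped: "[...]" is treated as a character class, so nothing matches.
print(glob.glob(os.path.join(trace_dir, "*.pt.trace.json")))  # []

# Escaped: glob.escape() neutralizes the brackets, so the file is found.
print(glob.glob(os.path.join(glob.escape(trace_dir), "*.pt.trace.json")))
```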
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
examples/run_slurm_pretrain.sh (outdated)
```bash
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
```
Copilot (AI) · Jan 22, 2026
MODEL_NAME falls back to unknown when EXP is unset, but run_local_pretrain.sh provides a default EXP. This can lead to confusing output directories (e.g., unknown_<ts>) for users relying on defaults. Consider defaulting EXP here as well (or deriving MODEL_NAME after applying the same default).
Suggested change:

```bash
# Set a default EXP if not provided, to align with run_local_pretrain.sh and avoid 'unknown_<ts>' names
if [[ -z "${EXP:-}" ]]; then
    export EXP="${SCRIPT_DIR}/megatron/exp_pretrain.yaml"
fi
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP}" .yaml)
```
examples/run_local_pretrain.sh (outdated)
```bash
--env PRIMUS_WORKSPACE \
--env PRIMUS_EXP_NAME \
--env TIMESTAMP \
--env LOG_DIR \
--env PRIMUS_TEAM \
--env PRIMUS_USER \
```
Copilot (AI) · Jan 22, 2026
ENV_ARGS already forwards all PRIMUS_ variables into the container (env | grep "^PRIMUS_"), so explicitly passing --env PRIMUS_WORKSPACE/PRIMUS_EXP_NAME/PRIMUS_TEAM/PRIMUS_USER again is redundant and can be confusing to maintain. Prefer relying on the PRIMUS_ pass-through and keep explicit --env only for non-PRIMUS variables like TIMESTAMP/LOG_DIR.
Suggested change:

```bash
--env TIMESTAMP \
--env LOG_DIR \
```
```python
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0
```
Copilot (AI) · Jan 22, 2026
mlflow_artifacts.py logs via log_rank_0/warning_rank_0, but MLflow is initialized on rank world_size - 1 (see global_vars._set_mlflow_writer), so these messages (including upload failures) will be suppressed in typical distributed runs. Use a rank filter that matches the MLflow rank (e.g., log_rank_last), or add/route warnings to a warning_rank_last/log_rank_all path so upload failures are visible.
Suggested change:

```python
from primus.modules.module_utils import log_rank_last as log_rank_0, warning_rank_last as warning_rank_0
```
```python
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
```
Copilot (AI) · Jan 22, 2026
Artifact upload behavior is new but currently has no unit tests. Consider adding tests that create a temp tensorboard_dir/exp_root_path with sample *.pt.trace.json(.gz) and *.log files and verify upload_artifacts_to_mlflow() calls mlflow_writer.log_artifact with the expected artifact_path subdirectories.
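A sketch of such a test, assuming pytest's tmp_path fixture, a MagicMock standing in for the mlflow module, and log_rank_0 being safe to call outside a distributed run:

```python
from unittest.mock import MagicMock

from primus.backends.megatron.training.mlflow_artifacts import upload_artifacts_to_mlflow


def test_upload_artifacts_to_mlflow(tmp_path):
    # Lay out sample trace and log files in the documented locations.
    tb_dir = tmp_path / "tensorboard"
    tb_dir.mkdir()
    (tb_dir / "worker0.pt.trace.json").touch()
    (tb_dir / "worker1.pt.trace.json.gz").touch()

    exp_root = tmp_path / "exp"
    (exp_root / "logs" / "master").mkdir(parents=True)
    (exp_root / "logs" / "master" / "master-0.log").touch()

    writer = MagicMock()  # stands in for the mlflow module
    result = upload_artifacts_to_mlflow(
        writer, tensorboard_dir=str(tb_dir), exp_root_path=str(exp_root)
    )

    assert result == {"traces": 2, "logs": 1}
    # Every upload should go through log_artifact with a traces/ or logs/ subpath.
    for call in writer.log_artifact.call_args_list:
        assert call.kwargs["artifact_path"].startswith(("traces", "logs"))
```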
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
examples/run_slurm_pretrain.sh:78
- The LOG_FILE variable is not exported but is referenced in the srun command. Since LOG_FILE is defined on line 53 but not exported, when the bash command on line 78 tries to use it with 'tee ${LOG_FILE}', the variable will be empty or undefined on the remote nodes. This will cause the tee command to fail or write to an unexpected location. Either export LOG_FILE after defining it (add 'export LOG_FILE' on line 54), or use the full path expansion within the command string (change to 'tee ${LOG_DIR}/log_slurm_pretrain.txt').
```bash
LOG_FILE="${LOG_DIR}/log_slurm_pretrain.txt"
mkdir -p "$LOG_DIR"

srun -N "${NNODES}" \
    --exclusive \
    --export ALL \
    --ntasks-per-node=1 \
    --cpus-per-task="${CPUS_PER_TASK:-128}" \
    bash -c "
        readarray -t node_array < <(scontrol show hostnames \"\$SLURM_JOB_NODELIST\")

        if [ \"\$SLURM_NODEID\" = \"0\" ]; then
            echo \"========== Slurm cluster info ==========\"
            echo \"SLURM_NODELIST: \${node_array[*]}\"
            echo \"SLURM_NNODES: \${SLURM_NNODES}\"
            echo \"SLURM_GPUS_ON_NODE: \${SLURM_GPUS_ON_NODE}\"
            echo \"\"
        fi

        # Log TIMESTAMP on each node to verify consistency across nodes
        echo \"[Node \$SLURM_NODEID] TIMESTAMP=\${TIMESTAMP}\"

        export MASTER_ADDR=\${node_array[0]}
        export MASTER_PORT=\${MASTER_PORT}
        export NNODES=\${SLURM_NNODES}
        export NODE_RANK=\${SLURM_PROCID}
        export GPUS_PER_NODE=\${SLURM_GPUS_ON_NODE}
        export REBUILD_PRIMUS_TURBO=\${REBUILD_PRIMUS_TURBO}

        bash ${SCRIPT_DIR}/run_local_pretrain.sh \"\$@\" 2>&1 | tee ${LOG_FILE}
```
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
```yaml
# NOTE: When disable_mlflow=false, traces and logs are uploaded by default.
# Set these to false if you only want metrics/params logged to MLflow.
mlflow_upload_traces: true   # Upload profiler trace files to MLflow
mlflow_upload_logs: true     # Upload training log files to MLflow
```
Copilot (AI) · Feb 2, 2026
The config options default to True when MLflow is enabled. This means traces and logs will be uploaded automatically even if users don't explicitly configure these options. While the comment in the YAML explains this behavior, users who are unaware might experience unexpected uploads of potentially large trace/log files, which could impact performance or storage costs in cloud environments.
Consider changing the default to False for a more conservative approach, or ensure that the documentation clearly highlights this behavior and its implications (especially for trace files which can be large).
Suggested change:

```yaml
# NOTE: When disable_mlflow=false, traces and logs are NOT uploaded by default.
# Set these to true if you also want traces/logs (which can be large) logged to MLflow.
mlflow_upload_traces: false  # Upload profiler trace files to MLflow
mlflow_upload_logs: false    # Upload training log files to MLflow
```
```python
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```
Copilot (AI) · Feb 2, 2026
Same synchronization issue as the first upload_mlflow_artifacts call: there's no barrier to ensure all ranks have finished writing their files before upload begins. This could lead to incomplete or corrupted uploads in distributed training scenarios.
Consider adding a torch.distributed.barrier() before upload_mlflow_artifacts() to ensure all ranks have completed their file I/O operations.
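A sketch of the suggested ordering in the trainer, assuming torch.distributed is initialized at this point:

```python
import torch.distributed as dist

# Ensure every rank has flushed its trace/log files before the MLflow
# rank starts collecting them from the (shared) filesystem.
if dist.is_available() and dist.is_initialized():
    dist.barrier()

upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```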
```python
    uploaded_count = 0
    for trace_file in trace_files:
        try:
            # Get relative path from tensorboard_dir for artifact organization
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            # Determine artifact subdirectory based on file location
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
    return uploaded_count
```
Copilot (AI) · Feb 2, 2026
The upload process iterates through all trace and log files synchronously, uploading them one by one. For large-scale training runs, this could result in a significant number of files (one trace file per profiled rank, multiple log files per rank) and potentially long upload times that block the training completion.
Consider:
- Adding progress logging with a counter (e.g., "Uploaded 5/100 trace files"); see the sketch after this list
- Implementing batch uploads if the MLflow API supports it
- Adding a timeout or size limit configuration option
- Warning users about potential upload times if many/large files are detected
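A minimal sketch of the first suggestion, written as a drop-in replacement for the upload loop shown above (all names come from the existing code):

```python
    total = len(trace_files)
    uploaded_count = 0
    for i, trace_file in enumerate(trace_files, start=1):
        try:
            # artifact_subpath computed exactly as in the existing loop
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )
            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            # Periodic progress so long uploads are visible in the logs.
            if i % 10 == 0 or i == total:
                log_rank_0(f"[MLflow] Uploaded {i}/{total} trace files")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")
```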
```python
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result
```
Copilot (AI) · Feb 2, 2026
In multi-node distributed training, only the last rank (world_size - 1) calls upload_mlflow_artifacts to upload files. However, profiler trace files and log files from other ranks may be located on different node-local filesystems if shared storage is not used. The code assumes all files are accessible from the last rank's filesystem, which may not be true in multi-node scenarios without a shared filesystem.
Consider one of the following approaches:
- Add documentation explaining that shared storage (e.g., NFS) is required for multi-node artifact uploads
- Implement a mechanism to collect files from all nodes (e.g., using distributed file gathering)
- Add a check to warn users if files are expected but not found, which could indicate a shared storage issue (a sketch follows below)
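A sketch of that third option, assuming torch.distributed exposes the world size at upload time and warning_rank_0 is imported as in mlflow_artifacts.py:

```python
import torch.distributed as dist

from primus.modules.module_utils import warning_rank_0


def _warn_if_traces_missing(trace_files: list, tensorboard_dir: str) -> None:
    # In multi-node runs without shared storage, only the local node's files
    # are visible to the uploading rank; an empty result is a likely symptom.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1 and not trace_files:
        warning_rank_0(
            f"[MLflow] No trace files found under {tensorboard_dir} in a "
            f"{world_size}-rank run; if profiling was enabled, verify that "
            "the output directory is on shared storage (e.g., NFS)."
        )
```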
primus/backends/megatron/training/mlflow_artifacts.py

```python
###############################################################################
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################

"""
MLflow Artifact Logging Utilities

This module provides functions to upload trace files and log files to MLflow
when MLflow tracking is enabled.

Features:
- Upload profiler trace files from all profiled ranks (including multi-node)
- Upload log files from all levels and all ranks
- Supports both local and distributed training scenarios
"""

import glob
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0


def _get_all_trace_files(tensorboard_dir: str) -> list:
    """
    Find all profiler trace files in the tensorboard directory.

    Trace files are typically named like:
    - *.pt.trace.json
    - *.pt.trace.json.gz

    Args:
        tensorboard_dir: Path to the tensorboard directory containing trace files

    Returns:
        List of paths to trace files
    """
    if not tensorboard_dir or not os.path.exists(tensorboard_dir):
        return []

    trace_files = []
    # Look for PyTorch profiler trace files (both compressed and uncompressed)
    patterns = ["*.pt.trace.json", "*.pt.trace.json.gz"]
    # Escape directory path to handle special characters like [] in experiment names
    escaped_dir = glob.escape(tensorboard_dir)
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
        trace_files.extend(glob.glob(os.path.join(escaped_dir, "**", pattern), recursive=True))

    # Remove duplicates while preserving order
    seen = set()
    unique_files = []
    for f in trace_files:
        if f not in seen:
            seen.add(f)
            unique_files.append(f)

    return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
    """
    Find all log files in the experiment logs directory.

    Log files are organized as:
    - {exp_root_path}/logs/master/master-*.log
    - {exp_root_path}/logs/{module_name}/rank-{rank}/*.log

    Args:
        exp_root_path: Root path of the experiment

    Returns:
        List of paths to log files
    """
    if not exp_root_path:
        return []

    logs_dir = os.path.join(exp_root_path, "logs")
    if not os.path.exists(logs_dir):
        return []

    log_files = []
    # Find all .log files recursively (escape path to handle special characters)
    log_files.extend(glob.glob(os.path.join(glob.escape(logs_dir), "**", "*.log"), recursive=True))

    return log_files


def upload_trace_files_to_mlflow(
    mlflow_writer,
    tensorboard_dir: str,
    artifact_path: str = "traces",
) -> int:
    """
    Upload all profiler trace files to MLflow as artifacts.

    This function collects trace files from the tensorboard directory and
    uploads them to MLflow. In distributed settings, only rank 0 (or the
    last rank where MLflow writer is initialized) should call this.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        artifact_path: MLflow artifact subdirectory for trace files

    Returns:
        Number of trace files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_rank_0(f"[MLflow] Searching for trace files in: {tensorboard_dir}")
    trace_files = _get_all_trace_files(tensorboard_dir)
    if len(trace_files) > 5:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files[:5]}...")
    else:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files}")

    if not trace_files:
        log_rank_0("[MLflow] No trace files found to upload")
        return 0

    uploaded_count = 0
    for trace_file in trace_files:
        try:
            # Get relative path from tensorboard_dir for artifact organization
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            # Determine artifact subdirectory based on file location
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
    return uploaded_count


def upload_log_files_to_mlflow(
    mlflow_writer,
    exp_root_path: str,
    artifact_path: str = "logs",
) -> int:
    """
    Upload all log files to MLflow as artifacts.

    This function collects log files from all ranks and all log levels
    and uploads them to MLflow. The directory structure is preserved
    in the artifact path.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        exp_root_path: Root path of the experiment
        artifact_path: MLflow artifact subdirectory for log files

    Returns:
        Number of log files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_files = _get_all_log_files(exp_root_path)

    if not log_files:
        log_rank_0("[MLflow] No log files found to upload")
        return 0

    logs_base_dir = os.path.join(exp_root_path, "logs")
    uploaded_count = 0

    for log_file in log_files:
        try:
            # Preserve directory structure relative to logs base directory
            rel_path = os.path.relpath(log_file, logs_base_dir)
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
            uploaded_count += 1
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload log file {log_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} log files to '{artifact_path}'")
    return uploaded_count


def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result
```
Copilot (AI) · Feb 2, 2026
The new mlflow_artifacts.py module lacks unit tests. Given that the repository has comprehensive test coverage for other megatron backend modules (as seen in tests/unit_tests/backends/megatron/), this module should also have tests to cover:
- File discovery logic (_get_all_trace_files, _get_all_log_files)
- Upload functions with various scenarios (no files, multiple files, error handling)
- Glob escaping for special characters
- Relative path handling
Tests would help ensure reliability, especially for edge cases like special characters in paths or missing directories.
Move MLflow artifact upload functions from global_vars.py to new mlflow_setup.py to reduce merge conflicts:
- set_exp_root_path()
- get_exp_root_path()
- upload_mlflow_artifacts()

global_vars.py now matches main, avoiding future conflicts when merging from the main branch.
Keep run_pretrain.sh and run_slurm_pretrain.sh as on main. Experiment paths can be configured via environment variables:
- PRIMUS_TEAM, PRIMUS_USER, PRIMUS_EXP_NAME, PRIMUS_WORKSPACE
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
```python
    # Escape directory path to handle special characters like [] in experiment names
    escaped_dir = glob.escape(tensorboard_dir)
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
```
Copilot (AI) · Feb 2, 2026
The recursive glob pattern on line 50 could potentially be slow or resource-intensive if the tensorboard_dir contains a very deep directory structure or a large number of files. Consider adding a comment about potential performance implications, or optionally limiting the recursion depth if this becomes a concern in practice.
Suggested change:

```python
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
        # Note: This recursive glob walks the entire tensorboard_dir tree, which may be
        # expensive if the directory is very large or deeply nested. If this becomes
        # a bottleneck in practice, consider constraining tensorboard_dir or introducing
        # a limit on recursion depth.
```
```python
    return log_files
```
Copilot (AI) · Feb 2, 2026
The recursive glob pattern for log files (line 86) will follow symbolic links by default in Python's glob.glob(). If there are symlinks in the logs directory that point outside the intended log directory, this could potentially upload files from unintended locations. Consider using glob.glob(..., recursive=True) with an additional check using os.path.realpath() to ensure files are within the expected directory, or document this behavior if it's intentional.
Suggested change:

```python
    # Prevent symlinks inside logs_dir from escaping the intended directory
    logs_dir_real = os.path.realpath(logs_dir)
    filtered_log_files = []
    for path in log_files:
        real_path = os.path.realpath(path)
        try:
            common = os.path.commonpath([logs_dir_real, real_path])
        except ValueError:
            # On different drives or invalid paths; treat as outside logs_dir
            common = None
        if common == logs_dir_real:
            filtered_log_files.append(path)
        else:
            warning_rank_0(f"Skipping log file outside logs directory: {path}")
    return filtered_log_files
```
```python
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.
```
Copilot (AI) · Feb 2, 2026
The copyright header states "Copyright (c) 2022, NVIDIA CORPORATION" but this is a newly created file in 2025. Since this file contains entirely new AMD code (as indicated by the PR), consider updating line 2 to reflect only AMD copyright, similar to mlflow_artifacts.py which correctly uses "Copyright (c) 2025, Advanced Micro Devices, Inc."
Suggested change:

```python
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
```
```python
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```
Copilot (AI) · Feb 2, 2026
The duplicate code pattern for uploading MLflow artifacts appears in two locations (lines 1130-1134 and 1580-1583). Consider extracting this into a helper function to ensure consistency and maintainability. For example, create a function like finalize_mlflow_run() that handles both artifact upload and run ending.
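A sketch of the suggested helper; upload_mlflow_artifacts and get_mlflow_writer come from this PR, while the exact end-of-run call is an assumption (mlflow.end_run() is the standard fluent API):

```python
def finalize_mlflow_run(args) -> None:
    """Upload artifacts and end the MLflow run; safe to call from any exit path."""
    mlflow_writer = get_mlflow_writer()
    if mlflow_writer is None:
        return
    upload_mlflow_artifacts(
        upload_traces=getattr(args, "mlflow_upload_traces", True),
        upload_logs=getattr(args, "mlflow_upload_logs", True),
    )
    # Assumption: mirrors how the trainer currently ends the run.
    mlflow_writer.end_run()
```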
feat: Add MLflow artifact upload for traces and logs
Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.
Features
- Upload profiler trace files to MLflow under artifacts/traces/
- Upload training log files to MLflow under artifacts/logs/

Config Options

```yaml
mlflow_upload_traces: true  # Upload profiler trace files to MLflow
mlflow_upload_logs: true    # Upload training log files to MLflow
```

Files Changed
- primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
- primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
- primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
- primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
- examples/run_pretrain.sh - Add timestamp-based output directories
- examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
- examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage
When MLflow is enabled, artifacts are automatically uploaded at the end of training:
- tensorboard_dir → MLflow artifacts/traces/
- exp_root_path/logs/ → MLflow artifacts/logs/
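For reference, a hedged sketch of calling the new module directly, outside the trainer; the paths are placeholders:

```python
import mlflow

from primus.backends.megatron.training.mlflow_artifacts import upload_artifacts_to_mlflow

mlflow.start_run()
counts = upload_artifacts_to_mlflow(
    mlflow_writer=mlflow,                      # the mlflow module itself acts as the writer
    tensorboard_dir="output/exp/tensorboard",  # placeholder path
    exp_root_path="output/exp",                # placeholder path
)
print(counts)  # e.g. {"traces": 2, "logs": 5}
mlflow.end_run()
```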