
Conversation


gphuang commented on Dec 18, 2025

feat: Add MLflow artifact upload for traces and logs

Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.

Features

  • Upload PyTorch profiler trace files to MLflow artifacts/traces/
  • Upload training log files to MLflow artifacts/logs/
  • Unique timestamp-based output directories for multi-node consistency
  • Pass MLflow environment variables through Docker container

Config Options

mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow

Files Changed

  • primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
  • primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
  • primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
  • primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
  • examples/run_pretrain.sh - Add timestamp-based output directories
  • examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
  • examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage

When MLflow is enabled, artifacts are automatically uploaded at the end of training:

  • Trace files from tensorboard_dir → MLflow artifacts/traces/
  • Log files from exp_root_path/logs/ → MLflow artifacts/logs/

- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Copilot AI review requested due to automatic review settings December 18, 2025 09:10
Copilot AI left a comment

Pull request overview

This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.

Key changes:

  • New artifact upload module with functions to collect and upload trace/log files to MLflow
  • Integration of artifact uploads before MLflow run completion in the trainer
  • Configuration options to control trace and log uploads (defaulting to enabled)
  • Shell script improvements for timestamp-based output directories with multi-node consistency

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.

Summary per file:

  • primus/backends/megatron/training/mlflow_artifacts.py - New module implementing trace/log file discovery and MLflow artifact upload functionality
  • primus/backends/megatron/training/global_vars.py - Adds a global variable for exp_root_path and a wrapper function for artifact uploads
  • primus/modules/trainer/megatron/trainer.py - Integrates artifact upload calls before MLflow run termination in two exit paths
  • primus/configs/modules/megatron/primus_megatron_module.yaml - Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true)
  • examples/run_slurm_pretrain.sh - Implements timestamp-based output directory naming and exports the timestamp for multi-node consistency
  • examples/run_pretrain.sh - Adds conditional timestamp generation to support both single-node and multi-node scenarios; fixes a typo in a log message
  • examples/run_local_pretrain.sh - Adds MLflow environment variables and Primus path variables to the Docker container environment


Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.




Copilot AI commented Dec 18, 2025

@gphuang I've opened a new pull request, #441, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI review requested due to automatic review settings December 18, 2025 10:30
gphuang force-pushed the feat/6-enable-mlflow-uploading branch from 3c149be to 13dfa81 on December 18, 2025 10:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.



Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 18, 2025 10:37
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.



gphuang and others added 2 commits December 18, 2025 15:15
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0]
which are interpreted as glob pattern character classes, causing glob.glob to
return empty results even though files exist.

Fixed by using glob.escape() on directory paths before using them with glob.glob().
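
For reference, the failure mode and the fix can be reproduced standalone; a minimal sketch (the bracketed directory name below is illustrative, not the repo's actual naming):

import glob
import os
import tempfile

# A directory whose name contains brackets, as the experiment names here do.
root = tempfile.mkdtemp()
run_dir = os.path.join(root, "[deepseek]-rank[0]")
os.makedirs(run_dir)
open(os.path.join(run_dir, "step1.pt.trace.json"), "w").close()

# Unescaped: "[deepseek]" is a character class matching one of d/e/p/s/k,
# so the pattern never matches the literal directory name.
print(glob.glob(os.path.join(run_dir, "*.pt.trace.json")))  # -> []

# Escaped: glob.escape() neutralizes the brackets, and the file is found.
print(glob.glob(os.path.join(glob.escape(run_dir), "*.pt.trace.json")))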
Copilot AI review requested due to automatic review settings December 19, 2025 08:26
gphuang marked this pull request as ready for review on December 19, 2025 08:26
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.



Copilot AI review requested due to automatic review settings January 15, 2026 10:24
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.



Copilot AI review requested due to automatic review settings January 19, 2026 07:51
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Copilot AI review requested due to automatic review settings January 22, 2026 08:20
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Comment on lines 38 to 39
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)

Copilot AI Jan 22, 2026


MODEL_NAME falls back to unknown when EXP is unset, but run_local_pretrain.sh provides a default EXP. This can lead to confusing output directories (e.g., unknown_<ts>) for users relying on defaults. Consider defaulting EXP here as well (or deriving MODEL_NAME after applying the same default).

Suggested change
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
# Set a default EXP if not provided, to align with run_local_pretrain.sh and avoid 'unknown_<ts>' names
if [[ -z "${EXP:-}" ]]; then
    export EXP="${SCRIPT_DIR}/megatron/exp_pretrain.yaml"
fi
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP}" .yaml)

Comment on lines 173 to 178
--env PRIMUS_WORKSPACE \
--env PRIMUS_EXP_NAME \
--env TIMESTAMP \
--env LOG_DIR \
--env PRIMUS_TEAM \
--env PRIMUS_USER \

Copilot AI Jan 22, 2026


ENV_ARGS already forwards all PRIMUS_ variables into the container (env | grep "^PRIMUS_"), so explicitly passing --env PRIMUS_WORKSPACE/PRIMUS_EXP_NAME/PRIMUS_TEAM/PRIMUS_USER again is redundant and can be confusing to maintain. Prefer relying on the PRIMUS_ pass-through and keep explicit --env only for non-PRIMUS variables like TIMESTAMP/LOG_DIR.

Suggested change
--env PRIMUS_WORKSPACE \
--env PRIMUS_EXP_NAME \
--env TIMESTAMP \
--env LOG_DIR \
--env PRIMUS_TEAM \
--env PRIMUS_USER \
--env TIMESTAMP \
--env LOG_DIR \

import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0

Copilot AI Jan 22, 2026


mlflow_artifacts.py logs via log_rank_0/warning_rank_0, but MLflow is initialized on rank world_size - 1 (see global_vars._set_mlflow_writer), so these messages (including upload failures) will be suppressed in typical distributed runs. Use a rank filter that matches the MLflow rank (e.g., log_rank_last), or add/route warnings to a warning_rank_last/log_rank_all path so upload failures are visible.

Suggested change
from primus.modules.module_utils import log_rank_0, warning_rank_0
from primus.modules.module_utils import log_rank_last as log_rank_0, warning_rank_last as warning_rank_0

Comment on lines +198 to +205
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """

Copilot AI Jan 22, 2026


Artifact upload behavior is new but currently has no unit tests. Consider adding tests that create a temp tensorboard_dir/exp_root_path with sample *.pt.trace.json(.gz) and *.log files and verify upload_artifacts_to_mlflow() calls mlflow_writer.log_artifact with the expected artifact_path subdirectories.
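
One possible shape for such a test, sketched with a stub writer and pytest's tmp_path fixture (the import path follows the PR's file list; whether the module's rank-aware loggers work outside a distributed run is an assumption):

# Hypothetical unit-test sketch; module path assumed from the PR's file list.
from primus.backends.megatron.training.mlflow_artifacts import upload_artifacts_to_mlflow


class StubMlflowWriter:
    """Records log_artifact calls instead of talking to a real MLflow server."""

    def __init__(self):
        self.calls = []

    def log_artifact(self, local_path, artifact_path=None):
        self.calls.append((local_path, artifact_path))


def test_upload_routes_traces_and_logs(tmp_path):
    tb_dir = tmp_path / "tensorboard"
    tb_dir.mkdir()
    (tb_dir / "rank0.pt.trace.json").write_text("{}")

    logs_dir = tmp_path / "logs" / "master"
    logs_dir.mkdir(parents=True)
    (logs_dir / "master-0.log").write_text("hello")

    writer = StubMlflowWriter()
    result = upload_artifacts_to_mlflow(
        writer, tensorboard_dir=str(tb_dir), exp_root_path=str(tmp_path)
    )

    assert result == {"traces": 1, "logs": 1}
    artifact_paths = {path for _, path in writer.calls}
    assert "traces" in artifact_paths  # trace file lands under traces/
    assert any(p.startswith("logs") for p in artifact_paths)  # log lands under logs/...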

Copilot AI review requested due to automatic review settings January 26, 2026 09:42
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

examples/run_slurm_pretrain.sh:78

  • The LOG_FILE variable is not exported but is referenced in the srun command. Since LOG_FILE is defined on line 53 but not exported, when the bash command on line 78 tries to use it with 'tee ${LOG_FILE}', the variable will be empty or undefined on the remote nodes. This will cause the tee command to fail or write to an unexpected location. Either export LOG_FILE after defining it (add 'export LOG_FILE' on line 54), or use the full path expansion within the command string (change to 'tee ${LOG_DIR}/log_slurm_pretrain.txt').
LOG_FILE="${LOG_DIR}/log_slurm_pretrain.txt"
mkdir -p "$LOG_DIR"

srun -N "${NNODES}" \
     --exclusive \
     --export ALL \
     --ntasks-per-node=1 \
     --cpus-per-task="${CPUS_PER_TASK:-128}" \
     bash -c "
          readarray -t node_array < <(scontrol show hostnames \"\$SLURM_JOB_NODELIST\")
          if [ \"\$SLURM_NODEID\" = \"0\" ]; then
              echo \"========== Slurm cluster info ==========\"
              echo \"SLURM_NODELIST: \${node_array[*]}\"
              echo \"SLURM_NNODES: \${SLURM_NNODES}\"
              echo \"SLURM_GPUS_ON_NODE: \${SLURM_GPUS_ON_NODE}\"
              echo \"\"
          fi
          # Log TIMESTAMP on each node to verify consistency across nodes
          echo \"[Node \$SLURM_NODEID] TIMESTAMP=\${TIMESTAMP}\"
          export MASTER_ADDR=\${node_array[0]}
          export MASTER_PORT=\${MASTER_PORT}
          export NNODES=\${SLURM_NNODES}
          export NODE_RANK=\${SLURM_PROCID}
          export GPUS_PER_NODE=\${SLURM_GPUS_ON_NODE}
          export REBUILD_PRIMUS_TURBO=\${REBUILD_PRIMUS_TURBO}
          bash ${SCRIPT_DIR}/run_local_pretrain.sh \"\$@\" 2>&1 | tee ${LOG_FILE}

Copilot AI review requested due to automatic review settings February 2, 2026 12:27
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Comment on lines +8 to +11
# NOTE: When disable_mlflow=false, traces and logs are uploaded by default.
# Set these to false if you only want metrics/params logged to MLflow.
mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow

Copilot AI Feb 2, 2026


The config options default to True when MLflow is enabled. This means traces and logs will be uploaded automatically even if users don't explicitly configure these options. While the comment in the YAML explains this behavior, users who are unaware might experience unexpected uploads of potentially large trace/log files, which could impact performance or storage costs in cloud environments.

Consider changing the default to False for a more conservative approach, or ensure that the documentation clearly highlights this behavior and its implications (especially for trace files which can be large).

Suggested change
# NOTE: When disable_mlflow=false, traces and logs are uploaded by default.
# Set these to false if you only want metrics/params logged to MLflow.
mlflow_upload_traces: true # Upload profiler trace files to MLflow
mlflow_upload_logs: true # Upload training log files to MLflow
# NOTE: When disable_mlflow=false, traces and logs are NOT uploaded by default.
# Set these to true if you also want traces/logs (which can be large) logged to MLflow.
mlflow_upload_traces: false # Upload profiler trace files to MLflow
mlflow_upload_logs: false # Upload training log files to MLflow

Comment on lines +1577 to +1581
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)

Copilot AI Feb 2, 2026


Same synchronization issue as the first upload_mlflow_artifacts call: there's no barrier to ensure all ranks have finished writing their files before upload begins. This could lead to incomplete or corrupted uploads in distributed training scenarios.

Consider adding a torch.distributed.barrier() before upload_mlflow_artifacts() to ensure all ranks have completed their file I/O operations.
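
A minimal sketch of that ordering, assuming torch.distributed is initialized at this point in the trainer:

import torch.distributed as dist

# Make sure every rank has flushed its trace/log files to disk
# before the MLflow rank starts scanning and uploading them.
if dist.is_available() and dist.is_initialized():
    dist.barrier()

upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)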

Comment on lines +125 to +144
uploaded_count = 0
for trace_file in trace_files:
    try:
        # Get relative path from tensorboard_dir for artifact organization
        rel_path = os.path.relpath(trace_file, tensorboard_dir)
        # Determine artifact subdirectory based on file location
        artifact_subpath = (
            os.path.join(artifact_path, os.path.dirname(rel_path))
            if os.path.dirname(rel_path)
            else artifact_path
        )

        mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
        uploaded_count += 1
        log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
    except Exception as e:
        warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
return uploaded_count

Copilot AI Feb 2, 2026


The upload process iterates through all trace and log files synchronously, uploading them one by one. For large-scale training runs, this could mean a significant number of files (one trace file per profiled rank, multiple log files per rank) and potentially long upload times that block training completion.

Consider:

  1. Adding progress logging with a counter (e.g., "Uploaded 5/100 trace files"); see the sketch after this list
  2. Implementing batch uploads if the MLflow API supports it
  3. Adding a timeout or size limit configuration option
  4. Warning users about potential upload times if many/large files are detected
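
A sketch of the first suggestion, assuming the loop keeps its current shape (the per-file subdirectory logic is elided here, and the reporting interval of 10 is arbitrary):

uploaded_count = 0
total = len(trace_files)
for i, trace_file in enumerate(trace_files, start=1):
    try:
        mlflow_writer.log_artifact(trace_file, artifact_path=artifact_path)
        uploaded_count += 1
    except Exception as e:
        warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")
    # Periodic progress line so long uploads remain visible in the logs.
    if i % 10 == 0 or i == total:
        log_rank_0(f"[MLflow] Uploaded {uploaded_count}/{total} trace files")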

Comment on lines +198 to +248
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result

Copilot AI Feb 2, 2026


In multi-node distributed training, only the last rank (world_size - 1) calls upload_mlflow_artifacts to upload files. However, profiler trace files and log files from other ranks may be located on different node-local filesystems if shared storage is not used. The code assumes all files are accessible from the last rank's filesystem, which may not be true in multi-node scenarios without a shared filesystem.

Consider one of the following approaches:

  1. Add documentation explaining that shared storage (e.g., NFS) is required for multi-node artifact uploads
  2. Implement a mechanism to collect files from all nodes (e.g., using distributed file gathering)
  3. Add a check to warn users if files are expected but not found, which could indicate a shared storage issue (see the sketch after this list)
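
The third option could be a small guard before the upload, sketched here (the args.profile flag name is an assumption):

trace_files = _get_all_trace_files(tensorboard_dir)
if not trace_files and getattr(args, "profile", False):
    # Profiling ran, yet this rank sees no trace files: on multi-node setups
    # without shared storage, other nodes' traces are simply not visible here.
    warning_rank_0(
        "[MLflow] Profiling was enabled but no trace files are visible from "
        "this rank; multi-node runs need shared storage (e.g., NFS) for "
        "artifact upload to collect files from all nodes."
    )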

Comment on lines +1 to +248
###############################################################################
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################

"""
MLflow Artifact Logging Utilities

This module provides functions to upload trace files and log files to MLflow
when MLflow tracking is enabled.

Features:
- Upload profiler trace files from all profiled ranks (including multi-node)
- Upload log files from all levels and all ranks
- Supports both local and distributed training scenarios
"""

import glob
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0


def _get_all_trace_files(tensorboard_dir: str) -> list:
    """
    Find all profiler trace files in the tensorboard directory.

    Trace files are typically named like:
    - *.pt.trace.json
    - *.pt.trace.json.gz

    Args:
        tensorboard_dir: Path to the tensorboard directory containing trace files

    Returns:
        List of paths to trace files
    """
    if not tensorboard_dir or not os.path.exists(tensorboard_dir):
        return []

    trace_files = []
    # Look for PyTorch profiler trace files (both compressed and uncompressed)
    patterns = ["*.pt.trace.json", "*.pt.trace.json.gz"]
    # Escape directory path to handle special characters like [] in experiment names
    escaped_dir = glob.escape(tensorboard_dir)
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
        trace_files.extend(glob.glob(os.path.join(escaped_dir, "**", pattern), recursive=True))

    # Remove duplicates while preserving order
    seen = set()
    unique_files = []
    for f in trace_files:
        if f not in seen:
            seen.add(f)
            unique_files.append(f)

    return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
    """
    Find all log files in the experiment logs directory.

    Log files are organized as:
    - {exp_root_path}/logs/master/master-*.log
    - {exp_root_path}/logs/{module_name}/rank-{rank}/*.log

    Args:
        exp_root_path: Root path of the experiment

    Returns:
        List of paths to log files
    """
    if not exp_root_path:
        return []

    logs_dir = os.path.join(exp_root_path, "logs")
    if not os.path.exists(logs_dir):
        return []

    log_files = []
    # Find all .log files recursively (escape path to handle special characters)
    log_files.extend(glob.glob(os.path.join(glob.escape(logs_dir), "**", "*.log"), recursive=True))

    return log_files


def upload_trace_files_to_mlflow(
    mlflow_writer,
    tensorboard_dir: str,
    artifact_path: str = "traces",
) -> int:
    """
    Upload all profiler trace files to MLflow as artifacts.

    This function collects trace files from the tensorboard directory and
    uploads them to MLflow. In distributed settings, only rank 0 (or the
    last rank where MLflow writer is initialized) should call this.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        artifact_path: MLflow artifact subdirectory for trace files

    Returns:
        Number of trace files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_rank_0(f"[MLflow] Searching for trace files in: {tensorboard_dir}")
    trace_files = _get_all_trace_files(tensorboard_dir)
    if len(trace_files) > 5:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files[:5]}...")
    else:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files}")

    if not trace_files:
        log_rank_0("[MLflow] No trace files found to upload")
        return 0

    uploaded_count = 0
    for trace_file in trace_files:
        try:
            # Get relative path from tensorboard_dir for artifact organization
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            # Determine artifact subdirectory based on file location
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
    return uploaded_count


def upload_log_files_to_mlflow(
    mlflow_writer,
    exp_root_path: str,
    artifact_path: str = "logs",
) -> int:
    """
    Upload all log files to MLflow as artifacts.

    This function collects log files from all ranks and all log levels
    and uploads them to MLflow. The directory structure is preserved
    in the artifact path.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        exp_root_path: Root path of the experiment
        artifact_path: MLflow artifact subdirectory for log files

    Returns:
        Number of log files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_files = _get_all_log_files(exp_root_path)

    if not log_files:
        log_rank_0("[MLflow] No log files found to upload")
        return 0

    logs_base_dir = os.path.join(exp_root_path, "logs")
    uploaded_count = 0

    for log_file in log_files:
        try:
            # Preserve directory structure relative to logs base directory
            rel_path = os.path.relpath(log_file, logs_base_dir)
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
            uploaded_count += 1
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload log file {log_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} log files to '{artifact_path}'")
    return uploaded_count


def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result

Copilot AI Feb 2, 2026


The new mlflow_artifacts.py module lacks unit tests. Given that the repository has comprehensive test coverage for other megatron backend modules (as seen in tests/unit_tests/backends/megatron/), this module should also have tests to cover:

  • File discovery logic (_get_all_trace_files, _get_all_log_files)
  • Upload functions with various scenarios (no files, multiple files, error handling)
  • Glob escaping for special characters
  • Relative path handling

Tests would help ensure reliability, especially for edge cases like special characters in paths or missing directories.

Move MLflow artifact upload functions from global_vars.py to new
mlflow_setup.py to reduce merge conflicts:
- set_exp_root_path()
- get_exp_root_path()
- upload_mlflow_artifacts()

global_vars.py now matches main, avoiding future conflicts when
merging from main branch.
Keep run_pretrain.sh and run_slurm_pretrain.sh as main.
Experiment paths can be configured via environment variables:
- PRIMUS_TEAM, PRIMUS_USER, PRIMUS_EXP_NAME, PRIMUS_WORKSPACE
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

# Escape directory path to handle special characters like [] in experiment names
escaped_dir = glob.escape(tensorboard_dir)
for pattern in patterns:
    trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))

Copilot AI Feb 2, 2026


The recursive glob pattern on line 50 could be slow or resource-intensive if tensorboard_dir contains a very deep directory structure or a large number of files. Consider adding a comment about the performance implications, or optionally limiting the recursion depth if this becomes a concern in practice.

Suggested change
    trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
    trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
    # Note: This recursive glob walks the entire tensorboard_dir tree, which may be
    # expensive if the directory is very large or deeply nested. If this becomes
    # a bottleneck in practice, consider constraining tensorboard_dir or introducing
    # a limit on recursion depth.

Comment on lines +88 to +90
return log_files



Copilot AI Feb 2, 2026


The recursive glob pattern for log files (line 86) will follow symbolic links by default in Python's glob.glob(). If there are symlinks in the logs directory that point outside the intended log directory, this could potentially upload files from unintended locations. Consider using glob.glob(..., recursive=True) with an additional check using os.path.realpath() to ensure files are within the expected directory, or document this behavior if it's intentional.

Suggested change
return log_files
# Prevent symlinks inside logs_dir from escaping the intended directory
logs_dir_real = os.path.realpath(logs_dir)
filtered_log_files = []
for path in log_files:
    real_path = os.path.realpath(path)
    try:
        common = os.path.commonpath([logs_dir_real, real_path])
    except ValueError:
        # On different drives or invalid paths; treat as outside logs_dir
        common = None
    if common == logs_dir_real:
        filtered_log_files.append(path)
    else:
        warning_rank_0(f"Skipping log file outside logs directory: {path}")
return filtered_log_files

Comment on lines +2 to +3
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.

Copilot AI Feb 2, 2026


The copyright header states "Copyright (c) 2022, NVIDIA CORPORATION" but this is a newly created file in 2025. Since this file contains entirely new AMD code (as indicated by the PR), consider updating line 2 to reflect only AMD copyright, similar to mlflow_artifacts.py which correctly uses "Copyright (c) 2025, Advanced Micro Devices, Inc."

Suggested change
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.

Comment on lines +1130 to +1134
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)

Copilot AI Feb 2, 2026


The duplicate code pattern for uploading MLflow artifacts appears in two locations (lines 1130-1134 and 1580-1583). Consider extracting this into a helper function to ensure consistency and maintainability. For example, create a function like finalize_mlflow_run() that handles both artifact upload and run ending.
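
One possible shape for that helper, sketched under the assumption that get_mlflow_writer() is the accessor used elsewhere in global_vars and that the writer exposes an end_run() mirroring mlflow.end_run():

def finalize_mlflow_run(args):
    """Upload configured artifacts, then end the active MLflow run."""
    mlflow_writer = get_mlflow_writer()  # assumed accessor from global_vars
    if mlflow_writer is None:
        return
    upload_mlflow_artifacts(
        upload_traces=getattr(args, "mlflow_upload_traces", True),
        upload_logs=getattr(args, "mlflow_upload_logs", True),
    )
    mlflow_writer.end_run()  # assumed; mirrors mlflow.end_run()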
