feat: Add MLflow artifact upload for traces and logs #440
base: main
Conversation
- Add mlflow_artifacts.py with functions to collect and upload trace/log files
- Add upload_mlflow_artifacts() wrapper in global_vars.py
- Integrate artifact upload in trainer.py before MLflow run ends
- Add mlflow_upload_traces and mlflow_upload_logs config options
- Add unique timestamp-based output directories for multi-node consistency
- Pass MLflow environment variables through Docker container
Pull request overview
This PR adds functionality to automatically upload PyTorch profiler trace files and training log files to MLflow as artifacts when MLflow tracking is enabled. The implementation introduces a new module for artifact collection and upload, integrates it into the training lifecycle, and updates example scripts to support consistent output directories across multi-node training runs.
Key changes:
- New artifact upload module with functions to collect and upload trace/log files to MLflow
- Integration of artifact uploads before MLflow run completion in the trainer
- Configuration options to control trace and log uploads (defaulting to enabled)
- Shell script improvements for timestamp-based output directories with multi-node consistency
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New module implementing trace/log file discovery and MLflow artifact upload functionality |
| primus/backends/megatron/training/global_vars.py | Adds global variable for exp_root_path and wrapper function for artifact uploads |
| primus/modules/trainer/megatron/trainer.py | Integrates artifact upload calls before MLflow run termination in two exit paths |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds mlflow_upload_traces and mlflow_upload_logs config options (both default to true) |
| examples/run_slurm_pretrain.sh | Implements timestamp-based output directory naming and exports timestamp for multi-node consistency |
| examples/run_pretrain.sh | Adds conditional timestamp generation to support both single-node and multi-node scenarios, fixes typo in log message |
| examples/run_local_pretrain.sh | Adds MLflow environment variables and Primus path variables to Docker container environment |
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Force-pushed from 3c149be to 13dfa81.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
The experiment name contains square brackets like [deepseek_v2_lite-pretrain_...]-rank[0] which are interpreted as glob pattern character classes, causing glob.glob to return empty results even though files exist. Fixed by using glob.escape() on directory paths before using them with glob.glob().
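A minimal, runnable sketch of the failure mode and the fix (the bracketed directory name is illustrative):

```python
import glob
import os
import tempfile

# Reproduce the bug: brackets in the directory name act as a glob character class.
root = tempfile.mkdtemp()
trace_dir = os.path.join(root, "[deepseek_v2_lite-pretrain]-rank[0]")
os.makedirs(trace_dir)
open(os.path.join(trace_dir, "step1.pt.trace.json"), "w").close()

# Unescaped: "[...]" is treated as a character class, so nothing matches.
print(glob.glob(os.path.join(trace_dir, "*.pt.trace.json")))  # []

# Escaped: glob.escape() neutralizes the brackets, so the file is found.
print(glob.glob(os.path.join(glob.escape(trace_dir), "*.pt.trace.json")))
```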
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
examples/run_slurm_pretrain.sh (outdated)
```bash
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP:-unknown}" .yaml)
```
Copilot (AI) · Jan 22, 2026
MODEL_NAME falls back to unknown when EXP is unset, but run_local_pretrain.sh provides a default EXP. This can lead to confusing output directories (e.g., unknown_<ts>) for users relying on defaults. Consider defaulting EXP here as well (or deriving MODEL_NAME after applying the same default).
Suggested change:

```bash
# Set a default EXP if not provided, to align with run_local_pretrain.sh and avoid 'unknown_<ts>' names
if [[ -z "${EXP:-}" ]]; then
    export EXP="${SCRIPT_DIR}/megatron/exp_pretrain.yaml"
fi
# Extract model name from EXP config file path (e.g., deepseek_v2_lite-pretrain.yaml -> deepseek_v2_lite-pretrain)
MODEL_NAME=$(basename "${EXP}" .yaml)
```
examples/run_local_pretrain.sh (outdated)
```bash
--env PRIMUS_WORKSPACE \
--env PRIMUS_EXP_NAME \
--env TIMESTAMP \
--env LOG_DIR \
--env PRIMUS_TEAM \
--env PRIMUS_USER \
```
Copilot (AI) · Jan 22, 2026
ENV_ARGS already forwards all PRIMUS_ variables into the container (env | grep "^PRIMUS_"), so explicitly passing --env PRIMUS_WORKSPACE/PRIMUS_EXP_NAME/PRIMUS_TEAM/PRIMUS_USER again is redundant and can be confusing to maintain. Prefer relying on the PRIMUS_ pass-through and keep explicit --env only for non-PRIMUS variables like TIMESTAMP/LOG_DIR.
Suggested change:

```bash
--env TIMESTAMP \
--env LOG_DIR \
```
```python
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0
```
Copilot (AI) · Jan 22, 2026
mlflow_artifacts.py logs via log_rank_0/warning_rank_0, but MLflow is initialized on rank world_size - 1 (see global_vars._set_mlflow_writer), so these messages (including upload failures) will be suppressed in typical distributed runs. Use a rank filter that matches the MLflow rank (e.g., log_rank_last), or add/route warnings to a warning_rank_last/log_rank_all path so upload failures are visible.
Suggested change:

```python
from primus.modules.module_utils import log_rank_last as log_rank_0, warning_rank_last as warning_rank_0
```
```python
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
```
Copilot (AI) · Jan 22, 2026
Artifact upload behavior is new but currently has no unit tests. Consider adding tests that create a temp tensorboard_dir/exp_root_path with sample *.pt.trace.json(.gz) and *.log files and verify upload_artifacts_to_mlflow() calls mlflow_writer.log_artifact with the expected artifact_path subdirectories.
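A sketch of such a test, assuming pytest's tmp_path fixture, a MagicMock standing in for the mlflow module, and log_rank_0 being safe to call outside a distributed run:

```python
from unittest.mock import MagicMock

from primus.backends.megatron.training.mlflow_artifacts import upload_artifacts_to_mlflow


def test_upload_artifacts_to_mlflow(tmp_path):
    # Lay out sample trace and log files in the documented locations.
    tb_dir = tmp_path / "tensorboard"
    tb_dir.mkdir()
    (tb_dir / "worker0.pt.trace.json").touch()
    (tb_dir / "worker1.pt.trace.json.gz").touch()

    exp_root = tmp_path / "exp"
    (exp_root / "logs" / "master").mkdir(parents=True)
    (exp_root / "logs" / "master" / "master-0.log").touch()

    writer = MagicMock()  # stands in for the mlflow module
    result = upload_artifacts_to_mlflow(
        writer, tensorboard_dir=str(tb_dir), exp_root_path=str(exp_root)
    )

    assert result == {"traces": 2, "logs": 1}
    # Every upload should go through log_artifact with a traces/ or logs/ subpath.
    for call in writer.log_artifact.call_args_list:
        assert call.kwargs["artifact_path"].startswith(("traces", "logs"))
```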
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
examples/run_slurm_pretrain.sh:78
- The LOG_FILE variable is not exported but is referenced in the srun command. Since LOG_FILE is defined on line 53 but not exported, when the bash command on line 78 tries to use it with 'tee ${LOG_FILE}', the variable will be empty or undefined on the remote nodes. This will cause the tee command to fail or write to an unexpected location. Either export LOG_FILE after defining it (add 'export LOG_FILE' on line 54), or use the full path expansion within the command string (change to 'tee ${LOG_DIR}/log_slurm_pretrain.txt').
```bash
LOG_FILE="${LOG_DIR}/log_slurm_pretrain.txt"
mkdir -p "$LOG_DIR"

srun -N "${NNODES}" \
    --exclusive \
    --export ALL \
    --ntasks-per-node=1 \
    --cpus-per-task="${CPUS_PER_TASK:-128}" \
    bash -c "
        readarray -t node_array < <(scontrol show hostnames \"\$SLURM_JOB_NODELIST\")

        if [ \"\$SLURM_NODEID\" = \"0\" ]; then
            echo \"========== Slurm cluster info ==========\"
            echo \"SLURM_NODELIST: \${node_array[*]}\"
            echo \"SLURM_NNODES: \${SLURM_NNODES}\"
            echo \"SLURM_GPUS_ON_NODE: \${SLURM_GPUS_ON_NODE}\"
            echo \"\"
        fi

        # Log TIMESTAMP on each node to verify consistency across nodes
        echo \"[Node \$SLURM_NODEID] TIMESTAMP=\${TIMESTAMP}\"

        export MASTER_ADDR=\${node_array[0]}
        export MASTER_PORT=\${MASTER_PORT}
        export NNODES=\${SLURM_NNODES}
        export NODE_RANK=\${SLURM_PROCID}
        export GPUS_PER_NODE=\${SLURM_GPUS_ON_NODE}
        export REBUILD_PRIMUS_TURBO=\${REBUILD_PRIMUS_TURBO}

        bash ${SCRIPT_DIR}/run_local_pretrain.sh \"\$@\" 2>&1 | tee ${LOG_FILE}
```
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
```yaml
# NOTE: When disable_mlflow=false, traces and logs are uploaded by default.
# Set these to false if you only want metrics/params logged to MLflow.
mlflow_upload_traces: true   # Upload profiler trace files to MLflow
mlflow_upload_logs: true     # Upload training log files to MLflow
```
Copilot (AI) · Feb 2, 2026
The config options default to True when MLflow is enabled. This means traces and logs will be uploaded automatically even if users don't explicitly configure these options. While the comment in the YAML explains this behavior, users who are unaware might experience unexpected uploads of potentially large trace/log files, which could impact performance or storage costs in cloud environments.
Consider changing the default to False for a more conservative approach, or ensure that the documentation clearly highlights this behavior and its implications (especially for trace files which can be large).
Suggested change:

```yaml
# NOTE: When disable_mlflow=false, traces and logs are NOT uploaded by default.
# Set these to true if you also want traces/logs (which can be large) logged to MLflow.
mlflow_upload_traces: false  # Upload profiler trace files to MLflow
mlflow_upload_logs: false    # Upload training log files to MLflow
```
```python
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```
Copilot (AI) · Feb 2, 2026
Same synchronization issue as the first upload_mlflow_artifacts call: there's no barrier to ensure all ranks have finished writing their files before upload begins. This could lead to incomplete or corrupted uploads in distributed training scenarios.
Consider adding a torch.distributed.barrier() before upload_mlflow_artifacts() to ensure all ranks have completed their file I/O operations.
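A sketch of the suggested ordering in the trainer, assuming torch.distributed is initialized at this point:

```python
import torch.distributed as dist

# Ensure every rank has flushed its trace/log files before the MLflow
# rank starts collecting them from the (shared) filesystem.
if dist.is_available() and dist.is_initialized():
    dist.barrier()

upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```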
```python
    uploaded_count = 0
    for trace_file in trace_files:
        try:
            # Get relative path from tensorboard_dir for artifact organization
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            # Determine artifact subdirectory based on file location
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
    return uploaded_count
```
Copilot (AI) · Feb 2, 2026
The upload process iterates through all trace and log files synchronously, uploading them one by one. For large-scale training runs, this could result in a significant number of files (one trace file per profiled rank, multiple log files per rank) and potentially long upload times that block the training completion.
Consider:
- Adding progress logging with a counter (e.g., "Uploaded 5/100 trace files"); see the sketch after this list
- Implementing batch uploads if the MLflow API supports it
- Adding a timeout or size limit configuration option
- Warning users about potential upload times if many/large files are detected
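A minimal sketch of the first suggestion, written as a drop-in replacement for the upload loop shown above (all names come from the existing code):

```python
    total = len(trace_files)
    uploaded_count = 0
    for i, trace_file in enumerate(trace_files, start=1):
        try:
            # artifact_subpath computed exactly as in the existing loop
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )
            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            # Periodic progress so long uploads are visible in the logs.
            if i % 10 == 0 or i == total:
                log_rank_0(f"[MLflow] Uploaded {i}/{total} trace files")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")
```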
```python
def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result
```
Copilot (AI) · Feb 2, 2026
In multi-node distributed training, only the last rank (world_size - 1) calls upload_mlflow_artifacts to upload files. However, profiler trace files and log files from other ranks may be located on different node-local filesystems if shared storage is not used. The code assumes all files are accessible from the last rank's filesystem, which may not be true in multi-node scenarios without a shared filesystem.
Consider one of the following approaches:
- Add documentation explaining that shared storage (e.g., NFS) is required for multi-node artifact uploads
- Implement a mechanism to collect files from all nodes (e.g., using distributed file gathering)
- Add a check to warn users if files are expected but not found, which could indicate a shared storage issue (a sketch follows below)
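A sketch of that third option, assuming torch.distributed exposes the world size at upload time and warning_rank_0 is imported as in mlflow_artifacts.py:

```python
import torch.distributed as dist

from primus.modules.module_utils import warning_rank_0


def _warn_if_traces_missing(trace_files: list, tensorboard_dir: str) -> None:
    # In multi-node runs without shared storage, only the local node's files
    # are visible to the uploading rank; an empty result is a likely symptom.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size > 1 and not trace_files:
        warning_rank_0(
            f"[MLflow] No trace files found under {tensorboard_dir} in a "
            f"{world_size}-rank run; if profiling was enabled, verify that "
            "the output directory is on shared storage (e.g., NFS)."
        )
```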
primus/backends/megatron/training/mlflow_artifacts.py

```python
###############################################################################
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
#
# See LICENSE for license information.
###############################################################################

"""
MLflow Artifact Logging Utilities

This module provides functions to upload trace files and log files to MLflow
when MLflow tracking is enabled.

Features:
- Upload profiler trace files from all profiled ranks (including multi-node)
- Upload log files from all levels and all ranks
- Supports both local and distributed training scenarios
"""

import glob
import os
from typing import Optional

from primus.modules.module_utils import log_rank_0, warning_rank_0


def _get_all_trace_files(tensorboard_dir: str) -> list:
    """
    Find all profiler trace files in the tensorboard directory.

    Trace files are typically named like:
    - *.pt.trace.json
    - *.pt.trace.json.gz

    Args:
        tensorboard_dir: Path to the tensorboard directory containing trace files

    Returns:
        List of paths to trace files
    """
    if not tensorboard_dir or not os.path.exists(tensorboard_dir):
        return []

    trace_files = []
    # Look for PyTorch profiler trace files (both compressed and uncompressed)
    patterns = ["*.pt.trace.json", "*.pt.trace.json.gz"]
    # Escape directory path to handle special characters like [] in experiment names
    escaped_dir = glob.escape(tensorboard_dir)
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
        trace_files.extend(glob.glob(os.path.join(escaped_dir, "**", pattern), recursive=True))

    # Remove duplicates while preserving order
    seen = set()
    unique_files = []
    for f in trace_files:
        if f not in seen:
            seen.add(f)
            unique_files.append(f)

    return unique_files


def _get_all_log_files(exp_root_path: str) -> list:
    """
    Find all log files in the experiment logs directory.

    Log files are organized as:
    - {exp_root_path}/logs/master/master-*.log
    - {exp_root_path}/logs/{module_name}/rank-{rank}/*.log

    Args:
        exp_root_path: Root path of the experiment

    Returns:
        List of paths to log files
    """
    if not exp_root_path:
        return []

    logs_dir = os.path.join(exp_root_path, "logs")
    if not os.path.exists(logs_dir):
        return []

    log_files = []
    # Find all .log files recursively (escape path to handle special characters)
    log_files.extend(glob.glob(os.path.join(glob.escape(logs_dir), "**", "*.log"), recursive=True))

    return log_files


def upload_trace_files_to_mlflow(
    mlflow_writer,
    tensorboard_dir: str,
    artifact_path: str = "traces",
) -> int:
    """
    Upload all profiler trace files to MLflow as artifacts.

    This function collects trace files from the tensorboard directory and
    uploads them to MLflow. In distributed settings, only rank 0 (or the
    last rank where MLflow writer is initialized) should call this.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        artifact_path: MLflow artifact subdirectory for trace files

    Returns:
        Number of trace files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_rank_0(f"[MLflow] Searching for trace files in: {tensorboard_dir}")
    trace_files = _get_all_trace_files(tensorboard_dir)
    if len(trace_files) > 5:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files[:5]}...")
    else:
        log_rank_0(f"[MLflow] Found {len(trace_files)} trace files: {trace_files}")

    if not trace_files:
        log_rank_0("[MLflow] No trace files found to upload")
        return 0

    uploaded_count = 0
    for trace_file in trace_files:
        try:
            # Get relative path from tensorboard_dir for artifact organization
            rel_path = os.path.relpath(trace_file, tensorboard_dir)
            # Determine artifact subdirectory based on file location
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(trace_file, artifact_path=artifact_subpath)
            uploaded_count += 1
            log_rank_0(f"[MLflow] Uploaded trace file: {os.path.basename(trace_file)}")
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload trace file {trace_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} trace files to '{artifact_path}'")
    return uploaded_count


def upload_log_files_to_mlflow(
    mlflow_writer,
    exp_root_path: str,
    artifact_path: str = "logs",
) -> int:
    """
    Upload all log files to MLflow as artifacts.

    This function collects log files from all ranks and all log levels
    and uploads them to MLflow. The directory structure is preserved
    in the artifact path.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        exp_root_path: Root path of the experiment
        artifact_path: MLflow artifact subdirectory for log files

    Returns:
        Number of log files uploaded
    """
    if mlflow_writer is None:
        return 0

    log_files = _get_all_log_files(exp_root_path)

    if not log_files:
        log_rank_0("[MLflow] No log files found to upload")
        return 0

    logs_base_dir = os.path.join(exp_root_path, "logs")
    uploaded_count = 0

    for log_file in log_files:
        try:
            # Preserve directory structure relative to logs base directory
            rel_path = os.path.relpath(log_file, logs_base_dir)
            artifact_subpath = (
                os.path.join(artifact_path, os.path.dirname(rel_path))
                if os.path.dirname(rel_path)
                else artifact_path
            )

            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
            uploaded_count += 1
        except Exception as e:
            warning_rank_0(f"[MLflow] Failed to upload log file {log_file}: {e}")

    log_rank_0(f"[MLflow] Uploaded {uploaded_count} log files to '{artifact_path}'")
    return uploaded_count


def upload_artifacts_to_mlflow(
    mlflow_writer,
    tensorboard_dir: Optional[str] = None,
    exp_root_path: Optional[str] = None,
    upload_traces: bool = True,
    upload_logs: bool = True,
) -> dict:
    """
    Upload all artifacts (trace files and log files) to MLflow.

    This is the main entry point for uploading artifacts to MLflow.
    It handles both trace files from profiling and log files from training.

    Args:
        mlflow_writer: The MLflow module instance (from get_mlflow_writer())
        tensorboard_dir: Path to the tensorboard directory containing trace files
        exp_root_path: Root path of the experiment for log files
        upload_traces: Whether to upload trace files
        upload_logs: Whether to upload log files

    Returns:
        Dictionary with counts of uploaded files:
        {
            "traces": <number of trace files uploaded>,
            "logs": <number of log files uploaded>
        }
    """
    if mlflow_writer is None:
        log_rank_0("[MLflow] MLflow writer not available, skipping artifact upload")
        return {"traces": 0, "logs": 0}

    log_rank_0("[MLflow] Starting artifact upload to MLflow...")
    log_rank_0(f"[MLflow] tensorboard_dir: {tensorboard_dir}")
    log_rank_0(f"[MLflow] exp_root_path: {exp_root_path}")
    log_rank_0(f"[MLflow] upload_traces: {upload_traces}, upload_logs: {upload_logs}")

    result = {"traces": 0, "logs": 0}

    if upload_traces and tensorboard_dir:
        result["traces"] = upload_trace_files_to_mlflow(
            mlflow_writer, tensorboard_dir, artifact_path="traces"
        )

    if upload_logs and exp_root_path:
        result["logs"] = upload_log_files_to_mlflow(mlflow_writer, exp_root_path, artifact_path="logs")

    log_rank_0(
        f"[MLflow] Artifact upload complete: {result['traces']} trace files, {result['logs']} log files"
    )

    return result
```
Copilot (AI) · Feb 2, 2026
The new mlflow_artifacts.py module lacks unit tests. Given that the repository has comprehensive test coverage for other megatron backend modules (as seen in tests/unit_tests/backends/megatron/), this module should also have tests to cover:
- File discovery logic (_get_all_trace_files, _get_all_log_files)
- Upload functions with various scenarios (no files, multiple files, error handling)
- Glob escaping for special characters
- Relative path handling
Tests would help ensure reliability, especially for edge cases like special characters in paths or missing directories.
Move MLflow artifact upload functions from global_vars.py to new mlflow_setup.py to reduce merge conflicts:
- set_exp_root_path()
- get_exp_root_path()
- upload_mlflow_artifacts()

global_vars.py now matches main, avoiding future conflicts when merging from the main branch.
Keep run_pretrain.sh and run_slurm_pretrain.sh as on main. Experiment paths can be configured via environment variables:
- PRIMUS_TEAM, PRIMUS_USER, PRIMUS_EXP_NAME, PRIMUS_WORKSPACE
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
```python
    # Escape directory path to handle special characters like [] in experiment names
    escaped_dir = glob.escape(tensorboard_dir)
    for pattern in patterns:
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
```
Copilot (AI) · Feb 2, 2026
The recursive glob pattern on line 50 could potentially be slow or resource-intensive if the tensorboard_dir contains a very deep directory structure or a large number of files. Consider adding a comment about potential performance implications, or optionally limiting the recursion depth if this becomes a concern in practice.
Suggested change:

```python
        trace_files.extend(glob.glob(os.path.join(escaped_dir, pattern)))
        # Note: This recursive glob walks the entire tensorboard_dir tree, which may be
        # expensive if the directory is very large or deeply nested. If this becomes
        # a bottleneck in practice, consider constraining tensorboard_dir or introducing
        # a limit on recursion depth.
```
```python
    return log_files
```
Copilot (AI) · Feb 2, 2026
The recursive glob pattern for log files (line 86) will follow symbolic links by default in Python's glob.glob(). If there are symlinks in the logs directory that point outside the intended log directory, this could potentially upload files from unintended locations. Consider using glob.glob(..., recursive=True) with an additional check using os.path.realpath() to ensure files are within the expected directory, or document this behavior if it's intentional.
Suggested change:

```python
    # Prevent symlinks inside logs_dir from escaping the intended directory
    logs_dir_real = os.path.realpath(logs_dir)
    filtered_log_files = []
    for path in log_files:
        real_path = os.path.realpath(path)
        try:
            common = os.path.commonpath([logs_dir_real, real_path])
        except ValueError:
            # On different drives or invalid paths; treat as outside logs_dir
            common = None
        if common == logs_dir_real:
            filtered_log_files.append(path)
        else:
            warning_rank_0(f"Skipping log file outside logs directory: {path}")
    return filtered_log_files
```
```python
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Modification Copyright© 2025 Advanced Micro Devices, Inc. All rights reserved.
```
Copilot (AI) · Feb 2, 2026
The copyright header states "Copyright (c) 2022, NVIDIA CORPORATION" but this is a newly created file in 2025. Since this file contains entirely new AMD code (as indicated by the PR), consider updating line 2 to reflect only AMD copyright, similar to mlflow_artifacts.py which correctly uses "Copyright (c) 2025, Advanced Micro Devices, Inc."
Suggested change:

```python
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
```
```python
# Upload artifacts before ending the run
upload_mlflow_artifacts(
    upload_traces=getattr(args, "mlflow_upload_traces", True),
    upload_logs=getattr(args, "mlflow_upload_logs", True),
)
```
Copilot (AI) · Feb 2, 2026
The duplicate code pattern for uploading MLflow artifacts appears in two locations (lines 1130-1134 and 1580-1583). Consider extracting this into a helper function to ensure consistency and maintainability. For example, create a function like finalize_mlflow_run() that handles both artifact upload and run ending.
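A sketch of the suggested helper; upload_mlflow_artifacts and get_mlflow_writer come from this PR, while the exact end-of-run call is an assumption (mlflow.end_run() is the standard fluent API):

```python
def finalize_mlflow_run(args) -> None:
    """Upload artifacts and end the MLflow run; safe to call from any exit path."""
    mlflow_writer = get_mlflow_writer()
    if mlflow_writer is None:
        return
    upload_mlflow_artifacts(
        upload_traces=getattr(args, "mlflow_upload_traces", True),
        upload_logs=getattr(args, "mlflow_upload_logs", True),
    )
    # Assumption: mirrors how the trainer currently ends the run.
    mlflow_writer.end_run()
```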
feat: Add MLflow artifact upload for traces and logs
Adds functionality to automatically upload profiler trace files and training log files
to MLflow as artifacts when MLflow tracking is enabled.
Features
- Upload profiler trace files to MLflow under artifacts/traces/
- Upload training log files to MLflow under artifacts/logs/

Config Options

```yaml
mlflow_upload_traces: true  # Upload profiler trace files to MLflow
mlflow_upload_logs: true    # Upload training log files to MLflow
```

Files Changed
- primus/backends/megatron/training/mlflow_artifacts.py - New file with trace/log collection and upload functions
- primus/backends/megatron/training/global_vars.py - Add upload_mlflow_artifacts() wrapper
- primus/modules/trainer/megatron/trainer.py - Integrate artifact upload before MLflow run ends
- primus/configs/modules/megatron/primus_megatron_module.yaml - Add config options
- examples/run_pretrain.sh - Add timestamp-based output directories
- examples/run_slurm_pretrain.sh - Share timestamp across nodes for multi-node runs
- examples/run_local_pretrain.sh - Pass MLflow environment variables to container

Usage
When MLflow is enabled, artifacts are automatically uploaded at the end of training:
- tensorboard_dir → MLflow artifacts/traces/
- exp_root_path/logs/ → MLflow artifacts/logs/
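For reference, a hedged sketch of calling the new module directly, outside the trainer; the paths are placeholders:

```python
import mlflow

from primus.backends.megatron.training.mlflow_artifacts import upload_artifacts_to_mlflow

mlflow.start_run()
counts = upload_artifacts_to_mlflow(
    mlflow_writer=mlflow,                      # the mlflow module itself acts as the writer
    tensorboard_dir="output/exp/tensorboard",  # placeholder path
    exp_root_path="output/exp",                # placeholder path
)
print(counts)  # e.g. {"traces": 2, "logs": 5}
mlflow.end_run()
```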