
Conversation


gphuang commented on Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload

Adds TraceLens trace analysis capability to automatically generate performance
reports from PyTorch profiler traces and upload them to MLflow.

Features

  • Generate TraceLens analysis reports (XLSX, CSV formats)
  • Auto-install TraceLens from GitHub if not present
  • Upload reports to MLflow artifacts/trace_analysis/
  • Configurable rank filtering and report limits

Config Options

  • mlflow_upload_tracelens_report: false # Enable TraceLens analysis
  • mlflow_tracelens_ranks: null # Ranks to analyze (null = all)
  • mlflow_tracelens_max_reports: null # Max reports (null = unlimited)
  • mlflow_tracelens_output_format: all # all, xlsx, or csv
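
For orientation, a minimal sketch of how the trainer side reads these options; read_tracelens_options is a hypothetical helper, but the getattr fallback pattern and the config keys match the trainer.py code quoted in the reviews below (the non-boolean fallbacks here are illustrative):

def read_tracelens_options(args):
    """Collect the TraceLens options with getattr fallbacks, as trainer.py does."""
    return {
        "upload_report": getattr(args, "mlflow_upload_tracelens_report", False),
        "ranks": getattr(args, "mlflow_tracelens_ranks", None),  # None = all ranks
        "max_reports": getattr(args, "mlflow_tracelens_max_reports", None),  # None = unlimited
        "output_format": getattr(args, "mlflow_tracelens_output_format", "all"),  # all | xlsx | csv
    }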


Copilot AI left a comment


Pull request overview

This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV).

Key changes:

  • New module for MLflow artifact management with TraceLens integration
  • Automatic TraceLens installation from GitHub with fallback CSV generation
  • Configuration options to control trace analysis (ranks, max reports, output formats)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

File descriptions:

  • primus/backends/megatron/training/mlflow_artifacts.py: New 725-line module implementing trace/log file uploads and TraceLens report generation with a fallback CSV summary
  • primus/backends/megatron/training/global_vars.py: Adds an import and a wrapper function upload_mlflow_artifacts to expose artifact upload functionality
  • primus/modules/trainer/megatron/trainer.py: Calls upload_mlflow_artifacts before ending the MLflow run, passing configuration parameters from args
  • primus/configs/modules/megatron/primus_megatron_module.yaml: Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis
Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:382

  • Variable dfs is not used.
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:370

  • This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)



Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.



gphuang force-pushed the feat/12-enable-tracelens-analysis branch from df2e40a to 2861bdf on December 18, 2025 10:33

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.



…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured
as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would
receive a string instead of a list, causing _filter_traces_by_rank to
silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.

When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once
with both output_xlsx_path and output_csvs_dir parameters, parsing
the trace file only once and generating both formats from the same data.

This improves performance significantly for the common use case of
generating both report formats.
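
A minimal sketch of the single-parse path this commit describes; generate_all_formats is a hypothetical wrapper, while the import path and keyword arguments are the ones used elsewhere in this PR:

from TraceLens.Reporting import generate_perf_report_pytorch

def generate_all_formats(trace_file: str, xlsx_path: str, csv_subdir: str):
    # One parse of the trace file, both output formats from the same data.
    return generate_perf_report_pytorch(
        trace_file,
        output_xlsx_path=xlsx_path,  # multi-tab Excel workbook
        output_csvs_dir=csv_subdir,  # one CSV per analysis table
    )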

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

After TraceLens reports are successfully uploaded to MLflow, the local
tracelens_reports directory is automatically cleaned up to save disk space.

This addresses the issue of temporary directories not being cleaned up
after artifact upload. The reports remain accessible in MLflow while
freeing up local storage.

Other directories checked:
- tensorboard_dir: Contains original trace files, NOT temporary
- exp_root_path/logs: Contains original log files, NOT temporary
- tracelens_reports: Processed reports uploaded to MLflow, safe to clean up

Added mlflow_tracelens_cleanup_after_upload parameter to control whether
local TraceLens reports are removed after upload to MLflow.

Default: True (cleanup to save disk space)
Set to False to keep reports locally for inspection/debugging

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): Production runs, save disk space
- False: Development/debugging, keep local copies for inspection
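
A minimal sketch of the cleanup behavior described above; cleanup_reports_dir is a hypothetical helper, and print stands in for the module's warning_rank_0:

import os
import shutil

def cleanup_reports_dir(reports_dir: str, cleanup_after_upload: bool = True) -> None:
    """Remove the local TraceLens reports directory once MLflow holds the copies."""
    if not cleanup_after_upload or not os.path.isdir(reports_dir):
        return
    try:
        shutil.rmtree(reports_dir)
    except OSError as e:
        print(f"[TraceLens] Failed to cleanup reports directory: {e}")  # stand-in for warning_rank_0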

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comment on lines +719 to +731
if ranks is not None and isinstance(ranks, str):
    import ast

    try:
        ranks = ast.literal_eval(ranks)
        if not isinstance(ranks, list):
            log_rank_0(
                f"[TraceLens] Warning: ranks evaluated to {type(ranks).__name__}, expected list. Using None."
            )
            ranks = None
    except (ValueError, SyntaxError) as e:
        log_rank_0(f"[TraceLens] Warning: Failed to parse ranks '{ranks}': {e}. Using None.")
        ranks = None

Copilot AI Jan 20, 2026


When parsing the ranks parameter as a string using ast.literal_eval, the code doesn't handle the case where ranks could be a single integer instead of a list. If a user mistakenly provides mlflow_tracelens_ranks: 0 instead of mlflow_tracelens_ranks: [0], the evaluation succeeds but the type check fails, resetting ranks to None. Consider adding explicit handling for integer input to convert it to a single-element list.
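
A minimal sketch of the handling this comment suggests; normalize_ranks is a hypothetical helper combining the quoted parsing with the proposed integer case:

import ast

def normalize_ranks(ranks):
    """Coerce a YAML-provided ranks value to a list of ranks, or None for all ranks."""
    if isinstance(ranks, str):
        try:
            ranks = ast.literal_eval(ranks)  # e.g. '[0,8]' -> [0, 8], '0' -> 0
        except (ValueError, SyntaxError):
            return None
    if isinstance(ranks, int):
        return [ranks]  # single integer -> single-element list
    return ranks if isinstance(ranks, list) else None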

Comment on lines +390 to +412
if output_format == "all":
    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
    csv_subdir = os.path.join(output_dir, report_name)
    os.makedirs(csv_subdir, exist_ok=True)

    # First call: Generate XLSX only
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

    # Check XLSX output
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

    # Second call: Generate CSVs only
    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Check CSV outputs (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure

Copilot AI Jan 20, 2026


For the "all" format, the function calls generate_perf_report_pytorch twice (once for XLSX, once for CSV), which means parsing the trace file twice. This could be inefficient for large trace files. Consider modifying the logic to parse the trace once and generate both outputs if the TraceLens API supports it.

Comment on lines +237 to +245
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    log_rank_0("[TraceLens] Successfully installed openpyxl")
    return True
except subprocess.CalledProcessError as e:
    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")

Copilot AI Jan 20, 2026


The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.

Suggested change
-    subprocess.check_call(
-        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
-        stdout=subprocess.DEVNULL,
-        stderr=subprocess.DEVNULL,
-    )
-    log_rank_0("[TraceLens] Successfully installed openpyxl")
-    return True
-except subprocess.CalledProcessError as e:
-    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
+    result = subprocess.run(
+        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        text=True,
+        check=True,
+    )
+    log_rank_0("[TraceLens] Successfully installed openpyxl")
+    return True
+except subprocess.CalledProcessError as e:
+    stderr_output = e.stderr.strip() if e.stderr else "No error output from pip."
+    warning_rank_0(
+        f"[TraceLens] Failed to install openpyxl: {e}. Stderr: {stderr_output}"
+    )

Comment on lines +396 to +406
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

# Check XLSX output
if os.path.exists(xlsx_path):
    log_rank_0(
        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
    )
    generated_files.append(xlsx_path)

# Second call: Generate CSVs only
generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Copilot AI Jan 20, 2026


The variable 'dfs' is assigned but never used in the "all" format branch. This appears to be leftover from debugging or an incomplete implementation. Consider removing the assignment or using the value if it serves a purpose.

# CSV only: Multiple files in a subdirectory per rank
csv_subdir = os.path.join(output_dir, report_name)
os.makedirs(csv_subdir, exist_ok=True)
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Copilot AI Jan 20, 2026


The variable 'dfs' is assigned but never used. Consider removing the assignment or using the value if it serves a purpose.

Suggested change
-dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Comment on lines +267 to +278
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git",
        "-q",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Copilot AI Jan 20, 2026


Installing packages from GitHub using subprocess without verifying the package integrity (e.g., checking commit hash or release tag) could pose a security risk. Consider pinning to a specific version/tag or commit hash for TraceLens installation to ensure reproducibility and security.
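
A hedged sketch of the pinning this comment suggests; "v1.0.0" is a placeholder, not a real TraceLens tag, and the surrounding install logic is unchanged:

import subprocess
import sys

# Pinning a real release tag or commit hash from the TraceLens repo would make
# the runtime install reproducible and auditable.
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git@v1.0.0",
    ]
)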

Comment on lines +267 to +278
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git",
        "-q",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Copilot AI Jan 20, 2026


The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

Comment on lines 130 to 131
artifacts to MLflow. Only the rank that initialized MLflow (last rank)
should call this to avoid duplicate uploads.

Copilot AI Jan 22, 2026


Documentation inconsistency: The comment states "Only the rank that initialized MLflow (last rank)" which is correct, but could be more precise by stating it's rank (world_size - 1) to match the implementation in global_vars.py line 71.

Suggested change
-artifacts to MLflow. Only the rank that initialized MLflow (last rank)
-should call this to avoid duplicate uploads.
+artifacts to MLflow. Only the rank that initialized MLflow (rank
+``world_size - 1``, i.e., the last rank) should call this to avoid
+duplicate uploads.

Comment on lines +237 to +246
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    log_rank_0("[TraceLens] Successfully installed openpyxl")
    return True
except subprocess.CalledProcessError as e:
    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
    return False

Copilot AI Jan 22, 2026


Subprocess calls silence all output which makes troubleshooting difficult. When a pip install fails, users won't see helpful error messages. Consider using stderr=subprocess.STDOUT to capture errors in the exception, or at minimum log the stderr output when an error occurs.

Comment on lines +773 to +774
except Exception as e:
    warning_rank_0(f"[TraceLens] Failed to cleanup reports directory: {e}")

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., IOError, OSError, shutil.Error) related to file operations rather than using a bare except clause.

Comment on lines +761 to +762
except Exception as e:
    warning_rank_0(f"[MLflow] Failed to upload report {report_path}: {e}")

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., OSError, IOError for file operations) rather than using a bare except clause.

Comment on lines +478 to +485
try:
    # Load trace file
    if trace_file.endswith(".gz"):
        with gzip.open(trace_file, "rt", encoding="utf-8") as f:
            trace_data = json.load(f)
    else:
        with open(trace_file, "r", encoding="utf-8") as f:
            trace_data = json.load(f)

Copilot AI Jan 22, 2026


Potential security risk: The code opens user-provided trace files without validating their content or size. Large or malformed JSON files could cause memory exhaustion or denial of service. Consider adding file size limits and using streaming JSON parsers for large files.

Comment on lines +478 to +485
try:
    # Load trace file
    if trace_file.endswith(".gz"):
        with gzip.open(trace_file, "rt", encoding="utf-8") as f:
            trace_data = json.load(f)
    else:
        with open(trace_file, "r", encoding="utf-8") as f:
            trace_data = json.load(f)

Copilot AI Jan 22, 2026


Performance concern: Loading entire trace files into memory could cause issues with large traces. PyTorch profiler traces can be hundreds of megabytes or larger in distributed training scenarios. Consider using streaming JSON parsers (like ijson) or add file size validation before attempting to load.
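
A minimal sketch of the size guard this comment suggests; MAX_TRACE_BYTES and trace_small_enough are assumptions, not part of the PR:

import os

MAX_TRACE_BYTES = 2 * 1024**3  # assumed 2 GiB cap; tune per deployment

def trace_small_enough(trace_file: str) -> bool:
    """Cheap size pre-check before json.load to avoid loading pathological traces."""
    return os.path.getsize(trace_file) <= MAX_TRACE_BYTES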

except json.JSONDecodeError as e:
    warning_rank_0(f"[TraceLens] Failed to parse trace JSON: {e}")
    return None
except Exception as e:

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions like KeyboardInterrupt and SystemExit. Consider catching specific exceptions (e.g., json.JSONDecodeError, IOError, OSError) and allowing system exceptions to propagate.

Suggested change
-except Exception as e:
+except (OSError, ValueError, csv.Error, KeyError) as e:


    mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
    uploaded_count += 1
except Exception as e:

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., OSError, IOError for file operations) rather than using a bare except clause.

Suggested change
-except Exception as e:
+except (OSError, IOError) as e:

Comment on lines +345 to +453
def generate_tracelens_report(
    trace_file: str,
    output_dir: str,
    report_name: Optional[str] = None,
    output_format: str = "all",
) -> List[str]:
    """
    Generate a TraceLens analysis report for a single trace file.

    Args:
        trace_file: Path to the PyTorch profiler trace file (JSON/JSON.GZ)
        output_dir: Directory to save the report
        report_name: Optional custom name for the report (base name for CSVs)
        output_format: Output format:
            - "all" (default): Both XLSX and CSV files
            - "xlsx": Single multi-tab Excel file with detailed analysis
            - "csv": Multiple CSV files (kernels, memory, communication, etc.)

    Returns:
        List of paths to generated report files
    """
    if not os.path.exists(trace_file):
        warning_rank_0(f"[TraceLens] Trace file not found: {trace_file}")
        return []

    os.makedirs(output_dir, exist_ok=True)

    # Generate base name from trace filename if not provided
    if report_name is None:
        base_name = os.path.basename(trace_file)
        # Remove extensions like .json.gz
        for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
            if base_name.endswith(trace_ext):
                base_name = base_name[: -len(trace_ext)]
                break
        report_name = base_name

    try:
        # Try using TraceLens Python API directly
        from TraceLens.Reporting import generate_perf_report_pytorch

        generated_files = []

        # For "all" format: TraceLens uses either/or logic - if output_csvs_dir is set,
        # it ONLY generates CSVs. So we need to call it twice for both formats.
        if output_format == "all":
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)

            # First call: Generate XLSX only
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

            # Check XLSX output
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

            # Second call: Generate CSVs only
            generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Check CSV outputs (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        elif output_format == "xlsx":
            # XLSX only: Single file with multiple tabs
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

        elif output_format == "csv":
            # CSV only: Multiple files in a subdirectory per rank
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Collect all generated CSV files (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        if generated_files:
            return generated_files

        warning_rank_0(f"[TraceLens] No output files generated for: {trace_file}")
        return []

    except ImportError:
        log_rank_0("[TraceLens] TraceLens not available, using fallback CSV summary")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []

    except Exception as e:
        warning_rank_0(f"[TraceLens] Error generating report: {e}")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []


Copilot AI Jan 22, 2026


Missing input validation for output_format parameter. The function expects 'all', 'xlsx', or 'csv' but doesn't validate the input. If an invalid value is provided (e.g., 'json'), the function will silently generate no output. Consider adding validation at the start of the function to raise a clear error for invalid format values.
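
A minimal sketch of the validation this comment asks for; validate_output_format and VALID_OUTPUT_FORMATS are hypothetical names:

VALID_OUTPUT_FORMATS = {"all", "xlsx", "csv"}

def validate_output_format(output_format: str) -> None:
    """Fail fast on values like 'json' instead of silently generating nothing."""
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"Invalid output_format {output_format!r}; expected one of {sorted(VALID_OUTPUT_FORMATS)}"
        )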

Comment on lines +376 to +380
for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
    if base_name.endswith(trace_ext):
        base_name = base_name[: -len(trace_ext)]
        break
report_name = base_name

Copilot AI Jan 22, 2026


Potential bug: The extension removal loop on line 376-379 uses a break statement which stops after the first matching extension. However, if a filename ends with multiple extensions that don't exactly match any single pattern (e.g., a file ending in ".trace.json.gz" when the patterns check for ".pt.trace.json.gz"), only the first matched extension would be removed. The logic appears correct for the given patterns, but could be fragile if patterns change. Consider using rsplit or a more explicit approach.
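
A minimal sketch of a longest-suffix-first variant; strip_trace_ext is a hypothetical helper that reorders the quoted extension list:

# Checking compound suffixes first means "x.pt.trace.json.gz" is stripped whole
# rather than matching ".json.gz" early and leaving an "x.pt.trace" stem.
TRACE_EXTS = (".pt.trace.json.gz", ".pt.trace.json", ".json.gz", ".json")

def strip_trace_ext(base_name: str) -> str:
    for trace_ext in TRACE_EXTS:
        if base_name.endswith(trace_ext):
            return base_name[: -len(trace_ext)]
    return base_name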

wenxie-amd previously approved these changes Jan 26, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401

Copilot AI Jan 26, 2026


The variable name 'openpyxl' is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import openpyxl as _ or adding a comment explaining this is an availability check.

Suggested change
-import openpyxl  # noqa: F401
+import openpyxl as _openpyxl  # availability check

        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

Copilot AI Jan 26, 2026


The variable name 'TraceLens' is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import TraceLens as _ or adding a comment explaining this is an availability check.

Suggested change
-import TraceLens  # noqa: F401
+import TraceLens as _  # noqa: F401  # availability check


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +1131 to +1132
generate_tracelens_report=getattr(args, "generate_tracelens_report", False),
upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),

Copilot AI Feb 2, 2026


The config file sets generate_tracelens_report: true and mlflow_upload_tracelens_report: true as defaults (lines 28-29), but in trainer.py, the code uses getattr(args, "generate_tracelens_report", False) and getattr(args, "mlflow_upload_tracelens_report", False) with False as the fallback default (lines 1131-1132).

This creates an inconsistency: if these config values are not properly loaded into args for some reason, the fallback defaults in the code (False) differ from the config file defaults (True). This could lead to confusing behavior where users expect TraceLens reports based on the config file but don't get them due to the getattr fallback.

Consider one of:

  1. Use True as the fallback default in getattr to match the config file
  2. Change the config defaults to false to match the code fallback
  3. Ensure the config values are always properly loaded so fallbacks aren't needed
  4. Add a comment explaining why the fallback defaults differ from config defaults
Suggested change
-generate_tracelens_report=getattr(args, "generate_tracelens_report", False),
-upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),
+generate_tracelens_report=getattr(args, "generate_tracelens_report", True),
+upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", True),

Comment on lines +223 to +287
def _ensure_openpyxl_installed() -> bool:
    """
    Ensure openpyxl is installed for XLSX generation.

    Returns:
        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401

        return True
    except ImportError:
        log_rank_0("[TraceLens] openpyxl not found, installing for XLSX support...")
        try:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed openpyxl")
            return True
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
            return False


def _ensure_tracelens_installed() -> bool:
    """
    Ensure TraceLens and its dependencies are installed.

    TraceLens is available from GitHub: https://github.com/AMD-AGI/TraceLens
    XLSX generation requires openpyxl which is installed separately.

    Returns:
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "git+https://github.com/AMD-AGI/TraceLens.git",
                    "-q",
                ],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False

    # Ensure openpyxl is installed for XLSX generation
    _ensure_openpyxl_installed()

    return True

Copilot AI Feb 2, 2026


The auto-installation of packages from GitHub (TraceLens) and PyPI (openpyxl) during runtime could cause issues in production environments:

  1. Security concern: Installing packages at runtime without explicit user control or version pinning could introduce supply chain vulnerabilities
  2. Reproducibility: Different runs might get different package versions
  3. Environment restrictions: Many production environments disable pip install or restrict network access
  4. Multi-process risks: In distributed training, multiple ranks might attempt concurrent installations, leading to race conditions or failures

Consider one or more of these alternatives:

  • Add TraceLens and openpyxl as optional dependencies in setup.py/requirements.txt
  • Document that these dependencies should be pre-installed for TraceLens functionality
  • Add a configuration option to disable auto-installation and just skip the feature if dependencies are missing
  • At minimum, add version pinning to the installation commands (e.g., "git+https://github.com/AMD-AGI/TraceLens.git@v1.0.0")

Comment on lines +259 to +287
try:
    import TraceLens  # noqa: F401

    log_rank_0("[TraceLens] TraceLens is already installed")
except ImportError:
    log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
    try:
        # TraceLens is on GitHub, not PyPI
        subprocess.check_call(
            [
                sys.executable,
                "-m",
                "pip",
                "install",
                "git+https://github.com/AMD-AGI/TraceLens.git",
                "-q",
            ],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
    except subprocess.CalledProcessError as e:
        warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
        return False

# Ensure openpyxl is installed for XLSX generation
_ensure_openpyxl_installed()

return True

Copilot AI Feb 2, 2026


The function _ensure_tracelens_installed() has inconsistent return behavior. When TraceLens is already installed (line 260-262), the function logs a message but doesn't explicitly return True - it falls through to line 285-287 which ensures openpyxl is installed and then returns True. However, this behavior is implicit and could be fragile if the code is modified later.

For clarity and maintainability, add an explicit return True after line 262 or restructure the logic to make the control flow more explicit.
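
A minimal sketch of the restructure this comment asks for; _install_tracelens_from_github is a hypothetical helper wrapping the quoted pip call, while log_rank_0 and _ensure_openpyxl_installed come from the PR's module:

def _ensure_tracelens_installed() -> bool:
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        if not _install_tracelens_from_github():  # hypothetical helper wrapping the pip call
            return False
    # Both paths converge here with TraceLens importable.
    _ensure_openpyxl_installed()
    return True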

Comment on lines +249 to +287
def _ensure_tracelens_installed() -> bool:
    """
    Ensure TraceLens and its dependencies are installed.

    TraceLens is available from GitHub: https://github.com/AMD-AGI/TraceLens
    XLSX generation requires openpyxl which is installed separately.

    Returns:
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "git+https://github.com/AMD-AGI/TraceLens.git",
                    "-q",
                ],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False

    # Ensure openpyxl is installed for XLSX generation
    _ensure_openpyxl_installed()

    return True

Copilot AI Feb 2, 2026


In distributed training environments, if the TraceLens installation function is called by multiple ranks simultaneously (e.g., if upload_mlflow_artifacts is accidentally called from multiple ranks), there could be race conditions or conflicts when multiple processes try to install packages concurrently via pip.

While the current code only calls this from the rank that initializes MLflow (last rank), consider adding defensive checks or documentation to prevent issues if this function is called from the wrong rank. The comment on line 181 mentions "Only the rank that initialized MLflow (last rank) should call this" but there's no programmatic enforcement.

Consider adding a check at the beginning of the function to verify it's being called from the correct rank, or add a distributed barrier/lock mechanism if multiple ranks might call installation functions.
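
A minimal sketch of such a rank guard, assuming torch.distributed is initialized in this training context; _is_mlflow_rank is a hypothetical helper based on the last-rank convention described in the reviews above:

import torch.distributed as dist

def _is_mlflow_rank() -> bool:
    """True only on the rank that initialized MLflow (world_size - 1)."""
    if not dist.is_available() or not dist.is_initialized():
        return True  # single-process runs always qualify
    return dist.get_rank() == dist.get_world_size() - 1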

Comment on lines +396 to +435
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

    # Check XLSX output
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

    # Second call: Generate CSVs only
    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Check CSV outputs (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure

elif output_format == "xlsx":
    # XLSX only: Single file with multiple tabs
    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

elif output_format == "csv":
    # CSV only: Multiple files in a subdirectory per rank
    csv_subdir = os.path.join(output_dir, report_name)
    os.makedirs(csv_subdir, exist_ok=True)
    dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Collect all generated CSV files (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure


Copilot AI Feb 2, 2026


The generate_perf_report_pytorch function is called without catching potential exceptions from the TraceLens library itself (lines 396, 406, 417, 428). While there is a broad exception handler at line 448, if generate_perf_report_pytorch raises an exception on the first call (line 396) when generating XLSX in "all" mode, the second call for CSV generation (line 406) will be skipped, leading to incomplete output.

Consider wrapping each generate_perf_report_pytorch call individually in try-except blocks, especially in the "all" format case, so that if XLSX generation fails, CSV generation can still be attempted, and vice versa. This would make the function more resilient to partial failures.

Suggested change
-    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
-    # Check XLSX output
-    if os.path.exists(xlsx_path):
-        log_rank_0(
-            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
-        )
-        generated_files.append(xlsx_path)
-    # Second call: Generate CSVs only
-    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
-    # Check CSV outputs (escape path to handle [] characters in filenames)
-    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
-    if csv_files:
-        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
-        generated_files.append(csv_subdir)  # Upload directory to preserve structure
-elif output_format == "xlsx":
-    # XLSX only: Single file with multiple tabs
-    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
-    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
-    if os.path.exists(xlsx_path):
-        log_rank_0(
-            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
-        )
-        generated_files.append(xlsx_path)
-elif output_format == "csv":
-    # CSV only: Multiple files in a subdirectory per rank
-    csv_subdir = os.path.join(output_dir, report_name)
-    os.makedirs(csv_subdir, exist_ok=True)
-    dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
-    # Collect all generated CSV files (escape path to handle [] characters in filenames)
-    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
-    if csv_files:
-        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
-        generated_files.append(csv_subdir)  # Upload directory to preserve structure
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
+        )
+    else:
+        # Check XLSX output
+        if os.path.exists(xlsx_path):
+            log_rank_0(
+                f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
+            )
+            generated_files.append(xlsx_path)
+    # Second call: Generate CSVs only
+    try:
+        generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
+        )
+    else:
+        # Check CSV outputs (escape path to handle [] characters in filenames)
+        csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
+        if csv_files:
+            log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
+            generated_files.append(csv_subdir)  # Upload directory to preserve structure
+elif output_format == "xlsx":
+    # XLSX only: Single file with multiple tabs
+    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
+        )
+    else:
+        if os.path.exists(xlsx_path):
+            log_rank_0(
+                f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
+            )
+            generated_files.append(xlsx_path)
+elif output_format == "csv":
+    # CSV only: Multiple files in a subdirectory per rank
+    csv_subdir = os.path.join(output_dir, report_name)
+    os.makedirs(csv_subdir, exist_ok=True)
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
+        )
+    else:
+        # Collect all generated CSV files (escape path to handle [] characters in filenames)
+        csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
+        if csv_files:
+            log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
+            generated_files.append(csv_subdir)  # Upload directory to preserve structure

Move upload_mlflow_artifacts() from global_vars.py to new mlflow_setup.py
to reduce merge conflicts. Includes TraceLens report generation and upload
parameters.

global_vars.py now matches main, avoiding future conflicts when
merging from main branch.