
Conversation


gphuang commented on Dec 18, 2025

feat: Add TraceLens integration for trace analysis with MLflow upload

Adds TraceLens trace analysis capability to automatically generate performance
reports from PyTorch profiler traces and upload them to MLflow.

Features

  • Generate TraceLens analysis reports (XLSX, CSV formats)
  • Auto-install TraceLens from GitHub if not present
  • Upload reports to MLflow artifacts/trace_analysis/
  • Configurable rank filtering and report limits

Config Options

  • mlflow_upload_tracelens_report: false # Enable TraceLens analysis
  • mlflow_tracelens_ranks: null # Ranks to analyze (null = all)
  • mlflow_tracelens_max_reports: null # Max reports (null = unlimited)
  • mlflow_tracelens_output_format: all # all, xlsx, or csv
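
For orientation, a minimal sketch of how the trainer side reads these options; read_tracelens_options is a hypothetical helper, but the getattr fallback pattern and the config keys match the trainer.py code quoted in the reviews below (the non-boolean fallbacks here are illustrative):

def read_tracelens_options(args):
    """Collect the TraceLens options with getattr fallbacks, as trainer.py does."""
    return {
        "upload_report": getattr(args, "mlflow_upload_tracelens_report", False),
        "ranks": getattr(args, "mlflow_tracelens_ranks", None),  # None = all ranks
        "max_reports": getattr(args, "mlflow_tracelens_max_reports", None),  # None = unlimited
        "output_format": getattr(args, "mlflow_tracelens_output_format", "all"),  # all | xlsx | csv
    }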


Copilot AI left a comment


Pull request overview

This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV).

Key changes:

  • New module for MLflow artifact management with TraceLens integration
  • Automatic TraceLens installation from GitHub with fallback CSV generation
  • Configuration options to control trace analysis (ranks, max reports, output formats)

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.

File descriptions:

  • primus/backends/megatron/training/mlflow_artifacts.py: New 725-line module implementing trace/log file uploads and TraceLens report generation with a fallback CSV summary
  • primus/backends/megatron/training/global_vars.py: Adds an import and a wrapper function upload_mlflow_artifacts to expose artifact upload functionality
  • primus/modules/trainer/megatron/trainer.py: Calls upload_mlflow_artifacts before ending the MLflow run, passing configuration parameters from args
  • primus/configs/modules/megatron/primus_megatron_module.yaml: Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis
Comments suppressed due to low confidence (2)

primus/backends/megatron/training/mlflow_artifacts.py:382

  • Variable dfs is not used.
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

primus/backends/megatron/training/mlflow_artifacts.py:370

  • This assignment to 'dfs' is unnecessary as it is redefined before this value is used.
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)



Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.



gphuang force-pushed the feat/12-enable-tracelens-analysis branch from df2e40a to 2861bdf on December 18, 2025 10:33

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.



…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured
as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would
receive a string instead of a list, causing _filter_traces_by_rank to
silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.

When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once
with both output_xlsx_path and output_csvs_dir parameters, parsing
the trace file only once and generating both formats from the same data.

This improves performance significantly for the common use case of
generating both report formats.
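
A minimal sketch of the single-parse path this commit describes; generate_all_formats is a hypothetical wrapper, while the import path and keyword arguments are the ones used elsewhere in this PR:

from TraceLens.Reporting import generate_perf_report_pytorch

def generate_all_formats(trace_file: str, xlsx_path: str, csv_subdir: str):
    # One parse of the trace file, both output formats from the same data.
    return generate_perf_report_pytorch(
        trace_file,
        output_xlsx_path=xlsx_path,  # multi-tab Excel workbook
        output_csvs_dir=csv_subdir,  # one CSV per analysis table
    )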

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

After TraceLens reports are successfully uploaded to MLflow, the local
tracelens_reports directory is automatically cleaned up to save disk space.

This addresses the issue of temporary directories not being cleaned up
after artifact upload. The reports remain accessible in MLflow while
freeing up local storage.

Other directories checked:
- tensorboard_dir: Contains original trace files, NOT temporary
- exp_root_path/logs: Contains original log files, NOT temporary
- tracelens_reports: Processed reports uploaded to MLflow, safe to clean up

Added mlflow_tracelens_cleanup_after_upload parameter to control whether
local TraceLens reports are removed after upload to MLflow.

Default: True (cleanup to save disk space)
Set to False to keep reports locally for inspection/debugging

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): Production runs, save disk space
- False: Development/debugging, keep local copies for inspection
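
A minimal sketch of the cleanup behavior described above; cleanup_reports_dir is a hypothetical helper, and print stands in for the module's warning_rank_0:

import os
import shutil

def cleanup_reports_dir(reports_dir: str, cleanup_after_upload: bool = True) -> None:
    """Remove the local TraceLens reports directory once MLflow holds the copies."""
    if not cleanup_after_upload or not os.path.isdir(reports_dir):
        return
    try:
        shutil.rmtree(reports_dir)
    except OSError as e:
        print(f"[TraceLens] Failed to cleanup reports directory: {e}")  # stand-in for warning_rank_0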

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comment on lines +719 to +731
if ranks is not None and isinstance(ranks, str):
    import ast

    try:
        ranks = ast.literal_eval(ranks)
        if not isinstance(ranks, list):
            log_rank_0(
                f"[TraceLens] Warning: ranks evaluated to {type(ranks).__name__}, expected list. Using None."
            )
            ranks = None
    except (ValueError, SyntaxError) as e:
        log_rank_0(f"[TraceLens] Warning: Failed to parse ranks '{ranks}': {e}. Using None.")
        ranks = None

Copilot AI Jan 20, 2026


When parsing the ranks parameter as a string using ast.literal_eval, the code doesn't handle the case where ranks could be a single integer instead of a list. If a user mistakenly provides mlflow_tracelens_ranks: 0 instead of mlflow_tracelens_ranks: [0], the evaluation succeeds but the type check fails, resetting ranks to None. Consider adding explicit handling for integer input to convert it to a single-element list.
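
A minimal sketch of the handling this comment suggests; normalize_ranks is a hypothetical helper combining the quoted parsing with the proposed integer case:

import ast

def normalize_ranks(ranks):
    """Coerce a YAML-provided ranks value to a list of ranks, or None for all ranks."""
    if isinstance(ranks, str):
        try:
            ranks = ast.literal_eval(ranks)  # e.g. '[0,8]' -> [0, 8], '0' -> 0
        except (ValueError, SyntaxError):
            return None
    if isinstance(ranks, int):
        return [ranks]  # single integer -> single-element list
    return ranks if isinstance(ranks, list) else None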

Comment on lines +390 to +412
if output_format == "all":
    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
    csv_subdir = os.path.join(output_dir, report_name)
    os.makedirs(csv_subdir, exist_ok=True)

    # First call: Generate XLSX only
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

    # Check XLSX output
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

    # Second call: Generate CSVs only
    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Check CSV outputs (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure

Copilot AI Jan 20, 2026


For the "all" format, the function calls generate_perf_report_pytorch twice (once for XLSX, once for CSV), which means parsing the trace file twice. This could be inefficient for large trace files. Consider modifying the logic to parse the trace once and generate both outputs if the TraceLens API supports it.

Comment on lines +237 to +245
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    log_rank_0("[TraceLens] Successfully installed openpyxl")
    return True
except subprocess.CalledProcessError as e:
    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")

Copilot AI Jan 20, 2026


The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.

Suggested change
-    subprocess.check_call(
-        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
-        stdout=subprocess.DEVNULL,
-        stderr=subprocess.DEVNULL,
-    )
-    log_rank_0("[TraceLens] Successfully installed openpyxl")
-    return True
-except subprocess.CalledProcessError as e:
-    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
+    result = subprocess.run(
+        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
+        stdout=subprocess.DEVNULL,
+        stderr=subprocess.PIPE,
+        text=True,
+        check=True,
+    )
+    log_rank_0("[TraceLens] Successfully installed openpyxl")
+    return True
+except subprocess.CalledProcessError as e:
+    stderr_output = e.stderr.strip() if e.stderr else "No error output from pip."
+    warning_rank_0(
+        f"[TraceLens] Failed to install openpyxl: {e}. Stderr: {stderr_output}"
+    )

Comment on lines +396 to +406
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

# Check XLSX output
if os.path.exists(xlsx_path):
    log_rank_0(
        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
    )
    generated_files.append(xlsx_path)

# Second call: Generate CSVs only
generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Copilot AI Jan 20, 2026


The variable 'dfs' is assigned but never used in the "all" format branch. This appears to be leftover from debugging or an incomplete implementation. Consider removing the assignment or using the value if it serves a purpose.

# CSV only: Multiple files in a subdirectory per rank
csv_subdir = os.path.join(output_dir, report_name)
os.makedirs(csv_subdir, exist_ok=True)
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Copilot AI Jan 20, 2026


The variable 'dfs' is assigned but never used. Consider removing the assignment or using the value if it serves a purpose.

Suggested change
-dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

Comment on lines +267 to +278
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git",
        "-q",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Copilot AI Jan 20, 2026


Installing packages from GitHub using subprocess without verifying the package integrity (e.g., checking commit hash or release tag) could pose a security risk. Consider pinning to a specific version/tag or commit hash for TraceLens installation to ensure reproducibility and security.
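
A hedged sketch of the pinning this comment suggests; "v1.0.0" is a placeholder, not a real TraceLens tag, and the surrounding install logic is unchanged:

import subprocess
import sys

# Pinning a real release tag or commit hash from the TraceLens repo would make
# the runtime install reproducible and auditable.
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git@v1.0.0",
    ]
)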

Comment on lines +267 to +278
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git",
        "-q",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

Copilot AI Jan 20, 2026


The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.

Comment on lines 130 to 131
artifacts to MLflow. Only the rank that initialized MLflow (last rank)
should call this to avoid duplicate uploads.

Copilot AI Jan 22, 2026


Documentation inconsistency: The comment states "Only the rank that initialized MLflow (last rank)" which is correct, but could be more precise by stating it's rank (world_size - 1) to match the implementation in global_vars.py line 71.

Suggested change
-artifacts to MLflow. Only the rank that initialized MLflow (last rank)
-should call this to avoid duplicate uploads.
+artifacts to MLflow. Only the rank that initialized MLflow (rank
+``world_size - 1``, i.e., the last rank) should call this to avoid
+duplicate uploads.

Comment on lines +237 to +246
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    log_rank_0("[TraceLens] Successfully installed openpyxl")
    return True
except subprocess.CalledProcessError as e:
    warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
    return False

Copilot AI Jan 22, 2026


Subprocess calls silence all output which makes troubleshooting difficult. When a pip install fails, users won't see helpful error messages. Consider using stderr=subprocess.STDOUT to capture errors in the exception, or at minimum log the stderr output when an error occurs.

Comment on lines +773 to +774
except Exception as e:
    warning_rank_0(f"[TraceLens] Failed to cleanup reports directory: {e}")

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., IOError, OSError, shutil.Error) related to file operations rather than using a bare except clause.

Comment on lines +761 to +762
except Exception as e:
    warning_rank_0(f"[MLflow] Failed to upload report {report_path}: {e}")

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., OSError, IOError for file operations) rather than using a bare except clause.

Comment on lines +478 to +485
try:
    # Load trace file
    if trace_file.endswith(".gz"):
        with gzip.open(trace_file, "rt", encoding="utf-8") as f:
            trace_data = json.load(f)
    else:
        with open(trace_file, "r", encoding="utf-8") as f:
            trace_data = json.load(f)

Copilot AI Jan 22, 2026


Potential security risk: The code opens user-provided trace files without validating their content or size. Large or malformed JSON files could cause memory exhaustion or denial of service. Consider adding file size limits and using streaming JSON parsers for large files.

Comment on lines +478 to +485
try:
    # Load trace file
    if trace_file.endswith(".gz"):
        with gzip.open(trace_file, "rt", encoding="utf-8") as f:
            trace_data = json.load(f)
    else:
        with open(trace_file, "r", encoding="utf-8") as f:
            trace_data = json.load(f)

Copilot AI Jan 22, 2026


Performance concern: Loading entire trace files into memory could cause issues with large traces. PyTorch profiler traces can be hundreds of megabytes or larger in distributed training scenarios. Consider using streaming JSON parsers (like ijson) or add file size validation before attempting to load.
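
A minimal sketch of the size guard this comment suggests; MAX_TRACE_BYTES and trace_small_enough are assumptions, not part of the PR:

import os

MAX_TRACE_BYTES = 2 * 1024**3  # assumed 2 GiB cap; tune per deployment

def trace_small_enough(trace_file: str) -> bool:
    """Cheap size pre-check before json.load to avoid loading pathological traces."""
    return os.path.getsize(trace_file) <= MAX_TRACE_BYTES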

except json.JSONDecodeError as e:
    warning_rank_0(f"[TraceLens] Failed to parse trace JSON: {e}")
    return None
except Exception as e:

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions like KeyboardInterrupt and SystemExit. Consider catching specific exceptions (e.g., json.JSONDecodeError, IOError, OSError) and allowing system exceptions to propagate.

Suggested change
-except Exception as e:
+except (OSError, ValueError, csv.Error, KeyError) as e:


    mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
    uploaded_count += 1
except Exception as e:

Copilot AI Jan 22, 2026


The exception handling is too broad and swallows all exceptions including system exceptions. Consider catching specific exceptions (e.g., OSError, IOError for file operations) rather than using a bare except clause.

Suggested change
-except Exception as e:
+except (OSError, IOError) as e:

Comment on lines +345 to +453
def generate_tracelens_report(
    trace_file: str,
    output_dir: str,
    report_name: Optional[str] = None,
    output_format: str = "all",
) -> List[str]:
    """
    Generate a TraceLens analysis report for a single trace file.

    Args:
        trace_file: Path to the PyTorch profiler trace file (JSON/JSON.GZ)
        output_dir: Directory to save the report
        report_name: Optional custom name for the report (base name for CSVs)
        output_format: Output format:
            - "all" (default): Both XLSX and CSV files
            - "xlsx": Single multi-tab Excel file with detailed analysis
            - "csv": Multiple CSV files (kernels, memory, communication, etc.)

    Returns:
        List of paths to generated report files
    """
    if not os.path.exists(trace_file):
        warning_rank_0(f"[TraceLens] Trace file not found: {trace_file}")
        return []

    os.makedirs(output_dir, exist_ok=True)

    # Generate base name from trace filename if not provided
    if report_name is None:
        base_name = os.path.basename(trace_file)
        # Remove extensions like .json.gz
        for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
            if base_name.endswith(trace_ext):
                base_name = base_name[: -len(trace_ext)]
                break
        report_name = base_name

    try:
        # Try using TraceLens Python API directly
        from TraceLens.Reporting import generate_perf_report_pytorch

        generated_files = []

        # For "all" format: TraceLens uses either/or logic - if output_csvs_dir is set,
        # it ONLY generates CSVs. So we need to call it twice for both formats.
        if output_format == "all":
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)

            # First call: Generate XLSX only
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

            # Check XLSX output
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

            # Second call: Generate CSVs only
            generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Check CSV outputs (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        elif output_format == "xlsx":
            # XLSX only: Single file with multiple tabs
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

        elif output_format == "csv":
            # CSV only: Multiple files in a subdirectory per rank
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Collect all generated CSV files (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        if generated_files:
            return generated_files

        warning_rank_0(f"[TraceLens] No output files generated for: {trace_file}")
        return []

    except ImportError:
        log_rank_0("[TraceLens] TraceLens not available, using fallback CSV summary")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []

    except Exception as e:
        warning_rank_0(f"[TraceLens] Error generating report: {e}")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []


Copilot AI Jan 22, 2026


Missing input validation for output_format parameter. The function expects 'all', 'xlsx', or 'csv' but doesn't validate the input. If an invalid value is provided (e.g., 'json'), the function will silently generate no output. Consider adding validation at the start of the function to raise a clear error for invalid format values.
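
A minimal sketch of the validation this comment asks for; validate_output_format and VALID_OUTPUT_FORMATS are hypothetical names:

VALID_OUTPUT_FORMATS = {"all", "xlsx", "csv"}

def validate_output_format(output_format: str) -> None:
    """Fail fast on values like 'json' instead of silently generating nothing."""
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(
            f"Invalid output_format {output_format!r}; expected one of {sorted(VALID_OUTPUT_FORMATS)}"
        )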

Comment on lines +376 to +380
for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
    if base_name.endswith(trace_ext):
        base_name = base_name[: -len(trace_ext)]
        break
report_name = base_name

Copilot AI Jan 22, 2026


Potential bug: The extension removal loop on line 376-379 uses a break statement which stops after the first matching extension. However, if a filename ends with multiple extensions that don't exactly match any single pattern (e.g., a file ending in ".trace.json.gz" when the patterns check for ".pt.trace.json.gz"), only the first matched extension would be removed. The logic appears correct for the given patterns, but could be fragile if patterns change. Consider using rsplit or a more explicit approach.
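
A minimal sketch of a longest-suffix-first variant; strip_trace_ext is a hypothetical helper that reorders the quoted extension list:

# Checking compound suffixes first means "x.pt.trace.json.gz" is stripped whole
# rather than matching ".json.gz" early and leaving an "x.pt.trace" stem.
TRACE_EXTS = (".pt.trace.json.gz", ".pt.trace.json", ".json.gz", ".json")

def strip_trace_ext(base_name: str) -> str:
    for trace_ext in TRACE_EXTS:
        if base_name.endswith(trace_ext):
            return base_name[: -len(trace_ext)]
    return base_name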

wenxie-amd previously approved these changes Jan 26, 2026

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401

Copilot AI Jan 26, 2026


The variable name 'openpyxl' is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import openpyxl as _ or adding a comment explaining this is an availability check.

Suggested change
-import openpyxl  # noqa: F401
+import openpyxl as _openpyxl  # availability check

        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

Copilot AI Jan 26, 2026


The variable name 'TraceLens' is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import TraceLens as _ or adding a comment explaining this is an availability check.

Suggested change
-import TraceLens  # noqa: F401
+import TraceLens as _  # noqa: F401  # availability check


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +1131 to +1132
generate_tracelens_report=getattr(args, "generate_tracelens_report", False),
upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),

Copilot AI Feb 2, 2026


The config file sets generate_tracelens_report: true and mlflow_upload_tracelens_report: true as defaults (lines 28-29), but in trainer.py, the code uses getattr(args, "generate_tracelens_report", False) and getattr(args, "mlflow_upload_tracelens_report", False) with False as the fallback default (lines 1131-1132).

This creates an inconsistency: if these config values are not properly loaded into args for some reason, the fallback defaults in the code (False) differ from the config file defaults (True). This could lead to confusing behavior where users expect TraceLens reports based on the config file but don't get them due to the getattr fallback.

Consider one of:

  1. Use True as the fallback default in getattr to match the config file
  2. Change the config defaults to false to match the code fallback
  3. Ensure the config values are always properly loaded so fallbacks aren't needed
  4. Add a comment explaining why the fallback defaults differ from config defaults
Suggested change
-generate_tracelens_report=getattr(args, "generate_tracelens_report", False),
-upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),
+generate_tracelens_report=getattr(args, "generate_tracelens_report", True),
+upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", True),

Comment on lines +223 to +287
def _ensure_openpyxl_installed() -> bool:
    """
    Ensure openpyxl is installed for XLSX generation.

    Returns:
        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401

        return True
    except ImportError:
        log_rank_0("[TraceLens] openpyxl not found, installing for XLSX support...")
        try:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed openpyxl")
            return True
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
            return False


def _ensure_tracelens_installed() -> bool:
    """
    Ensure TraceLens and its dependencies are installed.

    TraceLens is available from GitHub: https://github.com/AMD-AGI/TraceLens
    XLSX generation requires openpyxl which is installed separately.

    Returns:
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "git+https://github.com/AMD-AGI/TraceLens.git",
                    "-q",
                ],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False

    # Ensure openpyxl is installed for XLSX generation
    _ensure_openpyxl_installed()

    return True

Copilot AI Feb 2, 2026


The auto-installation of packages from GitHub (TraceLens) and PyPI (openpyxl) during runtime could cause issues in production environments:

  1. Security concern: Installing packages at runtime without explicit user control or version pinning could introduce supply chain vulnerabilities
  2. Reproducibility: Different runs might get different package versions
  3. Environment restrictions: Many production environments disable pip install or restrict network access
  4. Multi-process risks: In distributed training, multiple ranks might attempt concurrent installations, leading to race conditions or failures

Consider one or more of these alternatives:

  • Add TraceLens and openpyxl as optional dependencies in setup.py/requirements.txt
  • Document that these dependencies should be pre-installed for TraceLens functionality
  • Add a configuration option to disable auto-installation and just skip the feature if dependencies are missing
  • At minimum, add version pinning to the installation commands (e.g., "git+https://github.com/AMD-AGI/TraceLens.git@v1.0.0")

Comment on lines +259 to +287
try:
    import TraceLens  # noqa: F401

    log_rank_0("[TraceLens] TraceLens is already installed")
except ImportError:
    log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
    try:
        # TraceLens is on GitHub, not PyPI
        subprocess.check_call(
            [
                sys.executable,
                "-m",
                "pip",
                "install",
                "git+https://github.com/AMD-AGI/TraceLens.git",
                "-q",
            ],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
    except subprocess.CalledProcessError as e:
        warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
        return False

# Ensure openpyxl is installed for XLSX generation
_ensure_openpyxl_installed()

return True

Copilot AI Feb 2, 2026


The function _ensure_tracelens_installed() has inconsistent return behavior. When TraceLens is already installed (line 260-262), the function logs a message but doesn't explicitly return True - it falls through to line 285-287 which ensures openpyxl is installed and then returns True. However, this behavior is implicit and could be fragile if the code is modified later.

For clarity and maintainability, add an explicit return True after line 262 or restructure the logic to make the control flow more explicit.
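
A minimal sketch of the restructure this comment asks for; _install_tracelens_from_github is a hypothetical helper wrapping the quoted pip call, while log_rank_0 and _ensure_openpyxl_installed come from the PR's module:

def _ensure_tracelens_installed() -> bool:
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        if not _install_tracelens_from_github():  # hypothetical helper wrapping the pip call
            return False
    # Both paths converge here with TraceLens importable.
    _ensure_openpyxl_installed()
    return True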

Comment on lines +249 to +287
def _ensure_tracelens_installed() -> bool:
    """
    Ensure TraceLens and its dependencies are installed.

    TraceLens is available from GitHub: https://github.com/AMD-AGI/TraceLens
    XLSX generation requires openpyxl which is installed separately.

    Returns:
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "git+https://github.com/AMD-AGI/TraceLens.git",
                    "-q",
                ],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False

    # Ensure openpyxl is installed for XLSX generation
    _ensure_openpyxl_installed()

    return True

Copilot AI Feb 2, 2026


In distributed training environments, if the TraceLens installation function is called by multiple ranks simultaneously (e.g., if upload_mlflow_artifacts is accidentally called from multiple ranks), there could be race conditions or conflicts when multiple processes try to install packages concurrently via pip.

While the current code only calls this from the rank that initializes MLflow (last rank), consider adding defensive checks or documentation to prevent issues if this function is called from the wrong rank. The comment on line 181 mentions "Only the rank that initialized MLflow (last rank) should call this" but there's no programmatic enforcement.

Consider adding a check at the beginning of the function to verify it's being called from the correct rank, or add a distributed barrier/lock mechanism if multiple ranks might call installation functions.
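
A minimal sketch of such a rank guard, assuming torch.distributed is initialized in this training context; _is_mlflow_rank is a hypothetical helper based on the last-rank convention described in the reviews above:

import torch.distributed as dist

def _is_mlflow_rank() -> bool:
    """True only on the rank that initialized MLflow (world_size - 1)."""
    if not dist.is_available() or not dist.is_initialized():
        return True  # single-process runs always qualify
    return dist.get_rank() == dist.get_world_size() - 1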

Comment on lines +396 to +435
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

    # Check XLSX output
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

    # Second call: Generate CSVs only
    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Check CSV outputs (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure

elif output_format == "xlsx":
    # XLSX only: Single file with multiple tabs
    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

elif output_format == "csv":
    # CSV only: Multiple files in a subdirectory per rank
    csv_subdir = os.path.join(output_dir, report_name)
    os.makedirs(csv_subdir, exist_ok=True)
    dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Collect all generated CSV files (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure


Copilot AI Feb 2, 2026


The generate_perf_report_pytorch function is called without catching potential exceptions from the TraceLens library itself (lines 396, 406, 417, 428). While there is a broad exception handler at line 448, if generate_perf_report_pytorch raises an exception on the first call (line 396) when generating XLSX in "all" mode, the second call for CSV generation (line 406) will be skipped, leading to incomplete output.

Consider wrapping each generate_perf_report_pytorch call individually in try-except blocks, especially in the "all" format case, so that if XLSX generation fails, CSV generation can still be attempted, and vice versa. This would make the function more resilient to partial failures.

Suggested change
-    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
-    # Check XLSX output
-    if os.path.exists(xlsx_path):
-        log_rank_0(
-            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
-        )
-        generated_files.append(xlsx_path)
-    # Second call: Generate CSVs only
-    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
-    # Check CSV outputs (escape path to handle [] characters in filenames)
-    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
-    if csv_files:
-        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
-        generated_files.append(csv_subdir)  # Upload directory to preserve structure
-elif output_format == "xlsx":
-    # XLSX only: Single file with multiple tabs
-    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
-    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
-    if os.path.exists(xlsx_path):
-        log_rank_0(
-            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
-        )
-        generated_files.append(xlsx_path)
-elif output_format == "csv":
-    # CSV only: Multiple files in a subdirectory per rank
-    csv_subdir = os.path.join(output_dir, report_name)
-    os.makedirs(csv_subdir, exist_ok=True)
-    dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
-    # Collect all generated CSV files (escape path to handle [] characters in filenames)
-    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
-    if csv_files:
-        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
-        generated_files.append(csv_subdir)  # Upload directory to preserve structure
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
+        )
+    else:
+        # Check XLSX output
+        if os.path.exists(xlsx_path):
+            log_rank_0(
+                f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
+            )
+            generated_files.append(xlsx_path)
+    # Second call: Generate CSVs only
+    try:
+        generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
+        )
+    else:
+        # Check CSV outputs (escape path to handle [] characters in filenames)
+        csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
+        if csv_files:
+            log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
+            generated_files.append(csv_subdir)  # Upload directory to preserve structure
+elif output_format == "xlsx":
+    # XLSX only: Single file with multiple tabs
+    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
+        )
+    else:
+        if os.path.exists(xlsx_path):
+            log_rank_0(
+                f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
+            )
+            generated_files.append(xlsx_path)
+elif output_format == "csv":
+    # CSV only: Multiple files in a subdirectory per rank
+    csv_subdir = os.path.join(output_dir, report_name)
+    os.makedirs(csv_subdir, exist_ok=True)
+    try:
+        dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
+    except Exception as e:
+        warning_rank_0(
+            f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
+        )
+    else:
+        # Collect all generated CSV files (escape path to handle [] characters in filenames)
+        csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
+        if csv_files:
+            log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
+            generated_files.append(csv_subdir)  # Upload directory to preserve structure

Move upload_mlflow_artifacts() from global_vars.py to new mlflow_setup.py
to reduce merge conflicts. Includes TraceLens report generation and upload
parameters.

global_vars.py now matches main, avoiding future conflicts when
merging from main branch.