feat: Add TraceLens integration for trace analysis with MLflow upload #439
base: main
Conversation
- Add TraceLens trace analysis report generation (XLSX, CSV formats)
- Add mlflow_upload_tracelens_report config option (default: false)
- Add mlflow_tracelens_ranks, mlflow_tracelens_max_reports options
- Add mlflow_tracelens_output_format option (all, xlsx, csv)
- Auto-install TraceLens from GitHub if not present
- Upload analysis reports to MLflow artifacts/trace_analysis/
Pull request overview
This PR adds TraceLens integration to automatically generate performance analysis reports from PyTorch profiler traces and upload them to MLflow. TraceLens is auto-installed from GitHub if not present, and users can configure rank filtering, report limits, and output formats (XLSX/CSV).
Key changes:
- New module for MLflow artifact management with TraceLens integration
- Automatic TraceLens installation from GitHub with fallback CSV generation
- Configuration options to control trace analysis (ranks, max reports, output formats)
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| primus/backends/megatron/training/mlflow_artifacts.py | New 725-line module implementing trace/log file uploads and TraceLens report generation with fallback CSV summary |
| primus/backends/megatron/training/global_vars.py | Adds import and wrapper function upload_mlflow_artifacts to expose artifact upload functionality |
| primus/modules/trainer/megatron/trainer.py | Calls upload_mlflow_artifacts before ending MLflow run with configuration parameters from args |
| primus/configs/modules/megatron/primus_megatron_module.yaml | Adds 6 new configuration options for controlling trace/log uploads and TraceLens analysis |
Comments suppressed due to low confidence (2)
primus/backends/megatron/training/mlflow_artifacts.py:382: Variable dfs is not used.

```python
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
```

primus/backends/megatron/training/mlflow_artifacts.py:370: This assignment to dfs is unnecessary, as it is redefined before the value is used.

```python
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
Force-pushed from df2e40a to 2861bdf (compare)
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
…config parser

Addresses Copilot review comment: if mlflow_tracelens_ranks is configured as a string in YAML (e.g., '[0,8]' instead of [0, 8]), the code would receive a string instead of a list, causing _filter_traces_by_rank to silently filter out all trace files.

Added ast.literal_eval() conversion in:
- generate_tracelens_reports()
- upload_tracelens_reports_to_mlflow()

Falls back to None (process all ranks) with a warning if parsing fails.
When output_format='all', previously the trace file was parsed twice:
- Once for XLSX generation
- Once for CSV generation

Now when format is 'all', we call generate_perf_report_pytorch once with both output_xlsx_path and output_csvs_dir parameters, parsing the trace file only once and generating both formats from the same data. This significantly improves performance for the common use case of generating both report formats.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
After TraceLens reports are successfully uploaded to MLflow, the local tracelens_reports directory is automatically cleaned up to save disk space. This addresses the issue of temporary directories not being cleaned up after artifact upload. The reports remain accessible in MLflow while freeing up local storage.

Other directories checked:
- tensorboard_dir: contains original trace files, NOT temporary
- exp_root_path/logs: contains original log files, NOT temporary
- tracelens_reports: processed reports uploaded to MLflow, safe to clean up
Added mlflow_tracelens_cleanup_after_upload parameter to control whether local TraceLens reports are removed after upload to MLflow. Default: True (clean up to save disk space); set to False to keep reports locally for inspection/debugging.

Changes:
- Added cleanup_after_upload parameter to upload_tracelens_reports_to_mlflow()
- Added tracelens_cleanup_after_upload to upload_artifacts_to_mlflow()
- Added mlflow_tracelens_cleanup_after_upload config in YAML (default: true)
- Updated trainer to pass through the parameter

Use cases:
- True (default): production runs, save disk space
- False: development/debugging, keep local copies for inspection
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
```python
if ranks is not None and isinstance(ranks, str):
    import ast

    try:
        ranks = ast.literal_eval(ranks)
        if not isinstance(ranks, list):
            log_rank_0(
                f"[TraceLens] Warning: ranks evaluated to {type(ranks).__name__}, expected list. Using None."
            )
            ranks = None
    except (ValueError, SyntaxError) as e:
        log_rank_0(f"[TraceLens] Warning: Failed to parse ranks '{ranks}': {e}. Using None.")
        ranks = None
```
Copilot AI · Jan 20, 2026
When parsing the ranks parameter as a string using ast.literal_eval, the code doesn't handle the case where ranks could be a single integer instead of a list. If a user mistakenly provides mlflow_tracelens_ranks: 0 instead of mlflow_tracelens_ranks: [0], the evaluation succeeds but the type check fails, resetting ranks to None. Consider adding explicit handling for integer input to convert it to a single-element list.
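A minimal sketch of that handling, reusing the names from the excerpt above; the integer branch is the addition:

```python
try:
    ranks = ast.literal_eval(ranks)
    if isinstance(ranks, int):
        ranks = [ranks]  # accept a bare integer as a single-element list
    elif not isinstance(ranks, list):
        log_rank_0(
            f"[TraceLens] Warning: ranks evaluated to {type(ranks).__name__}, expected list. Using None."
        )
        ranks = None
except (ValueError, SyntaxError) as e:
    log_rank_0(f"[TraceLens] Warning: Failed to parse ranks '{ranks}': {e}. Using None.")
    ranks = None
```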
```python
if output_format == "all":
    xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
    csv_subdir = os.path.join(output_dir, report_name)
    os.makedirs(csv_subdir, exist_ok=True)

    # First call: Generate XLSX only
    dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

    # Check XLSX output
    if os.path.exists(xlsx_path):
        log_rank_0(
            f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
        )
        generated_files.append(xlsx_path)

    # Second call: Generate CSVs only
    generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

    # Check CSV outputs (escape path to handle [] characters in filenames)
    csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
    if csv_files:
        log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
        generated_files.append(csv_subdir)  # Upload directory to preserve structure
```
Copilot AI · Jan 20, 2026
For the "all" format, the function calls generate_perf_report_pytorch twice (once for XLSX, once for CSV), which means parsing the trace file twice. This could be inefficient for large trace files. Consider modifying the logic to parse the trace once and generate both outputs if the TraceLens API supports it.
```python
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        log_rank_0("[TraceLens] Successfully installed openpyxl")
        return True
    except subprocess.CalledProcessError as e:
        warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
```
Copilot AI · Jan 20, 2026
The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.
Suggested change:

```python
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
            check=True,
        )
        log_rank_0("[TraceLens] Successfully installed openpyxl")
        return True
    except subprocess.CalledProcessError as e:
        stderr_output = e.stderr.strip() if e.stderr else "No error output from pip."
        warning_rank_0(
            f"[TraceLens] Failed to install openpyxl: {e}. Stderr: {stderr_output}"
        )
```
```python
dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

# Check XLSX output
if os.path.exists(xlsx_path):
    log_rank_0(
        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
    )
    generated_files.append(xlsx_path)

# Second call: Generate CSVs only
generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
```
Copilot AI · Jan 20, 2026
The variable 'dfs' is assigned but never used in the "all" format branch. This appears to be leftover from debugging or an incomplete implementation. Consider removing the assignment or using the value if it serves a purpose.
```python
# CSV only: Multiple files in a subdirectory per rank
csv_subdir = os.path.join(output_dir, report_name)
os.makedirs(csv_subdir, exist_ok=True)
dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
```
Copilot AI · Jan 20, 2026
The variable 'dfs' is assigned but never used. Consider removing the assignment or using the value if it serves a purpose.
Suggested change:

```python
generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
```
```python
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        "git+https://github.com/AMD-AGI/TraceLens.git",
        "-q",
    ],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
```
Copilot AI · Jan 20, 2026
Installing packages from GitHub using subprocess without verifying the package integrity (e.g., checking commit hash or release tag) could pose a security risk. Consider pinning to a specific version/tag or commit hash for TraceLens installation to ensure reproducibility and security.
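A hedged sketch of a pinned install; TRACELENS_REF is a placeholder, not a published TraceLens tag:

```python
import subprocess
import sys

TRACELENS_REF = "v1.0.0"  # placeholder: pin to a real release tag or commit SHA
subprocess.check_call(
    [
        sys.executable,
        "-m",
        "pip",
        "install",
        f"git+https://github.com/AMD-AGI/TraceLens.git@{TRACELENS_REF}",
        "-q",
    ]
)
```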
Copilot AI · Jan 20, 2026
The subprocess call silently suppresses all output which could make debugging installation failures difficult. Consider logging stderr on failure or allowing users to see installation progress when it takes a long time.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 13 comments.
```python
    artifacts to MLflow. Only the rank that initialized MLflow (last rank)
    should call this to avoid duplicate uploads.
```
Copilot AI · Jan 22, 2026
Documentation inconsistency: The comment states "Only the rank that initialized MLflow (last rank)" which is correct, but could be more precise by stating it's rank (world_size - 1) to match the implementation in global_vars.py line 71.
Suggested change:

```python
    artifacts to MLflow. Only the rank that initialized MLflow (rank
    ``world_size - 1``, i.e., the last rank) should call this to avoid
    duplicate uploads.
```
```python
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        log_rank_0("[TraceLens] Successfully installed openpyxl")
        return True
    except subprocess.CalledProcessError as e:
        warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
        return False
```
Copilot AI · Jan 22, 2026
Subprocess calls silence all output which makes troubleshooting difficult. When a pip install fails, users won't see helpful error messages. Consider using stderr=subprocess.STDOUT to capture errors in the exception, or at minimum log the stderr output when an error occurs.
```python
    except Exception as e:
        warning_rank_0(f"[TraceLens] Failed to cleanup reports directory: {e}")
```
Copilot AI · Jan 22, 2026
The exception handler is broader than necessary and swallows every Exception subclass. Consider catching the specific exceptions file operations can raise (e.g., OSError, shutil.Error) instead of this catch-all clause.
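A narrower variant, sketched under the assumption that the cleanup uses shutil.rmtree (reports_dir is a hypothetical name, not from the PR):

```python
import shutil

try:
    shutil.rmtree(reports_dir)  # hypothetical cleanup target
except (OSError, shutil.Error) as e:
    warning_rank_0(f"[TraceLens] Failed to cleanup reports directory: {e}")
```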
```python
    except Exception as e:
        warning_rank_0(f"[MLflow] Failed to upload report {report_path}: {e}")
```
Copilot AI · Jan 22, 2026
The exception handler is broader than necessary and swallows every Exception subclass. Consider catching the specific exceptions file operations can raise (e.g., OSError) instead of this catch-all clause.
```python
    try:
        # Load trace file
        if trace_file.endswith(".gz"):
            with gzip.open(trace_file, "rt", encoding="utf-8") as f:
                trace_data = json.load(f)
        else:
            with open(trace_file, "r", encoding="utf-8") as f:
                trace_data = json.load(f)
```
Copilot AI · Jan 22, 2026
Potential security risk: The code opens user-provided trace files without validating their content or size. Large or malformed JSON files could cause memory exhaustion or denial of service. Consider adding file size limits and using streaming JSON parsers for large files.
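A minimal size-guard sketch placed before the load; the 2 GiB threshold is an assumption, not part of the PR:

```python
MAX_TRACE_BYTES = 2 * 1024**3  # assumed cap; tune per deployment
if os.path.getsize(trace_file) > MAX_TRACE_BYTES:
    warning_rank_0(f"[TraceLens] Trace file exceeds size limit, skipping: {trace_file}")
    return None
```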
Copilot AI · Jan 22, 2026
Performance concern: Loading entire trace files into memory could cause issues with large traces. PyTorch profiler traces can be hundreds of megabytes or larger in distributed training scenarios. Consider using streaming JSON parsers (like ijson) or add file size validation before attempting to load.
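A streaming sketch using the third-party ijson package, assuming the standard Chrome-trace layout where events live under a top-level "traceEvents" array:

```python
import gzip

import ijson  # streaming JSON parser; not a dependency of this PR


def iter_trace_events(trace_file: str):
    """Yield profiler events one at a time instead of loading the whole file."""
    opener = gzip.open if trace_file.endswith(".gz") else open
    with opener(trace_file, "rb") as f:
        yield from ijson.items(f, "traceEvents.item")
```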
```python
    except json.JSONDecodeError as e:
        warning_rank_0(f"[TraceLens] Failed to parse trace JSON: {e}")
        return None
    except Exception as e:
```
Copilot AI · Jan 22, 2026
The except Exception handler is broader than needed and hides unexpected errors (note that KeyboardInterrupt and SystemExit derive from BaseException and are not actually caught by it). Consider catching the specific exceptions this block can raise (e.g., OSError, ValueError) and letting everything else propagate.
Suggested change:

```python
    except (OSError, ValueError, csv.Error, KeyError) as e:
```
```python
            mlflow_writer.log_artifact(log_file, artifact_path=artifact_subpath)
            uploaded_count += 1
        except Exception as e:
```
Copilot AI · Jan 22, 2026
The exception handler is broader than necessary and swallows every Exception subclass. Consider catching the specific exceptions file operations can raise (e.g., OSError) instead of this catch-all clause.
Suggested change:

```python
        except (OSError, IOError) as e:
```
```python
def generate_tracelens_report(
    trace_file: str,
    output_dir: str,
    report_name: Optional[str] = None,
    output_format: str = "all",
) -> List[str]:
    """
    Generate a TraceLens analysis report for a single trace file.

    Args:
        trace_file: Path to the PyTorch profiler trace file (JSON/JSON.GZ)
        output_dir: Directory to save the report
        report_name: Optional custom name for the report (base name for CSVs)
        output_format: Output format:
            - "all" (default): Both XLSX and CSV files
            - "xlsx": Single multi-tab Excel file with detailed analysis
            - "csv": Multiple CSV files (kernels, memory, communication, etc.)

    Returns:
        List of paths to generated report files
    """
    if not os.path.exists(trace_file):
        warning_rank_0(f"[TraceLens] Trace file not found: {trace_file}")
        return []

    os.makedirs(output_dir, exist_ok=True)

    # Generate base name from trace filename if not provided
    if report_name is None:
        base_name = os.path.basename(trace_file)
        # Remove extensions like .json.gz
        for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
            if base_name.endswith(trace_ext):
                base_name = base_name[: -len(trace_ext)]
                break
        report_name = base_name

    try:
        # Try using TraceLens Python API directly
        from TraceLens.Reporting import generate_perf_report_pytorch

        generated_files = []

        # For "all" format: TraceLens uses either/or logic - if output_csvs_dir is set,
        # it ONLY generates CSVs. So we need to call it twice for both formats.
        if output_format == "all":
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)

            # First call: Generate XLSX only
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)

            # Check XLSX output
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

            # Second call: Generate CSVs only
            generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Check CSV outputs (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        elif output_format == "xlsx":
            # XLSX only: Single file with multiple tabs
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
            if os.path.exists(xlsx_path):
                log_rank_0(
                    f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                )
                generated_files.append(xlsx_path)

        elif output_format == "csv":
            # CSV only: Multiple files in a subdirectory per rank
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)
            dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)

            # Collect all generated CSV files (escape path to handle [] characters in filenames)
            csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
            if csv_files:
                log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                generated_files.append(csv_subdir)  # Upload directory to preserve structure

        if generated_files:
            return generated_files

        warning_rank_0(f"[TraceLens] No output files generated for: {trace_file}")
        return []

    except ImportError:
        log_rank_0("[TraceLens] TraceLens not available, using fallback CSV summary")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []

    except Exception as e:
        warning_rank_0(f"[TraceLens] Error generating report: {e}")
        # Fallback to simple CSV summary
        csv_path = _generate_trace_summary_csv(trace_file, output_dir, f"{report_name}_summary.csv")
        return [csv_path] if csv_path else []
```
Copilot AI · Jan 22, 2026
Missing input validation for output_format parameter. The function expects 'all', 'xlsx', or 'csv' but doesn't validate the input. If an invalid value is provided (e.g., 'json'), the function will silently generate no output. Consider adding validation at the start of the function to raise a clear error for invalid format values.
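A validation sketch for the top of the function, using the names from the excerpt above:

```python
VALID_FORMATS = ("all", "xlsx", "csv")
if output_format not in VALID_FORMATS:
    raise ValueError(
        f"[TraceLens] Unsupported output_format {output_format!r}; expected one of {VALID_FORMATS}"
    )
```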
```python
        for trace_ext in [".json.gz", ".json", ".pt.trace.json.gz", ".pt.trace.json"]:
            if base_name.endswith(trace_ext):
                base_name = base_name[: -len(trace_ext)]
                break
        report_name = base_name
```
Copilot AI · Jan 22, 2026
Potential bug: The extension-removal loop breaks after the first matching extension, and the shorter patterns (".json.gz", ".json") are listed before the longer ones (".pt.trace.json.gz", ".pt.trace.json"), so the longer patterns can never match. For example, a file named foo.pt.trace.json.gz matches ".json.gz" first and is stripped to foo.pt.trace rather than foo. Consider ordering the patterns longest-first or using a more explicit approach such as rsplit.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
```python
        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401
```
Copilot AI · Jan 26, 2026
The openpyxl module is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import openpyxl as _ or adding a comment explaining this is an availability check.
Suggested change:

```python
        import openpyxl as _openpyxl  # availability check
```
```python
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401
```
Copilot AI · Jan 26, 2026
The TraceLens module is imported but intentionally unused (indicated by the noqa comment). Consider using a more explicit pattern like import TraceLens as _ or adding a comment explaining this is an availability check.
Suggested change:

```python
        import TraceLens as _  # noqa: F401  # availability check
```
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
```python
        generate_tracelens_report=getattr(args, "generate_tracelens_report", False),
        upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", False),
```
Copilot AI · Feb 2, 2026
The config file sets generate_tracelens_report: true and mlflow_upload_tracelens_report: true as defaults (lines 28-29), but in trainer.py, the code uses getattr(args, "generate_tracelens_report", False) and getattr(args, "mlflow_upload_tracelens_report", False) with False as the fallback default (lines 1131-1132).
This creates an inconsistency: if these config values are not properly loaded into args for some reason, the fallback defaults in the code (False) differ from the config file defaults (True). This could lead to confusing behavior where users expect TraceLens reports based on the config file but don't get them due to the getattr fallback.
Consider one of:
- Use True as the fallback default in getattr to match the config file
- Change the config defaults to false to match the code fallback
- Ensure the config values are always properly loaded so fallbacks aren't needed
- Add a comment explaining why the fallback defaults differ from config defaults
Suggested change:

```python
        generate_tracelens_report=getattr(args, "generate_tracelens_report", True),
        upload_tracelens_report=getattr(args, "mlflow_upload_tracelens_report", True),
```
```python
def _ensure_openpyxl_installed() -> bool:
    """
    Ensure openpyxl is installed for XLSX generation.

    Returns:
        True if openpyxl is available, False otherwise
    """
    try:
        import openpyxl  # noqa: F401

        return True
    except ImportError:
        log_rank_0("[TraceLens] openpyxl not found, installing for XLSX support...")
        try:
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", "openpyxl", "-q"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed openpyxl")
            return True
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install openpyxl: {e}")
            return False


def _ensure_tracelens_installed() -> bool:
    """
    Ensure TraceLens and its dependencies are installed.

    TraceLens is available from GitHub: https://github.com/AMD-AGI/TraceLens
    XLSX generation requires openpyxl which is installed separately.

    Returns:
        True if TraceLens is available, False otherwise
    """
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
    except ImportError:
        log_rank_0("[TraceLens] TraceLens not found, attempting to install from GitHub...")
        try:
            # TraceLens is on GitHub, not PyPI
            subprocess.check_call(
                [
                    sys.executable,
                    "-m",
                    "pip",
                    "install",
                    "git+https://github.com/AMD-AGI/TraceLens.git",
                    "-q",
                ],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            log_rank_0("[TraceLens] Successfully installed TraceLens from GitHub")
        except subprocess.CalledProcessError as e:
            warning_rank_0(f"[TraceLens] Failed to install TraceLens: {e}")
            return False

    # Ensure openpyxl is installed for XLSX generation
    _ensure_openpyxl_installed()

    return True
```
Copilot AI · Feb 2, 2026
The auto-installation of packages from GitHub (TraceLens) and PyPI (openpyxl) during runtime could cause issues in production environments:
- Security concern: Installing packages at runtime without explicit user control or version pinning could introduce supply chain vulnerabilities
- Reproducibility: Different runs might get different package versions
- Environment restrictions: Many production environments disable pip install or restrict network access
- Multi-process risks: In distributed training, multiple ranks might attempt concurrent installations, leading to race conditions or failures
Consider one or more of these alternatives:
- Add TraceLens and openpyxl as optional dependencies in setup.py/requirements.txt
- Document that these dependencies should be pre-installed for TraceLens functionality
- Add a configuration option to disable auto-installation and just skip the feature if dependencies are missing
- At minimum, add version pinning to the installation commands (e.g., "git+https://github.com/AMD-AGI/TraceLens.git@v1.0.0")
(This comment quotes _ensure_tracelens_installed, shown in full above.)
Copilot AI · Feb 2, 2026
The function _ensure_tracelens_installed() has inconsistent return behavior. When TraceLens is already installed (line 260-262), the function logs a message but doesn't explicitly return True - it falls through to line 285-287 which ensures openpyxl is installed and then returns True. However, this behavior is implicit and could be fragile if the code is modified later.
For clarity and maintainability, add an explicit return True after line 262 or restructure the logic to make the control flow more explicit.
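One possible restructuring, sketched under the assumption that the pip call is factored into a hypothetical _install_tracelens_from_github() helper:

```python
def _ensure_tracelens_installed() -> bool:
    try:
        import TraceLens  # noqa: F401

        log_rank_0("[TraceLens] TraceLens is already installed")
        tracelens_ok = True
    except ImportError:
        tracelens_ok = _install_tracelens_from_github()  # hypothetical helper wrapping the pip call
    if not tracelens_ok:
        return False
    _ensure_openpyxl_installed()  # best-effort; only affects XLSX support
    return True
```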
(This comment quotes _ensure_tracelens_installed, shown in full above.)
Copilot AI · Feb 2, 2026
In distributed training environments, if the TraceLens installation function is called by multiple ranks simultaneously (e.g., if upload_mlflow_artifacts is accidentally called from multiple ranks), there could be race conditions or conflicts when multiple processes try to install packages concurrently via pip.
While the current code only calls this from the rank that initializes MLflow (last rank), consider adding defensive checks or documentation to prevent issues if this function is called from the wrong rank. The comment on line 181 mentions "Only the rank that initialized MLflow (last rank) should call this" but there's no programmatic enforcement.
Consider adding a check at the beginning of the function to verify it's being called from the correct rank, or add a distributed barrier/lock mechanism if multiple ranks might call installation functions.
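A defensive sketch along those lines, assuming torch.distributed is initialized and following the PR's last-rank convention:

```python
import torch.distributed as dist


def _ensure_tracelens_on_last_rank() -> bool:
    """Install only on the MLflow (last) rank; other ranks wait at the barrier."""
    if dist.is_available() and dist.is_initialized():
        last_rank = dist.get_world_size() - 1
        ok = True
        if dist.get_rank() == last_rank:
            ok = _ensure_tracelens_installed()
        dist.barrier()  # ensure installation finished before any rank proceeds
        return ok
    return _ensure_tracelens_installed()
```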
(This comment quotes the "all"/"xlsx"/"csv" branches of generate_tracelens_report, shown in full above.)
Copilot AI · Feb 2, 2026
The generate_perf_report_pytorch function is called without catching potential exceptions from the TraceLens library itself (lines 396, 406, 417, 428). While there is a broad exception handler at line 448, if generate_perf_report_pytorch raises an exception on the first call (line 396) when generating XLSX in "all" mode, the second call for CSV generation (line 406) will be skipped, leading to incomplete output.
Consider wrapping each generate_perf_report_pytorch call individually in try-except blocks, especially in the "all" format case, so that if XLSX generation fails, CSV generation can still be attempted, and vice versa. This would make the function more resilient to partial failures.
Suggested change (replacing the branches quoted above):

```python
            try:
                dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
            except Exception as e:
                warning_rank_0(
                    f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
                )
            else:
                # Check XLSX output
                if os.path.exists(xlsx_path):
                    log_rank_0(
                        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                    )
                    generated_files.append(xlsx_path)

            # Second call: Generate CSVs only
            try:
                generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
            except Exception as e:
                warning_rank_0(
                    f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
                )
            else:
                # Check CSV outputs (escape path to handle [] characters in filenames)
                csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
                if csv_files:
                    log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                    generated_files.append(csv_subdir)  # Upload directory to preserve structure

        elif output_format == "xlsx":
            # XLSX only: Single file with multiple tabs
            xlsx_path = os.path.join(output_dir, f"{report_name}_analysis.xlsx")
            try:
                dfs = generate_perf_report_pytorch(trace_file, output_xlsx_path=xlsx_path)
            except Exception as e:
                warning_rank_0(
                    f"[TraceLens] Failed to generate XLSX report for {report_name}: {e}"
                )
            else:
                if os.path.exists(xlsx_path):
                    log_rank_0(
                        f"[TraceLens] Generated XLSX report with {len(dfs)} tabs: {os.path.basename(xlsx_path)}"
                    )
                    generated_files.append(xlsx_path)

        elif output_format == "csv":
            # CSV only: Multiple files in a subdirectory per rank
            csv_subdir = os.path.join(output_dir, report_name)
            os.makedirs(csv_subdir, exist_ok=True)
            try:
                dfs = generate_perf_report_pytorch(trace_file, output_csvs_dir=csv_subdir)
            except Exception as e:
                warning_rank_0(
                    f"[TraceLens] Failed to generate CSV reports for {report_name}: {e}"
                )
            else:
                # Collect all generated CSV files (escape path to handle [] characters in filenames)
                csv_files = glob.glob(os.path.join(glob.escape(csv_subdir), "*.csv"))
                if csv_files:
                    log_rank_0(f"[TraceLens] Generated {len(csv_files)} CSV files for {report_name}")
                    generated_files.append(csv_subdir)  # Upload directory to preserve structure
```
Move upload_mlflow_artifacts() from global_vars.py to a new mlflow_setup.py to reduce merge conflicts. Includes TraceLens report generation and upload parameters. global_vars.py now matches main, avoiding future conflicts when merging from the main branch.
feat: Add TraceLens integration for trace analysis with MLflow upload

Adds TraceLens trace analysis capability to automatically generate performance reports from PyTorch profiler traces and upload them to MLflow.

Features
- Analysis reports are uploaded to MLflow under artifacts/trace_analysis/

Config Options