
Conversation

@kaiming-cheng
Contributor

This PR introduces a new modular profiling infrastructure for Triton kernel performance analysis using NVIDIA Nsight Compute (NCU). The module provides wrapper script generation, profiling execution, and metrics collection.

Core Components

1. NCUWrapperGenerator (ncu_wrapper_generator.py)

  • Generates NCU wrapper scripts from kernel and problem files
  • Uses Jinja2 templates for flexible, maintainable generation
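Template-based wrapper generation can be sketched with the stdlib's `string.Template` as a dependency-free stand-in for Jinja2; the module and function names below are illustrative, not the contents of the actual `ncu_wrapper_template.j2`:

```python
from string import Template

# Stand-in for the Jinja2 template shipped as ncu_wrapper_template.j2;
# the placeholder names here are assumptions for illustration only.
WRAPPER_TEMPLATE = Template(
    "import ${kernel_module}\n"
    "${kernel_module}.${kernel_fn}(*inputs)\n"
)

def render_wrapper(kernel_module: str, kernel_fn: str) -> str:
    """Render a wrapper script body for a given kernel entry point."""
    return WRAPPER_TEMPLATE.substitute(
        kernel_module=kernel_module, kernel_fn=kernel_fn
    )

script = render_wrapper("my_kernel", "launch")
```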

2. KernelProfiler (kernel_profiler.py)

  • Profiles Triton kernels using NCU with automatic retry logic
  • Organizes NCU output into CSV and JSON files
  • Empirically chooses a set of NCU metrics suited to kernel optimization
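The retry behavior can be sketched as a small wrapper around `subprocess.run`; this is illustrative only, since the PR's `KernelProfiler` wraps the actual `ncu` invocation and its retry policy may differ:

```python
import subprocess
import sys
import time

def run_with_retry(cmd: list[str], max_retries: int = 3, delay_s: float = 1.0):
    """Run a command, retrying on non-zero exit status.

    Sketch of automatic retry logic; timing and attempt counts are
    illustrative, not the values used in kernel_profiler.py.
    """
    result = None
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_retries:
            time.sleep(delay_s)
    return result

# Exercise with a trivially succeeding command in place of ncu:
result = run_with_retry([sys.executable, "-c", "print('profiled')"])
```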

3. Jinja2 Template (ncu_wrapper_template.j2)

  • Supports conditional blocks for feature toggles
  • Handles multiple kernel types:
    • Standard kernels: kernel_function(*inputs)
    • Conv/Linear kernels: Extracts weights from Model instances
    • RMSNorm kernels: Passes init_inputs (features, eps)
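The kernel-type branches above can be mirrored in plain Python to show the dispatch; the actual logic lives in the Jinja2 template's conditional blocks, and the snippets returned here are illustrative:

```python
# Illustrative dispatch mirroring the template's kernel-type branches;
# ncu_wrapper_template.j2 expresses this with Jinja2 conditionals instead.
def call_snippet(kernel_type: str) -> str:
    if kernel_type in ("conv", "linear"):
        # Conv/Linear: weights are extracted from a Model instance first.
        return "weight = model.weight\nkernel_function(inputs[0], weight)"
    if kernel_type == "rmsnorm":
        # RMSNorm: init_inputs carry (features, eps).
        return "kernel_function(*inputs, *init_inputs)"
    # Standard kernels take their inputs directly.
    return "kernel_function(*inputs)"

standard = call_snippet("standard")
rmsnorm = call_snippet("rmsnorm")
```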

Test Results:

The profiling module successfully extracts NCU metrics:

{
  "metrics": {
    "void at::vectorized_elementwise_kernel<4, ...>": {
      "dram__throughput.avg.pct_of_peak_sustained_elapsed": 91.25,
      "sm__warps_active.avg.pct_of_peak_sustained_active": 88.6,
      ...
    }
  },
  "metadata": {
    "kernel_file": "...",
    "round_num": 1,
    "timestamp": "2026-01-07T22:25:46.928773Z"
  }
}
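Reading the JSON layout above back into Python is straightforward; this reader is a hypothetical sketch, and the PR's `load_ncu_metrics` may differ in signature and validation:

```python
import json
import tempfile

# Payload matching the structure shown above (values copied from the example).
payload = {
    "metrics": {
        "void at::vectorized_elementwise_kernel<4, ...>": {
            "dram__throughput.avg.pct_of_peak_sustained_elapsed": 91.25,
        }
    },
    "metadata": {"round_num": 1},
}

def read_metrics(path: str) -> dict:
    """Return the per-kernel metric dict from a round's JSON file."""
    with open(path) as f:
        return json.load(f)["metrics"]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(payload, f)
    metrics_path = f.name

metrics = read_metrics(metrics_path)
```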

Next steps:

This module provides profiling functionality for future integration into opt_worker.py. Future PRs will:

  1. Add remaining components (benchmarking, prompts, verification, etc.)
  2. Import and integrate all modules in opt_worker.py

@meta-cla meta-cla bot added the CLA Signed label Jan 7, 2026
@kaiming-cheng kaiming-cheng requested a review from Laurawly January 7, 2026 22:50
Comment on lines 21 to 26
try:
    from jinja2 import Template

    HAS_JINJA2 = True
except ImportError:
    HAS_JINJA2 = False
Contributor

I know we do this check in main too, but we don't need to

We should assume that users have a correct environment for common imports

Contributor Author

Good point - fixed

logger: Logger instance
"""
self.logger = logger
self._template_cache: Optional[Template] = None
Contributor

Prefer | None

if self._template_cache is not None:
    return self._template_cache

if not HAS_JINJA2:
Contributor

ditto above

HAS_JINJA2 = False


class NCUWrapperGenerator:
Contributor

Generator is a loaded term maybe Factory?

Suggested change
class NCUWrapperGenerator:
class NCUWrapperFactory:

Comment on lines 38 to 44
__all__ = [
"METRICS",
"METRIC_COLUMNS",
"profile_triton_kernel",
"load_ncu_metrics",
"metrics_to_prompt",
]
Contributor

Not sure we need this?

metrics_file = self.logs_dir / f"round{round_num:03d}_ncu_metrics.json"

# Build metadata
metadata = {
Contributor

See other comment, let's make a dataclass to pass around
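One possible shape for the suggested dataclass, with fields mirroring the metadata dict shown in the PR description; the real design in a later revision may differ:

```python
from dataclasses import asdict, dataclass

# Hypothetical dataclass realizing the reviewer's suggestion; field names
# follow the metadata example in the PR description.
@dataclass
class ProfileMetadata:
    kernel_file: str
    round_num: int
    timestamp: str

    def to_dict(self) -> dict:
        """Serialize for the per-round NCU metrics JSON payload."""
        return asdict(self)

meta = ProfileMetadata(
    kernel_file="kernel.py",
    round_num=1,
    timestamp="2026-01-07T22:25:46Z",
)
```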

benchmark_script: Path,
workdir: Path,
out_csv: str = "ncu_output.csv",
python_executable: Optional[str] = None,
Contributor

Is this being used?

Contributor Author

We pass this to the NCU command to specify the Python executable used to run the benchmark script.

        sub = pd.concat(results, ignore_index=True)
    else:
        sub = pd.DataFrame(columns=keep_cols)
elif select in ("first", "last", "max_cycles"):
Contributor

Let's make this an enum
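A sketch of the suggested enum for the selection modes visible in the snippet; member names are assumptions, and the PR may name them differently:

```python
from enum import Enum

# Hypothetical enum for the "first" / "last" / "max_cycles" select modes
# seen in the quoted snippet.
class RowSelect(str, Enum):
    FIRST = "first"
    LAST = "last"
    MAX_CYCLES = "max_cycles"

# The str mixin keeps existing string comparisons working during migration:
mode = RowSelect("max_cycles")
```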

@Jack-Khuu
Contributor

Thanks again for modularizing everything! A few questions/comments:

  1. How will modules interact with KernelProfiler? Is it a single instance that will be used for different kernel_files/problem_files via profile_kernel or is every problem/kernel creating their own profiler?

    • This affects how you'd want to handle caching and file logging
  2. Let's use the native typehints instead of importing from typing (see "Updating typehints (Optional, Dict, List, Tuple) to Python 3.10+ standard", #66)

@kaiming-cheng
Contributor Author

> Thanks again for modularizing everything! Few question/comments:
>
>   1. How will modules interact with KernelProfiler? Is it a single instance that will be used for different kernel_files/problem_files via profile_kernel or is every problem/kernel creating their own profiler?
>
>     • This affects how you'd want to handle caching and file logging
>   2. Let's use the native typehints instead of importing from typing (see "Updating typehints (Optional, Dict, List, Tuple) to Python 3.10+ standard", #66)

Thanks for reviewing this PR! For each OptimizationWorker, it will create one KernelProfiler instance. The OptimizationWorker will reuse the KernelProfiler across different kernel files. Currently we assume the problem_file remains the same when doing optimization.
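The ownership model described in the reply can be sketched as follows; the class and method bodies are illustrative stand-ins, not the PR's implementations:

```python
# Sketch of the described ownership: one profiler per worker, reused
# across kernel files, with problem_file fixed for the worker's lifetime.
class KernelProfiler:
    def __init__(self, problem_file: str):
        self.problem_file = problem_file
        self.profiled: list[str] = []

    def profile_kernel(self, kernel_file: str) -> None:
        # Stand-in for the actual NCU profiling run.
        self.profiled.append(kernel_file)

class OptimizationWorker:
    def __init__(self, problem_file: str):
        # Single KernelProfiler instance owned by this worker.
        self.profiler = KernelProfiler(problem_file)

    def optimize(self, kernel_files: list[str]) -> None:
        for kf in kernel_files:
            self.profiler.profile_kernel(kf)

worker = OptimizationWorker("problem.py")
worker.optimize(["kernel_v1.py", "kernel_v2.py"])
```

This shape makes the caching and file-logging question concrete: per-problem state can live on the single profiler, since the problem_file does not change across rounds.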

@kaiming-cheng kaiming-cheng changed the title Add Profiling Module (NCU Wrapper Generation & Kernel Profiling) [Optimization 1/n] Add Profiling Module (NCU Wrapper Generation & Kernel Profiling) Jan 13, 2026
@Laurawly
Contributor

Laurawly commented Jan 15, 2026

Could you rebase/update the series onto latest main (and ideally stack them 70 → 71 → 73 → 77 → 78) so each PR diff is minimal and merge-order is clear?

@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_1 branch from 29fc219 to 543453a Compare January 15, 2026 19:48

[tool.setuptools.packages.find]
-include = ["triton_kernel_agent*", "Fuser*", "scripts", "utils"]
+include = ["triton_kernel_agent*", "kernel_perf_agent*", "Fuser*", "scripts", "utils"]
Contributor

Are we missing namespace config or an __init__.py to pick up kernel_perf_agent?

Contributor Author

Good catch! Added the __init__.py in the latest commit


# Build NCU command
cmd = [
    "sudo",
Contributor

Shall we consider not hardcoding sudo in library code and making it opt-in instead? We could change profile_triton_kernel(...) to accept use_sudo: bool = False (and/or read an env var like KERNELAGENT_NCU_USE_SUDO=1). If the non-sudo run fails with a known permission error (NCU often reports profiling permission restrictions), raise a clearer error: "NCU requires admin profiling permissions on this system; rerun with use_sudo=True or adjust the driver setting / cluster policy."

Contributor Author

Thanks for the great feedback! I've updated this in the latest commit
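The opt-in flow the reviewer describes could be sketched as follows; the parameter and env var names come from the review comment, while everything else is illustrative:

```python
import os

# Clear the env var so this demo exercises the non-sudo default path.
os.environ.pop("KERNELAGENT_NCU_USE_SUDO", None)

def resolve_use_sudo(use_sudo: bool = False) -> bool:
    """Opt in via argument or the KERNELAGENT_NCU_USE_SUDO=1 env var."""
    return use_sudo or os.environ.get("KERNELAGENT_NCU_USE_SUDO") == "1"

def build_cmd(base_cmd: list[str], use_sudo: bool = False) -> list[str]:
    # "sudo" is prepended only when explicitly requested.
    prefix = ["sudo"] if resolve_use_sudo(use_sudo) else []
    return prefix + base_cmd

plain = build_cmd(["ncu", "--csv", "bench.py"])
elevated = build_cmd(["ncu", "--csv", "bench.py"], use_sudo=True)
```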

