
Conversation

@kaiming-cheng
Contributor

This PR introduces a new modular profiling infrastructure for Triton kernel performance analysis using NVIDIA Nsight Compute (NCU). The module provides wrapper script generation, profiling execution, and metrics collection.

Core Components

1. NCUWrapperGenerator (ncu_wrapper_generator.py)

  • Generates NCU wrapper scripts from kernel and problem files
  • Uses Jinja2 templates for flexible, maintainable generation
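Template-based wrapper generation can be sketched with the stdlib's `string.Template` as a dependency-free stand-in for Jinja2; the module and function names below are illustrative, not the contents of the actual `ncu_wrapper_template.j2`:

```python
from string import Template

# Stand-in for the Jinja2 template shipped as ncu_wrapper_template.j2;
# the placeholder names here are assumptions for illustration only.
WRAPPER_TEMPLATE = Template(
    "import ${kernel_module}\n"
    "${kernel_module}.${kernel_fn}(*inputs)\n"
)

def render_wrapper(kernel_module: str, kernel_fn: str) -> str:
    """Render a wrapper script body for a given kernel entry point."""
    return WRAPPER_TEMPLATE.substitute(
        kernel_module=kernel_module, kernel_fn=kernel_fn
    )

script = render_wrapper("my_kernel", "launch")
```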

2. KernelProfiler (kernel_profiler.py)

  • Profiles Triton kernels using NCU with automatic retry logic
  • Organizes NCU output into CSV and JSON files
  • Empirically chooses a set of NCU metrics suited to kernel optimization
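The retry behavior can be sketched as a small wrapper around `subprocess.run`; this is illustrative only, since the PR's `KernelProfiler` wraps the actual `ncu` invocation and its retry policy may differ:

```python
import subprocess
import sys
import time

def run_with_retry(cmd: list[str], max_retries: int = 3, delay_s: float = 1.0):
    """Run a command, retrying on non-zero exit status.

    Sketch of automatic retry logic; timing and attempt counts are
    illustrative, not the values used in kernel_profiler.py.
    """
    result = None
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_retries:
            time.sleep(delay_s)
    return result

# Exercise with a trivially succeeding command in place of ncu:
result = run_with_retry([sys.executable, "-c", "print('profiled')"])
```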

3. Jinja2 Template (ncu_wrapper_template.j2)

  • Supports conditional blocks for feature toggles
  • Handles multiple kernel types:
    • Standard kernels: kernel_function(*inputs)
    • Conv/Linear kernels: Extracts weights from Model instances
    • RMSNorm kernels: Passes init_inputs (features, eps)
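The kernel-type branches above can be mirrored in plain Python to show the dispatch; the actual logic lives in the Jinja2 template's conditional blocks, and the snippets returned here are illustrative:

```python
# Illustrative dispatch mirroring the template's kernel-type branches;
# ncu_wrapper_template.j2 expresses this with Jinja2 conditionals instead.
def call_snippet(kernel_type: str) -> str:
    if kernel_type in ("conv", "linear"):
        # Conv/Linear: weights are extracted from a Model instance first.
        return "weight = model.weight\nkernel_function(inputs[0], weight)"
    if kernel_type == "rmsnorm":
        # RMSNorm: init_inputs carry (features, eps).
        return "kernel_function(*inputs, *init_inputs)"
    # Standard kernels take their inputs directly.
    return "kernel_function(*inputs)"

standard = call_snippet("standard")
rmsnorm = call_snippet("rmsnorm")
```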

Test Results:

The profiling module successfully extracts NCU metrics:

{
  "metrics": {
    "void at::vectorized_elementwise_kernel<4, ...>": {
      "dram__throughput.avg.pct_of_peak_sustained_elapsed": 91.25,
      "sm__warps_active.avg.pct_of_peak_sustained_active": 88.6,
      ...
    }
  },
  "metadata": {
    "kernel_file": "...",
    "round_num": 1,
    "timestamp": "2026-01-07T22:25:46.928773Z"
  }
}
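Reading the JSON layout above back into Python is straightforward; this reader is a hypothetical sketch, and the PR's `load_ncu_metrics` may differ in signature and validation:

```python
import json
import tempfile

# Payload matching the structure shown above (values copied from the example).
payload = {
    "metrics": {
        "void at::vectorized_elementwise_kernel<4, ...>": {
            "dram__throughput.avg.pct_of_peak_sustained_elapsed": 91.25,
        }
    },
    "metadata": {"round_num": 1},
}

def read_metrics(path: str) -> dict:
    """Return the per-kernel metric dict from a round's JSON file."""
    with open(path) as f:
        return json.load(f)["metrics"]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(payload, f)
    metrics_path = f.name

metrics = read_metrics(metrics_path)
```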

Next steps:

This module provides profiling functionality for future integration into opt_worker.py. Future PRs will:

  1. Add remaining components (benchmarking, prompts, verification, etc.)
  2. Import and integrate all modules in opt_worker.py

@meta-cla meta-cla bot added the CLA Signed label Jan 7, 2026
@kaiming-cheng kaiming-cheng requested a review from Laurawly January 7, 2026 22:50
Comment on lines 21 to 26
try:
    from jinja2 import Template

    HAS_JINJA2 = True
except ImportError:
    HAS_JINJA2 = False
Contributor

I know we do this check in main too, but we don't need to

We should assume that users have a correct environment for common imports

Contributor Author

Good point - fixed

logger: Logger instance
"""
self.logger = logger
self._template_cache: Optional[Template] = None
Contributor

Prefer | None

if self._template_cache is not None:
    return self._template_cache

if not HAS_JINJA2:
Contributor

ditto above

HAS_JINJA2 = False


class NCUWrapperGenerator:
Contributor

Generator is a loaded term maybe Factory?

Suggested change
class NCUWrapperGenerator:
class NCUWrapperFactory:

Comment on lines 38 to 44
__all__ = [
"METRICS",
"METRIC_COLUMNS",
"profile_triton_kernel",
"load_ncu_metrics",
"metrics_to_prompt",
]
Contributor

Not sure we need this?

metrics_file = self.logs_dir / f"round{round_num:03d}_ncu_metrics.json"

# Build metadata
metadata = {
Contributor

See other comment, let's make a dataclass to pass around
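One possible shape for the suggested dataclass, with fields mirroring the metadata dict shown in the PR description; the real design in a later revision may differ:

```python
from dataclasses import asdict, dataclass

# Hypothetical dataclass realizing the reviewer's suggestion; field names
# follow the metadata example in the PR description.
@dataclass
class ProfileMetadata:
    kernel_file: str
    round_num: int
    timestamp: str

    def to_dict(self) -> dict:
        """Serialize for the per-round NCU metrics JSON payload."""
        return asdict(self)

meta = ProfileMetadata(
    kernel_file="kernel.py",
    round_num=1,
    timestamp="2026-01-07T22:25:46Z",
)
```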

benchmark_script: Path,
workdir: Path,
out_csv: str = "ncu_output.csv",
python_executable: Optional[str] = None,
Contributor

Is this being used?

Contributor Author

We pass this to the NCU command to specify the Python executable used to run the benchmark script.

        sub = pd.concat(results, ignore_index=True)
    else:
        sub = pd.DataFrame(columns=keep_cols)
elif select in ("first", "last", "max_cycles"):
Contributor

Let's make this an enum
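A sketch of the suggested enum for the selection modes visible in the snippet; member names are assumptions, and the PR may name them differently:

```python
from enum import Enum

# Hypothetical enum for the "first" / "last" / "max_cycles" select modes
# seen in the quoted snippet.
class RowSelect(str, Enum):
    FIRST = "first"
    LAST = "last"
    MAX_CYCLES = "max_cycles"

# The str mixin keeps existing string comparisons working during migration:
mode = RowSelect("max_cycles")
```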

@Jack-Khuu
Contributor

Thanks again for modularizing everything! A few questions/comments:

  1. How will modules interact with KernelProfiler? Is it a single instance that will be used for different kernel_files/problem_files via profile_kernel or is every problem/kernel creating their own profiler?

    • This affects how you'd want to handle caching and file logging
  2. Let's use the native typehints instead of importing from typing (see "Updating typehints (Optional, Dict, List, Tuple) to Python 3.10+ standard", #66)

@kaiming-cheng
Contributor Author

> Thanks again for modularizing everything! Few question/comments:
>
>   1. How will modules interact with KernelProfiler? Is it a single instance that will be used for different kernel_files/problem_files via profile_kernel or is every problem/kernel creating their own profiler?
>
>     • This affects how you'd want to handle caching and file logging
>   2. Let's use the native typehints instead of importing from typing (see "Updating typehints (Optional, Dict, List, Tuple) to Python 3.10+ standard", #66)

Thanks for reviewing this PR! For each OptimizationWorker, it will create one KernelProfiler instance. The OptimizationWorker will reuse the KernelProfiler across different kernel files. Currently we assume the problem_file remains the same when doing optimization.
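The ownership model described in the reply can be sketched as follows; the class and method bodies are illustrative stand-ins, not the PR's implementations:

```python
# Sketch of the described ownership: one profiler per worker, reused
# across kernel files, with problem_file fixed for the worker's lifetime.
class KernelProfiler:
    def __init__(self, problem_file: str):
        self.problem_file = problem_file
        self.profiled: list[str] = []

    def profile_kernel(self, kernel_file: str) -> None:
        # Stand-in for the actual NCU profiling run.
        self.profiled.append(kernel_file)

class OptimizationWorker:
    def __init__(self, problem_file: str):
        # Single KernelProfiler instance owned by this worker.
        self.profiler = KernelProfiler(problem_file)

    def optimize(self, kernel_files: list[str]) -> None:
        for kf in kernel_files:
            self.profiler.profile_kernel(kf)

worker = OptimizationWorker("problem.py")
worker.optimize(["kernel_v1.py", "kernel_v2.py"])
```

This shape makes the caching and file-logging question concrete: per-problem state can live on the single profiler, since the problem_file does not change across rounds.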

@kaiming-cheng kaiming-cheng changed the title Add Profiling Module (NCU Wrapper Generation & Kernel Profiling) [Optimization 1/n] Add Profiling Module (NCU Wrapper Generation & Kernel Profiling) Jan 13, 2026
@Laurawly
Contributor

Laurawly commented Jan 15, 2026

Could you rebase/update the series onto latest main (and ideally stack them 70 → 71 → 73 → 77 → 78) so each PR diff is minimal and merge-order is clear?

@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_1 branch from 29fc219 to 543453a Compare January 15, 2026 19:48

[tool.setuptools.packages.find]
-include = ["triton_kernel_agent*", "Fuser*", "scripts", "utils"]
+include = ["triton_kernel_agent*", "kernel_perf_agent*", "Fuser*", "scripts", "utils"]
Contributor

Are we missing namespace config or an __init__.py to pick up kernel_perf_agent?

Contributor Author

Good catch! Added the __init__.py in the latest commit


# Build NCU command
cmd = [
    "sudo",
Contributor

Shall we consider not hardcoding sudo in library code and making it opt-in instead? We could change profile_triton_kernel(...) to accept use_sudo: bool = False (and/or read an env var like KERNELAGENT_NCU_USE_SUDO=1). If the non-sudo run fails with a known permission error (NCU often reports profiling permission restrictions), raise a clearer error: "NCU requires admin profiling permissions on this system; rerun with use_sudo=True or adjust the driver setting / cluster policy."

Contributor Author

Thanks for the great feedback! I've updated this in the latest commit
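The opt-in flow the reviewer describes could be sketched as follows; the parameter and env var names come from the review comment, while everything else is illustrative:

```python
import os

# Clear the env var so this demo exercises the non-sudo default path.
os.environ.pop("KERNELAGENT_NCU_USE_SUDO", None)

def resolve_use_sudo(use_sudo: bool = False) -> bool:
    """Opt in via argument or the KERNELAGENT_NCU_USE_SUDO=1 env var."""
    return use_sudo or os.environ.get("KERNELAGENT_NCU_USE_SUDO") == "1"

def build_cmd(base_cmd: list[str], use_sudo: bool = False) -> list[str]:
    # "sudo" is prepended only when explicitly requested.
    prefix = ["sudo"] if resolve_use_sudo(use_sudo) else []
    return prefix + base_cmd

plain = build_cmd(["ncu", "--csv", "bench.py"])
elevated = build_cmd(["ncu", "--csv", "bench.py"], use_sudo=True)
```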

