
Conversation

**@kaiming-cheng** (Contributor):
This PR introduces a diagnose module for building GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt-section rendering for bottleneck analysis.

Core Components

1. MetricSchema (metric_schema.py)

  • Defines single source of truth for NCU profiling metrics (keys, labels, units)
  • Organizes metrics into different sections: SM & Compute Utilization, Memory Bandwidth & Cache, Memory Access Patterns, Occupancy & Resources, Stall Metrics
  • Extensible schema design: metrics can be added, removed, or recategorized by editing the schema, supporting iterative experimentation (see the sketch below)
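
As an illustration, a schema entry might look roughly like this (the class and section names here are hypothetical; the real definitions live in metric_schema.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDef:
    key: str    # NCU metric identifier
    label: str  # human-readable label used in prompts
    unit: str   # display unit

# Hypothetical section; actual keys and groupings are defined in metric_schema.py.
SM_COMPUTE_UTILIZATION = [
    MetricDef("sm__throughput.avg.pct_of_peak_sustained_elapsed", "SM Throughput", "%"),
    MetricDef("sm__warps_active.avg.pct_of_peak_sustained_active", "Achieved Occupancy", "%"),
]
```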

2. GPU Specs (gpu_specs.py)

  • GPU specifications database for NVIDIA A100, H100, RTX 4090, RTX 5080
  • Auto-detection via nvidia-smi with fuzzy matching support (sketched below)
  • Hardware specs include: peak compute (FP32/FP16/BF16), memory bandwidth, SM count, cache sizes, memory type
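
The nvidia-smi auto-detection might look roughly like this (the function name is hypothetical; `--query-gpu=name` is a standard nvidia-smi flag):

```python
import subprocess
from typing import Optional

def detect_gpu_name() -> Optional[str]:
    """Return the first GPU's product name, or None if detection fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None
```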

3. Judger Prompts (judger_prompts.py)

  • Prompt builder for the Judge LLM dual-bottleneck analysis
  • Integrates section renderers for composable prompt construction
  • Response extraction with multi-strategy JSON parsing (see the sketch below)
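
A minimal sketch of what multi-strategy extraction can look like (the actual strategies live in judger_prompts.py; names here are illustrative):

```python
import json
import re
from typing import Any, Dict, Optional

def extract_analysis(response: str) -> Optional[Dict[str, Any]]:
    # Strategy 1: the whole response is valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the first { ... } block containing "bottleneck_1".
    match = re.search(r'\{.*"bottleneck_1".*\}', response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```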

Example Usage

```python
from kernel_perf_agent.kernel_opt.diagnose_prompt import (
    get_gpu_specs,
    build_judge_optimization_prompt,
)

specs = get_gpu_specs()
print(f"\nUsing specs for: {specs['name']} ({specs.get('architecture', 'Unknown')})")
print(f"  - Peak Memory Bandwidth: {specs['peak_memory_bw_gbps']} GB/s")
print(f"  - Peak FP32 Performance: {specs['peak_fp32_tflops']} TFLOPS")
print(f"  - SM Count: {specs['sm_count']}")
```

Output:

```
Detected GPU: NVIDIA H100

Using specs for: NVIDIA H100 (Hopper)
  - Peak Memory Bandwidth: 3352 GB/s
  - Peak FP32 Performance: 51.0 TFLOPS
  - SM Count: 132
```
```python
system_prompt, user_prompt = build_judge_optimization_prompt(
    kernel_code=kernel_code,
    problem_description=problem_desc,
    ncu_metrics=ncu_metrics,
    gpu_specs=specs,
)
```

The generated system prompt begins:

> You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
>
> ......

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 13, 2026
@kaiming-cheng kaiming-cheng changed the title Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) [Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) Jan 13, 2026
Comment on lines +148 to +96
```python
# Return default if detection failed
if gpu_name is None:
    print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
    return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
```
**Contributor:**
Should we fall back to A100? Or does returning an empty dict make more sense?

**Contributor Author:**
I agree returning an empty dict is cleaner, but it will also lead to a KeyError in the optimization flow. Should we decide to disable optimization if no gpu_name is found?

**@Jack-Khuu** (Contributor), Jan 15, 2026:
I think that makes sense; if there are setup/detection issues then we shouldn't optimize.
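
One way the agreed-upon behavior could look (the exception name is hypothetical):

```python
class GPUDetectionError(RuntimeError):
    """Raised when no GPU can be detected; optimization should be skipped."""

# Replacing the A100 fallback:
if gpu_name is None:
    raise GPUDetectionError(
        "GPU auto-detection failed (nvidia-smi unavailable); "
        "skipping hardware-specific optimization"
    )
```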


```python
# GPU specifications database
# Sources: NVIDIA official specifications, manufacturer datasheets
GPU_SPECS_DATABASE = {
```
**Contributor:**
Can we move this const to its own file? It makes module overriding easier.

Comment on lines +158 to +109
```python
gpu_name_lower = gpu_name.lower()
for key, specs in GPU_SPECS_DATABASE.items():
    key_lower = key.lower()
    # Check if either name contains the other
    if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
        print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
        return specs.copy()
```
**Contributor:**
Curious if you've encountered this case before?

**Contributor Author:**
Sometimes I'll just put "a100" or "h100" in my optimization workflow. What do you think: should we just force the GPU name input to be an exact match?

**Contributor:**
Enum it
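
A minimal sketch of the suggested enum (member names, and database keys other than "NVIDIA A100", are assumed):

```python
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"
    RTX_4090 = "NVIDIA GeForce RTX 4090"
    RTX_5080 = "NVIDIA GeForce RTX 5080"

def get_specs(gpu: SupportedGPU) -> dict:
    # Exact lookup; fuzzy matching becomes unnecessary once input is constrained.
    return GPU_SPECS_DATABASE[gpu.value].copy()
```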

```python
if detected_name:
    print(f"\nDetected GPU: {detected_name}")
else:
    print("\nNo GPU detected (nvidia-smi not available)")
```
**Contributor:**
Suggested change:

```suggestion
    print("\nNo GPU detected (nvidia-smi not available)")
    exit()
```

```python
Metric definitions are in metric_schema.py.
"""

from typing import Any, Callable, Dict, List, Optional, Tuple
```
**Contributor:**
😦


```python
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key, "N/A")
    lines.append(f"- **{label}:** {value}{unit}")
```
**Contributor:**
Do we want the unit if the value is N/A?
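
One possible fix, assuming the loop above (drops the unit when the value is missing):

```python
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key)
    if value is None:
        lines.append(f"- **{label}:** N/A")  # no unit for a missing value
    else:
        lines.append(f"- **{label}:** {value}{unit}")
```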

```python
raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Extract first kernel's metrics for the metric getter
first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {}
```
**Contributor:**
We check for empty above.
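
i.e., the conditional fallback could be dropped, roughly:

```python
# Emptiness is already validated above, so no fallback is needed.
first_kernel = next(iter(ncu_metrics.values()))
```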

```python
except json.JSONDecodeError:
    pass

# Strategy 2: Find first { ... } block with "bottleneck_1" field
```
**Contributor:**
Are strategies 2/3 typically encountered?

Not required for this PR, but we can look into forcing the LLM providers to return structured output.

**Contributor Author:**
Not really, actually. All my experiments return the dual-bottleneck analysis.
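
For reference, structured output could later be enforced downstream with a small validation model, e.g. via pydantic (field names hypothetical):

```python
from pydantic import BaseModel

class Bottleneck(BaseModel):
    category: str
    evidence: str
    optimization: str

class DualBottleneckAnalysis(BaseModel):
    bottleneck_1: Bottleneck
    bottleneck_2: Bottleneck

# analysis = DualBottleneckAnalysis.model_validate_json(llm_response)
```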

```python
) and _validate_bottleneck_entry(analysis["bottleneck_2"])


VALID_CATEGORIES = frozenset(
```
**Contributor:**
frozenset isn't wrong, but we're just using it as a lookup, so a normal set is fine.

@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_3 branch from ac11151 to bfa2fd0 Compare January 15, 2026 19:21
Kaiming Cheng added 22 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_3 branch from af9b7af to e2c599e Compare January 15, 2026 19:48