
Conversation

**@kaiming-cheng** (Contributor):
This PR introduces a diagnose module for building GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt-section rendering for bottleneck analysis.

Core Components

1. MetricSchema (metric_schema.py)

  • Defines single source of truth for NCU profiling metrics (keys, labels, units)
  • Organizes metrics into different sections: SM & Compute Utilization, Memory Bandwidth & Cache, Memory Access Patterns, Occupancy & Resources, Stall Metrics
  • Extensible schema design: metrics can be added, removed, or recategorized by editing the schema, supporting iterative experimentation (see the sketch below)
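
As an illustration, a schema entry might look roughly like this (the class and section names here are hypothetical; the real definitions live in metric_schema.py):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDef:
    key: str    # NCU metric identifier
    label: str  # human-readable label used in prompts
    unit: str   # display unit

# Hypothetical section; actual keys and groupings are defined in metric_schema.py.
SM_COMPUTE_UTILIZATION = [
    MetricDef("sm__throughput.avg.pct_of_peak_sustained_elapsed", "SM Throughput", "%"),
    MetricDef("sm__warps_active.avg.pct_of_peak_sustained_active", "Achieved Occupancy", "%"),
]
```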

2. GPU Specs (gpu_specs.py)

  • GPU specifications database for NVIDIA A100, H100, RTX 4090, RTX 5080
  • Auto-detection via nvidia-smi with fuzzy matching support (sketched below)
  • Hardware specs include: peak compute (FP32/FP16/BF16), memory bandwidth, SM count, cache sizes, memory type
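
The nvidia-smi auto-detection might look roughly like this (the function name is hypothetical; `--query-gpu=name` is a standard nvidia-smi flag):

```python
import subprocess
from typing import Optional

def detect_gpu_name() -> Optional[str]:
    """Return the first GPU's product name, or None if detection fails."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None
```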

3. Judger Prompts (judger_prompts.py)

  • Prompt builder for the Judge LLM dual-bottleneck analysis
  • Integrates section renderers for composable prompt construction
  • Response extraction with multi-strategy JSON parsing (see the sketch below)
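
A minimal sketch of what multi-strategy extraction can look like (the actual strategies live in judger_prompts.py; names here are illustrative):

```python
import json
import re
from typing import Any, Dict, Optional

def extract_analysis(response: str) -> Optional[Dict[str, Any]]:
    # Strategy 1: the whole response is valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the first { ... } block containing "bottleneck_1".
    match = re.search(r'\{.*"bottleneck_1".*\}', response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```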

Example Usage

```python
from kernel_perf_agent.kernel_opt.diagnose_prompt import (
    get_gpu_specs,
    build_judge_optimization_prompt,
)

specs = get_gpu_specs()
print(f"\nUsing specs for: {specs['name']} ({specs.get('architecture', 'Unknown')})")
print(f"  - Peak Memory Bandwidth: {specs['peak_memory_bw_gbps']} GB/s")
print(f"  - Peak FP32 Performance: {specs['peak_fp32_tflops']} TFLOPS")
print(f"  - SM Count: {specs['sm_count']}")
```

Output:

```
Detected GPU: NVIDIA H100

Using specs for: NVIDIA H100 (Hopper)
  - Peak Memory Bandwidth: 3352 GB/s
  - Peak FP32 Performance: 51.0 TFLOPS
  - SM Count: 132
```
```python
system_prompt, user_prompt = build_judge_optimization_prompt(
    kernel_code=kernel_code,
    problem_description=problem_desc,
    ncu_metrics=ncu_metrics,
    gpu_specs=specs,
)
```

The generated system prompt begins:

> You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
>
> ......

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 13, 2026
@kaiming-cheng kaiming-cheng changed the title Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) [Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) Jan 13, 2026
Comment on lines +148 to +96
```python
# Return default if detection failed
if gpu_name is None:
    print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
    return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
```
**Contributor:**
Should we fall back to A100? Or does returning an empty dict make more sense?

**Contributor Author:**
I agree returning an empty dict is cleaner, but it will also lead to a KeyError in the optimization flow. Should we decide to disable optimization if no gpu_name is found?

**@Jack-Khuu** (Contributor), Jan 15, 2026:
I think that makes sense; if there are setup/detection issues then we shouldn't optimize.
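
One way the agreed-upon behavior could look (the exception name is hypothetical):

```python
class GPUDetectionError(RuntimeError):
    """Raised when no GPU can be detected; optimization should be skipped."""

# Replacing the A100 fallback:
if gpu_name is None:
    raise GPUDetectionError(
        "GPU auto-detection failed (nvidia-smi unavailable); "
        "skipping hardware-specific optimization"
    )
```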


```python
# GPU specifications database
# Sources: NVIDIA official specifications, manufacturer datasheets
GPU_SPECS_DATABASE = {
```
**Contributor:**
Can we move this const to its own file? It makes module overriding easier.

Comment on lines +158 to +109
```python
gpu_name_lower = gpu_name.lower()
for key, specs in GPU_SPECS_DATABASE.items():
    key_lower = key.lower()
    # Check if either name contains the other
    if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
        print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
        return specs.copy()
```
**Contributor:**
Curious if you've encountered this case before?

**Contributor Author:**
Sometimes I'll just put "a100" or "h100" in my optimization workflow. What do you think: should we just force the GPU name input to be an exact match?

**Contributor:**
Enum it
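
A minimal sketch of the suggested enum (member names, and database keys other than "NVIDIA A100", are assumed):

```python
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"
    RTX_4090 = "NVIDIA GeForce RTX 4090"
    RTX_5080 = "NVIDIA GeForce RTX 5080"

def get_specs(gpu: SupportedGPU) -> dict:
    # Exact lookup; fuzzy matching becomes unnecessary once input is constrained.
    return GPU_SPECS_DATABASE[gpu.value].copy()
```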

```python
if detected_name:
    print(f"\nDetected GPU: {detected_name}")
else:
    print("\nNo GPU detected (nvidia-smi not available)")
```
**Contributor:**
Suggested change:

```suggestion
    print("\nNo GPU detected (nvidia-smi not available)")
    exit()
```

```python
Metric definitions are in metric_schema.py.
"""

from typing import Any, Callable, Dict, List, Optional, Tuple
```
**Contributor:**
😦


```python
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key, "N/A")
    lines.append(f"- **{label}:** {value}{unit}")
```
**Contributor:**
Do we want the unit if the value is N/A?
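
One possible fix, assuming the loop above (drops the unit when the value is missing):

```python
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key)
    if value is None:
        lines.append(f"- **{label}:** N/A")  # no unit for a missing value
    else:
        lines.append(f"- **{label}:** {value}{unit}")
```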

```python
raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Extract first kernel's metrics for the metric getter
first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {}
```
**Contributor:**
We check for empty above.
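
i.e., the conditional fallback could be dropped, roughly:

```python
# Emptiness is already validated above, so no fallback is needed.
first_kernel = next(iter(ncu_metrics.values()))
```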

```python
except json.JSONDecodeError:
    pass

# Strategy 2: Find first { ... } block with "bottleneck_1" field
```
**Contributor:**
Are strategies 2/3 typically encountered?

Not required for this PR, but we can look into forcing the LLM providers to return structured output.

**Contributor Author:**
Not really, actually. All my experiments return the dual-bottleneck analysis.
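
For reference, structured output could later be enforced downstream with a small validation model, e.g. via pydantic (field names hypothetical):

```python
from pydantic import BaseModel

class Bottleneck(BaseModel):
    category: str
    evidence: str
    optimization: str

class DualBottleneckAnalysis(BaseModel):
    bottleneck_1: Bottleneck
    bottleneck_2: Bottleneck

# analysis = DualBottleneckAnalysis.model_validate_json(llm_response)
```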

```python
) and _validate_bottleneck_entry(analysis["bottleneck_2"])


VALID_CATEGORIES = frozenset(
```
**Contributor:**
frozenset isn't wrong, but we're just using it as a lookup, so a normal set is fine.

@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_3 branch from ac11151 to bfa2fd0 Compare January 15, 2026 19:21
Kaiming Cheng added 22 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng kaiming-cheng force-pushed the kaiming/opt_component_3 branch from af9b7af to e2c599e Compare January 15, 2026 19:48