Skip to content

Conversation

@tuxerrante
Copy link
Collaborator

@tuxerrante tuxerrante commented Dec 23, 2025

Note

Profiling, as debugging and security hardening is a process and not a one-shot feature. Efforts should be put in simplifying the automation and onboarding the whole team into this process.
🏋️‍♀️ WIP

Jira: ARO-23245

What this PR does / why we need it:

I've refactored the pprof implementation to be production-ready and explicit.

Here's what was done:

  • cmd/aro/pprof.go (new file) -
    Dedicated pprof server implementation with: Environment variable-based configuration (PPROF_ENABLED, PPROF_PORT, PPROF_HOST)
    Localhost-only binding for security (rejects 0.0.0.0 or external IPs)
    Request origin validation middleware that blocks non-localhost requests
    Port collision detection before starting
    Graceful shutdown support
    Configurable timeouts

  • cmd/aro/pprof_test.go (new file) -
    Comprehensive unit tests covering:
    Enable/disable logic
    Port configuration
    Host configuration
    Localhost validation
    Server start/stop lifecycle
    Port collision detection
    Nil-safety

  • cmd/aro/const.go - Added pprof environment variable constants

  • cmd/aro/main.go - Replaced anonymous pprof goroutine with proper server initialization

  • env - Added pprof configuration variables

  • Dockerfile.ci-rp - Enabled pprof during CI tests

  • Makefile - Added comprehensive pprof profiling targets:
    pprof-check - Check if pprof server is running
    pprof-collect-all - Collect all profile types (CPU, heap, goroutine, allocs, block, mutex, trace)
    pprof-cpu, pprof-heap, pprof-goroutine, etc. - Open specific profiles in browser
    pprof-trace - Collect and view execution trace
    loadtest-hey - Run load tests with hey
    loadtest-vegeta - Run load tests with vegeta
    pprof-loadtest - Combined load test + profiling
    pprof-clean - Clean up profile outputs
    .gitignore - Added /pprof-data/ directory

Security Features

Localhost-only binding: The server only binds to 127.0.0.1, localhost, ::1, or [::1]
Request origin validation: Middleware rejects requests from non-localhost IPs
Default disabled: In production (RP_MODE not set to development), pprof is disabled by default
Explicit enable required: Must set PPROF_ENABLED=true to enable in production

Why pprof is in cmd/aro?
The pprof server is in cmd/aro because it's application-level profiling for the entire binary
It needs to start before any service-specific code, it should be available across all service modes (rp, monitor, portal, etc.) and it's not a reusable package - it's entry-point configuration

Test plan for issue:

Usage:

# TERMINAL 1
❯ make runlocal-rp


# =========
# TERMINAL 2
❯ make pprof-profile-endpoint ENDPOINT=all DURATION=10s RATE=10

Note: Ensure the RP is running with pprof enabled:
  Terminal 1: make runlocal-rp
  (pprof is enabled by default in development mode)

Extracting endpoints from swagger: swagger/redhatopenshift/resource-manager/Microsoft.RedHatOpenShift/openshiftclusters/stable/2025-07-25/redhatopenshift.json

Found 14 endpoints to profile

Note: After running 'make runlocal-rp', the RP frontend server is available.
      However, many endpoints require:
      - Authentication (mutual TLS or MISE)
      - Existing resources (subscriptions, clusters, etc.)
      - Valid API versions

      Failed requests (401, 404, etc.) are expected for endpoints that require setup.
      The profiling will still capture the server's behavior under load.

Continue? (y/N) y

>>> Profiling: /healthz/ready (GET) <<<

==========================================
Profiling endpoint: /healthz/ready
HTTP Method: GET
Substituted path: /healthz/ready
API version: 2025-07-25
URL: https://localhost:8443/healthz/ready?api-version=2025-07-25
Duration: 10s
Rate: 10 req/s
Profile prefix: healthz-ready
==========================================

Starting vegeta attack in background...
Note: Most endpoints require authentication (mutual TLS or MISE).
      Without proper auth, you may see:
      - 400 Bad Request: Missing/invalid api-version, validation failure, or resource not found
      - 403 Forbidden: Authentication required
      - 404 Not Found: Resource doesn't exist (expected for test data)
      - 405 Method Not Allowed: Wrong HTTP method (should be fixed now)
      Profiling will still capture server behavior under load.

Vegeta PID: 52364

Collecting profiles during load test...
  → CPU profile (10 seconds)...
    ✓ CPU: ./pprof-data/healthz-ready-cpu.prof
  → Heap profile...
    ✓ Heap: ./pprof-data/healthz-ready-heap.prof
  → Allocs profile...
    ✓ Allocs: ./pprof-data/healthz-ready-allocs.prof
  → Goroutine profile...
    ✓ Goroutine: ./pprof-data/healthz-ready-goroutine.prof
  → Block profile...
    ✓ Block: ./pprof-data/healthz-ready-block.prof
  → Mutex profile...
    ✓ Mutex: ./pprof-data/healthz-ready-mutex.prof
  → Threadcreate profile...
    ✓ Threadcreate: ./pprof-data/healthz-ready-threadcreate.prof
  → Execution trace (5s)...
    ✓ Trace: ./pprof-data/healthz-ready-trace.out

Waiting for vegeta to finish...

Vegeta report:
Requests      [total, rate, throughput]  100, 10.10, 10.10
Duration      [total, attack, wait]      9.903085125s, 9.900053334s, 3.031791ms
Latencies     [mean, 50, 95, 99, max]    2.999257ms, 2.842792ms, 4.239687ms, 10.494291ms, 16.456375ms
Bytes In      [total, mean]              300, 3.00
Bytes Out     [total, mean]              0, 0.00
Success       [ratio]                    100.00%
Status Codes  [code:count]               200:100
Error Set:

==========================================
Profile collection complete for: /healthz/ready
==========================================
  • Are there unit tests? YES
  • Are there integration/e2e tests? NO
  • Automated tests in the Makefile

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

Endpoints are secured to be accessible only from the localhost.
Also whenever production usage will be ready, specific env vars should be enabled.

Screenshots

image

- Introduced environment variables for pprof configuration in const.go.
- Updated main.go to start a pprof server if enabled.
- Enhanced Makefile with pprof-related targets for profiling and load testing.
- Modified Dockerfile.ci-rp to enable pprof during test runs.
- Updated .gitignore to exclude pprof data directory.
@tuxerrante tuxerrante added enhancement New feature or request priority-low Low priority issue or pull request size-large Size large go Pull requests that update Go code labels Dec 29, 2025
- Updated the endpoint name sanitization to use the substituted path instead of the original endpoint.
- Ensured that the endpoint name defaults to "endpoint" if sanitization results in an empty value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request go Pull requests that update Go code priority-low Low priority issue or pull request size-large Size large

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants