Skip to content

Conversation

@rafabene
Copy link
Contributor

@rafabene rafabene commented Jan 13, 2026

Summary

  • Consolidate health endpoints (/healthz, /readyz) on metrics server port 8080
  • Implement graceful shutdown with configurable SHUTDOWN_TIMEOUT (default 20s)
  • Use signal.NotifyContext for SIGTERM/SIGINT handling
  • Add proper shutdown sequence: mark not ready, drain HTTP connections, cleanup resources

Test plan

  • Verify /healthz returns 200 OK on port 8080
  • Verify /readyz returns 200 OK when ready, 503 during shutdown
  • Test graceful shutdown with SIGTERM signal
  • Verify SHUTDOWN_TIMEOUT environment variable is respected
  • Run make test and make lint

Summary by CodeRabbit

  • New Features

    • Graceful shutdown with configurable timeout (default 20s) via flag/env.
    • Liveness and readiness probes added: GET /healthz and GET /readyz on the metrics port (8080); metrics endpoint unchanged.
  • Documentation

    • Removed legacy /healthcheck; docs updated to show /healthz and /readyz on port 8080 and describe graceful shutdown.
  • Tests

    • Added unit tests for shutdown behavior and health/liveness/readiness endpoints.

✏️ Tip: You can customize this high-level summary in your review settings.

Consolidate health endpoints on metrics server (port 8080) following
HyperFleet health-endpoints.md standard:

- Add /healthz (liveness) and /readyz (readiness) endpoints
- Remove separate healthcheck server (port 8083)
- Integrate readiness state with graceful shutdown
- Update Kubernetes probes in deployment.yaml
- Update documentation
@openshift-ci
Copy link

openshift-ci bot commented Jan 13, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tirthct for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link

coderabbitai bot commented Jan 13, 2026

Walkthrough

This PR removes the standalone health-check server and its config, consolidates health endpoints into the metrics server on the metrics port (:8080), and replaces /healthcheck with two probes: GET /healthz (liveness) and GET /readyz (readiness). It adds a readiness state manager and health handlers (including optional DB checks), introduces a Shutdown configuration (shutdown-timeout, env SHUTDOWN_TIMEOUT), updates servers/interfaces to accept context-based Stop(ctx), and implements a coordinated graceful shutdown sequence. Kubernetes manifests, charts, tests, and documentation are updated to reflect these changes.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant MetricsServer
    participant ReadinessState
    participant Database
    participant App

    App->>ReadinessState: SetNotReady()
    App->>MetricsServer: Start(:8080)
    App->>ReadinessState: SetReady()

    Client->>MetricsServer: GET /healthz
    MetricsServer->>MetricsServer: LivenessHandler -> 200
    MetricsServer-->>Client: 200

    Client->>MetricsServer: GET /readyz
    MetricsServer->>ReadinessState: IsReady()
    MetricsServer->>Database: Optional DB check
    MetricsServer-->>Client: 200 or 503 (with checks)

    Client->>App: SIGTERM
    App->>ReadinessState: SetShuttingDown()
    App->>MetricsServer: Stop(ctx) (graceful)
    App->>Database: Teardown
    App-->>Client: Exit or forced after timeout
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • rh-amarin
  • ciaranRoche
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the two main changes: implementing health endpoints (/healthz and /readyz consolidation) and graceful shutdown with SHUTDOWN_TIMEOUT, which directly match the primary objectives and code changes across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/helper.go (1)

109-116: Metrics server not stopped during teardown.

startMetricsServer() is called at line 116, but stopMetricsServer is not added to the teardowns slice. This means the metrics server will continue running after test teardown, potentially causing resource leaks or port conflicts in subsequent test runs.

🐛 Proposed fix
 		helper.teardowns = []func() error{
 			helper.CleanDB,
 			jwkMockTeardown,
 			helper.stopAPIServer,
+			helper.stopMetricsServer,
 			helper.teardownEnv,
 		}
 		helper.startAPIServer()
 		helper.startMetricsServer()
🤖 Fix all issues with AI agents
In @cmd/hyperfleet-api/servecmd/cmd.go:
- Around line 116-121: The telemetry shutdown uses context.Background() which
ignores the overall shutdown timeout; change the call to use shutdownCtx (the
cancellation context that enforces the shutdown deadline) so
telemetry.Shutdown(shutdownCtx, tp) is invoked (keep the tp nil check) and
retain the existing logging (logger.WithError(ctx, err).Error(...)) so errors
are reported if the shutdown is cancelled or fails.
🧹 Nitpick comments (8)
pkg/config/shutdown.go (1)

33-41: Consider logging a warning for invalid SHUTDOWN_TIMEOUT values.

When time.ParseDuration fails, the invalid value is silently ignored. This could make debugging configuration issues harder if an operator sets an invalid value like SHUTDOWN_TIMEOUT=20 (missing unit suffix).

♻️ Optional improvement to add a warning log
+import (
+	"log/slog"
+	"os"
+	"time"
+
+	"github.com/spf13/pflag"
+)

 // BindEnv binds environment variables to config values
 // Environment variables take precedence over flags
 func (s *ShutdownConfig) BindEnv() {
 	if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" {
 		if d, err := time.ParseDuration(val); err == nil {
 			s.Timeout = d
+		} else {
+			slog.Warn("invalid SHUTDOWN_TIMEOUT value, using default",
+				"value", val,
+				"default", s.Timeout,
+				"error", err)
 		}
 	}
 }
pkg/config/shutdown_test.go (1)

10-20: Minor redundancy in default value assertions.

Lines 13-15 and 17-19 both verify the same default timeout value. The second check (explicit 20*time.Second) is redundant since DefaultShutdownTimeout is defined as 20 * time.Second.

♻️ Optional: Remove redundant assertion
 func TestNewShutdownConfig_Defaults(t *testing.T) {
 	cfg := NewShutdownConfig()
 
 	if cfg.Timeout != DefaultShutdownTimeout {
 		t.Errorf("expected default timeout %v, got %v", DefaultShutdownTimeout, cfg.Timeout)
 	}
-
-	if cfg.Timeout != 20*time.Second {
-		t.Errorf("expected timeout 20s, got %v", cfg.Timeout)
-	}
 }
pkg/health/readiness_test.go (1)

93-100: Consider adding test isolation for the singleton test.

This test accesses the global singleton but doesn't reset its state. While it only performs read operations, if the singleton is modified by previous tests (e.g., in parallel execution), it could still pass but leave the global state in an unexpected condition. Consider calling Reset() at the end for consistency with other tests.

♻️ Suggested improvement
 func TestGetReadinessState_ReturnsSingleton(t *testing.T) {
 	state1 := GetReadinessState()
 	state2 := GetReadinessState()
 
 	if state1 != state2 {
 		t.Error("expected GetReadinessState() to return the same instance")
 	}
+
+	// Reset state for other tests
+	state1.Reset()
 }
pkg/health/handlers_test.go (1)

134-168: Consider adding a test for database connection failure.

The TestReadinessHandler_IncludesChecks test uses a nil sessionFactory, which causes the handler to report database: "ok" via the fallback path. Consider adding a test with a mock SessionFactory that returns an error from CheckConnection() to verify the database: "error" path and ensure 503 is returned when the database is unreachable.

pkg/health/handlers.go (2)

10-11: Unused type definition.

CheckFunc is defined but not used anywhere in this file or the tests. If this is intended for future extensibility (e.g., registering custom health checks), consider adding a TODO comment or removing it until needed.


67-77: Consider reporting "skipped" instead of "ok" when sessionFactory is nil.

When sessionFactory is nil, the handler reports database: "ok", which could mask a misconfiguration where the database dependency wasn't properly injected. Reporting "skipped" or "not_configured" would provide clearer diagnostics.

♻️ Suggested improvement
 	// Check database connection
 	if h.sessionFactory != nil {
 		if err := h.sessionFactory.CheckConnection(); err != nil {
 			checks["database"] = "error"
 			allOK = false
 		} else {
 			checks["database"] = "ok"
 		}
 	} else {
-		checks["database"] = "ok"
+		checks["database"] = "skipped"
 	}
pkg/health/readiness.go (2)

8-12: Consider renaming shutting to shuttingDown for clarity.

The field name shutting is less descriptive than the method IsShuttingDown() and SetShuttingDown(). Aligning the field name would improve code readability.

Suggested improvement
 type ReadinessState struct {
 	mu       sync.RWMutex
 	ready    bool
-	shutting bool
+	shuttingDown bool
 }

Then update references in methods accordingly.


62-68: Consider adding documentation about production usage.

The Reset() method is correctly implemented for testing purposes. Consider strengthening the documentation to warn against calling this in production code, as resetting the shutdown state mid-flight could cause unexpected behavior.

Suggested documentation improvement
-// Reset resets the readiness state (useful for testing)
+// Reset resets the readiness state to initial values.
+// WARNING: This method is intended for testing only. Do not call in production
+// as it may cause inconsistent state during graceful shutdown.
 func (r *ReadinessState) Reset() {
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0120ac6 and 5acbbe9.

📒 Files selected for processing (21)
  • AGENTS.md
  • README.md
  • charts/templates/deployment.yaml
  • charts/templates/service.yaml
  • charts/values.yaml
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/healthcheck_server.go
  • cmd/hyperfleet-api/server/logging/request_logging_middleware.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • docs/deployment.md
  • docs/development.md
  • docs/hyperfleet-api.http
  • pkg/config/config.go
  • pkg/config/health_check.go
  • pkg/config/shutdown.go
  • pkg/config/shutdown_test.go
  • pkg/health/handlers.go
  • pkg/health/handlers_test.go
  • pkg/health/readiness.go
  • pkg/health/readiness_test.go
  • test/helper.go
💤 Files with no reviewable changes (5)
  • charts/values.yaml
  • pkg/config/health_check.go
  • cmd/hyperfleet-api/server/logging/request_logging_middleware.go
  • charts/templates/service.yaml
  • cmd/hyperfleet-api/server/healthcheck_server.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID

Files:

  • pkg/health/handlers_test.go
  • pkg/config/shutdown.go
  • pkg/health/readiness.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/health/readiness_test.go
  • test/helper.go
  • pkg/health/handlers.go
  • pkg/config/config.go
  • pkg/config/shutdown_test.go
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags

Applied to files:

  • docs/development.md
  • docs/deployment.md
  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • charts/templates/deployment.yaml
  • README.md
  • test/helper.go
  • AGENTS.md
  • docs/hyperfleet-api.http
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes

Applied to files:

  • docs/development.md
  • pkg/health/handlers_test.go
  • docs/deployment.md
  • pkg/health/readiness.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • charts/templates/deployment.yaml
  • pkg/health/readiness_test.go
  • README.md
  • test/helper.go
  • pkg/health/handlers.go
  • AGENTS.md
  • docs/hyperfleet-api.http
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : HTTP handlers must follow the pattern: parse request → call service/DAO → handle errors → send response with appropriate HTTP status codes

Applied to files:

  • pkg/health/handlers_test.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Prometheus metrics are exposed at GET /metrics endpoint for monitoring; configure metric collection via --label-metrics-inclusion-duration

Applied to files:

  • pkg/health/handlers_test.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept resource creation/update/delete requests via REST endpoints, validate, and persist to PostgreSQL

Applied to files:

  • pkg/health/handlers_test.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/health/handlers.go
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.

Applied to files:

  • pkg/health/handlers_test.go
  • pkg/config/shutdown.go
  • pkg/health/readiness.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/health/readiness_test.go
  • test/helper.go
  • pkg/health/handlers.go
  • pkg/config/config.go
  • pkg/config/shutdown_test.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/services/**/*.go : Status aggregation: Phase is 'Ready' if all adapters report Ready=True, 'Failed' if any adapter reports Ready=False, 'NotReady' otherwise

Applied to files:

  • pkg/health/readiness.go
  • pkg/health/readiness_test.go
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,dao}/**/*.go : Errors must use structured ServiceError type from pkg/errors/ with HttpCode, Code, and Reason fields; send via errors.SendError(w, r, err)

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Serve resource data to Sentinel via GET /{resourceType} endpoints with pagination and filtering support

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/environments/**/*.go : Environment-based configuration: use OCM_ENV variable to select development, unit_testing, integration_testing, or production configuration

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,db}/**/*.go : Use context values for passing request-scoped data (session, operation ID, user info) rather than function parameters or global state

Applied to files:

  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Database transactions are automatically managed by TransactionMiddleware for all HTTP requests; store session in context and auto-commit on success, rollback on error

Applied to files:

  • pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept status updates from adapters via POST /{resourceType}/{id}/statuses and store in adapter_statuses table

Applied to files:

  • pkg/health/handlers.go
🧬 Code graph analysis (6)
cmd/hyperfleet-api/server/metrics_server.go (3)
pkg/health/handlers.go (1)
  • NewHandler (19-23)
cmd/hyperfleet-api/environments/types.go (1)
  • Database (36-38)
pkg/db/session.go (1)
  • SessionFactory (12-20)
pkg/health/readiness_test.go (1)
pkg/health/readiness.go (2)
  • ReadinessState (8-12)
  • GetReadinessState (21-23)
test/helper.go (5)
pkg/db/session.go (1)
  • SessionFactory (12-20)
pkg/config/config.go (1)
  • ApplicationConfig (15-22)
cmd/hyperfleet-api/environments/types.go (1)
  • ApplicationConfig (32-34)
cmd/hyperfleet-api/server/server.go (1)
  • Server (13-18)
test/factories/factory.go (1)
  • Factories (9-10)
pkg/health/handlers.go (2)
pkg/db/session.go (1)
  • SessionFactory (12-20)
pkg/health/readiness.go (1)
  • GetReadinessState (21-23)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
  • ShutdownConfig (15-17)
  • NewShutdownConfig (19-23)
pkg/config/shutdown_test.go (1)
pkg/config/shutdown.go (2)
  • NewShutdownConfig (19-23)
  • DefaultShutdownTimeout (12-12)
🔇 Additional comments (37)
test/helper.go (2)

56-68: LGTM!

The Helper struct field formatting is clean. The removal of HealthCheckServer and related fields aligns with the PR objective to consolidate health endpoints onto the metrics server.


164-179: LGTM!

The startMetricsServer and stopMetricsServer methods follow the same pattern as the API server methods, which is consistent and maintainable.

AGENTS.md (3)

347-352: LGTM!

Documentation accurately reflects the consolidation of health endpoints onto the metrics server and the new graceful shutdown configuration. The shutdown-timeout flag with environment variable binding (SHUTDOWN_TIMEOUT) is properly documented.


690-690: LGTM!

Port documentation correctly updated to reflect the consolidated health and metrics endpoints on port 8080. Based on learnings, this updates the previous port 8083 health check configuration.


775-778: LGTM!

Health endpoints documentation correctly describes /healthz for liveness and /readyz for readiness probes. This aligns with Kubernetes conventions and the PR objectives.

pkg/config/shutdown.go (2)

10-23: LGTM!

The default shutdown timeout of 20 seconds is a reasonable value that allows for connection draining while not delaying shutdown too long. The struct and constructor follow the established patterns in the codebase.


25-27: LGTM!

Flag registration follows the same pattern as other config types in the codebase.

pkg/config/config.go (4)

15-22: LGTM!

The ApplicationConfig struct correctly replaces HealthCheck with Shutdown, aligning with the PR's consolidation of health endpoints into the metrics server and addition of graceful shutdown configuration.


24-33: LGTM!

Constructor properly initializes all config components including the new ShutdownConfig.


35-43: LGTM!

AddFlags correctly includes Shutdown.AddFlags, ensuring the --shutdown-timeout flag is registered.


45-55: Intentional omission: Shutdown.ReadFiles not called.

Note that Shutdown.ReadFiles() is not included in the readFiles slice. This is acceptable since ShutdownConfig.ReadFiles() currently returns nil (no file-based configuration). If file-based shutdown configuration is added later, remember to wire it here.

pkg/config/shutdown_test.go (3)

22-52: LGTM!

Flag registration and parsing tests are thorough, verifying both the default value and custom value parsing via command-line arguments.


54-99: LGTM!

Excellent table-driven test coverage for BindEnv. The test cases cover valid durations (seconds and minutes), invalid duration strings, and empty values. Using t.Setenv ensures proper test isolation.


101-108: LGTM!

Simple but necessary test ensuring ReadFiles doesn't return an error, which is important for interface compliance even though it's currently a no-op.

docs/development.md (1)

98-100: Documentation correctly updated to reflect consolidated health endpoints.

The URLs now accurately point to the new /healthz and /readyz endpoints on port 8080, consistent with the code changes that consolidate health checks onto the metrics server.

README.md (1)

83-85: README correctly documents the new health probe endpoints.

The liveness and readiness probe URLs are accurate and consistent with the implementation in metrics_server.go and other documentation files.

charts/templates/deployment.yaml (1)

79-94: Kubernetes probes correctly configured for consolidated health endpoints.

The liveness (/healthz) and readiness (/readyz) probes are properly configured to use the metrics port. The timing parameters are reasonable:

  • Liveness: initialDelaySeconds=15, periodSeconds=20 — avoids premature pod restarts during startup
  • Readiness: initialDelaySeconds=5, periodSeconds=10 — enables quick traffic routing once ready

This configuration aligns with the PR's goal of consolidating health and metrics endpoints on port 8080.

docs/deployment.md (2)

81-84: Environment variable documentation accurately reflects new configuration.

The METRICS_PORT description correctly notes it now serves both metrics and health endpoints. The new SHUTDOWN_TIMEOUT variable is properly documented with its default value of 20s.


277-279: Health check documentation clearly describes probe behavior.

The descriptions accurately differentiate liveness (process alive check) from readiness (traffic acceptance check with 503 during startup/shutdown), consistent with the implementation.

cmd/hyperfleet-api/server/metrics_server.go (1)

25-28: Health endpoints correctly integrated into metrics server.

Both LivenessHandler and ReadinessHandler methods exist on the Handler type with proper signatures for use as HTTP handlers. The handler is properly instantiated with the SessionFactory dependency, and both /healthz (liveness) and /readyz (readiness) endpoints are correctly registered with GET method restrictions. The ReadinessHandler includes database connectivity checks via sessionFactory.CheckConnection(), ensuring the endpoint accurately reflects system readiness status.

docs/hyperfleet-api.http (1)

195-200: LGTM!

The health endpoint documentation correctly reflects the consolidation to the metrics port (8080) with standard Kubernetes probe naming conventions (/healthz for liveness, /readyz for readiness). This aligns with the PR objective and replaces the previous /healthcheck endpoint. Based on learnings, the health check was previously on port 8083 and is now properly consolidated.

pkg/health/readiness_test.go (3)

7-17: LGTM!

Initial state test correctly validates the zero-value behavior of ReadinessState.


19-57: LGTM!

State transition tests (SetReady, SetNotReady, SetShuttingDown) comprehensively verify the state machine behavior, including the important invariant that IsReady() returns false when shutting down.


59-73: LGTM!

Reset() test properly verifies both flags are cleared.

pkg/health/handlers_test.go (4)

10-33: LGTM!

Liveness handler test correctly validates HTTP 200 response with JSON body containing status: "ok".


35-66: LGTM!

Readiness handler test properly validates the happy path with state reset and cleanup.


68-132: LGTM!

Error path tests comprehensively cover both "not ready" and "shutting down" scenarios, including the shutdown message verification.


170-209: LGTM!

Content-Type header tests for both handlers ensure proper JSON response formatting.

cmd/hyperfleet-api/servecmd/cmd.go (3)

47-54: LGTM!

Shutdown configuration binding and logging are properly implemented, providing visibility into the configured timeout.


83-90: LGTM!

The readiness signaling and signal handling are well-implemented. Using signal.NotifyContext is the idiomatic Go approach for graceful shutdown, and marking the application ready after servers start ensures probes reflect actual availability.


128-135: LGTM!

The shutdown completion/timeout handling with select is correct. The forced exit on timeout ensures the process doesn't hang indefinitely.

pkg/health/handlers.go (2)

32-43: LGTM!

Liveness handler correctly implements a minimal process-alive check that always returns 200 OK. This is appropriate for Kubernetes liveness probes, which should only fail if the process is unhealthy and needs to be restarted.


79-103: LGTM!

The readiness check logic correctly aggregates multiple health checks and returns appropriate status codes. The check ordering (shutdown → database → ready flag) ensures early exit for shutdown scenarios.

pkg/health/readiness.go (4)

14-23: LGTM!

The singleton pattern is appropriate for application-wide readiness state. Package-level initialization is thread-safe in Go, and the Reset() method addresses testability concerns.


25-37: LGTM!

Thread-safe setters with proper mutex locking. The implementation is correct and straightforward.


39-46: LGTM!

Atomically setting both shutting and ready under a single lock ensures consistent state during shutdown transitions. Well documented.


48-60: LGTM!

Correct use of RLock for concurrent read access. The !r.shutting check in IsReady() provides defensive protection against incorrect state transitions, ensuring readiness is never reported during shutdown even if SetReady() were mistakenly called after SetShuttingDown().

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @cmd/hyperfleet-api/servecmd/cmd.go:
- Around line 101-126: The Stop calls are using unbounded contexts — update the
shutdown flow so apiServer.Stop and metricsServer.Stop accept a context (or at
least use the existing shutdownCtx) and use it when calling
http.Server.Shutdown; specifically change the Stop signatures to Stop(ctx
context.Context) (or add an overload) and invoke apiServer.Stop(shutdownCtx) and
metricsServer.Stop(shutdownCtx) from this goroutine (or, if you cannot change
signatures, have each Stop create a derived context with the same timeout as
shutdownCtx and use that for http.Server.Shutdown) so that the per-server
shutdown honors the timeout the caller created (shutdownCtx) just like
telemetry.Shutdown(shutdownCtx, tp).
🧹 Nitpick comments (1)
pkg/config/shutdown.go (1)

33-41: Silent error on invalid SHUTDOWN_TIMEOUT may confuse operators.

When time.ParseDuration fails, the error is silently ignored, and the timeout falls back to the default. This could mask misconfiguration (e.g., SHUTDOWN_TIMEOUT=20 without a unit).

Consider logging a warning when parsing fails so operators are aware their configuration was not applied.

🔧 Suggested improvement
 func (s *ShutdownConfig) BindEnv() {
 	if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" {
 		if d, err := time.ParseDuration(val); err == nil {
 			s.Timeout = d
+		} else {
+			// Log warning - requires importing logger or returning error
+			// Consider: log.Printf("warning: invalid SHUTDOWN_TIMEOUT %q, using default: %v", val, err)
 		}
 	}
 }

Alternatively, change the signature to BindEnv() error to propagate the error to the caller, consistent with how other config validation might be handled.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5acbbe9 and f7a96ea.

📒 Files selected for processing (6)
  • AGENTS.md
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/deployment.md
  • pkg/config/config.go
  • pkg/config/shutdown.go
  • pkg/config/shutdown_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/config/shutdown_test.go
  • AGENTS.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID

Files:

  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/config/shutdown.go
  • pkg/config/config.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/deployment.md
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/deployment.md
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/environments/**/*.go : Environment-based configuration: use OCM_ENV variable to select development, unit_testing, integration_testing, or production configuration

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.

Applied to files:

  • cmd/hyperfleet-api/servecmd/cmd.go
  • pkg/config/shutdown.go
  • pkg/config/config.go
🧬 Code graph analysis (2)
cmd/hyperfleet-api/servecmd/cmd.go (4)
cmd/hyperfleet-api/environments/framework.go (1)
  • Environment (55-57)
pkg/telemetry/otel.go (1)
  • Shutdown (67-72)
pkg/logger/logger.go (4)
  • With (276-281)
  • Info (253-255)
  • WithError (284-289)
  • Error (265-267)
pkg/health/readiness.go (1)
  • GetReadinessState (21-23)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
  • ShutdownConfig (15-17)
  • NewShutdownConfig (19-23)
🔇 Additional comments (11)
pkg/config/shutdown.go (2)

1-13: LGTM!

Package setup and constant definition are clean and well-documented. The 20-second default is a reasonable timeout for graceful shutdown.


15-27: LGTM!

The struct definition, constructor, and flag wiring follow the established patterns used by other config types in this package (e.g., ServerConfig, MetricsConfig).

pkg/config/config.go (3)

15-22: LGTM!

The Shutdown field is properly added to ApplicationConfig, replacing the removed HealthCheck field. The JSON tag follows the snake_case convention used by other fields.


24-43: LGTM!

Constructor and flag wiring are correctly updated to include ShutdownConfig initialization and flag binding, consistent with other config subsystems.


45-64: Shutdown.ReadFiles intentionally omitted from ReadFiles.

ShutdownConfig.ReadFiles() currently returns nil with no file-based configuration. Omitting it from this list is reasonable, but if file-based shutdown config is added later, remember to include it here.

cmd/hyperfleet-api/servecmd/cmd.go (4)

16-16: LGTM!

Health package import added for readiness state management.


47-54: LGTM!

Environment binding and structured logging of the shutdown timeout follow the codebase conventions. Based on learnings, this correctly uses logger.With(ctx, ...) for structured logging.


83-90: LGTM!

The readiness state is correctly set after servers start, and signal.NotifyContext is the idiomatic Go pattern for signal handling with context cancellation.


128-135: LGTM!

The select pattern correctly handles both graceful completion and timeout scenarios. Using os.Exit(1) on timeout ensures the process terminates even if shutdown hangs.

docs/deployment.md (2)

81-84: LGTM!

Documentation correctly reflects the updated port configuration and new SHUTDOWN_TIMEOUT environment variable with its default value.


277-279: LGTM!

Health check documentation accurately describes the consolidated endpoints on port 8080, matching the implementation. The distinction between liveness (/healthz) and readiness (/readyz) behavior is clearly explained. Based on learnings, this updates the previous documentation that referenced /healthcheck on a separate port.

Add configurable graceful shutdown with SHUTDOWN_TIMEOUT environment
variable (default 20s). Use signal.NotifyContext for SIGTERM/SIGINT
handling and implement proper shutdown sequence: mark not ready,
drain HTTP connections, cleanup OpenTelemetry, close database.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
test/helper.go (1)

109-116: Metrics server not included in teardowns.

The metrics server is started at line 116 but stopMetricsServer is not added to the teardowns slice. This leaves the metrics server goroutine running and port 8080 bound after tests complete.

🔧 Suggested fix
 	helper.teardowns = []func() error{
 		helper.CleanDB,
 		jwkMockTeardown,
 		helper.stopAPIServer,
+		helper.stopMetricsServer,
 		helper.teardownEnv,
 	}
pkg/config/config.go (1)

45-64: Add Shutdown.ReadFiles() to the ApplicationConfig.ReadFiles() method for consistency.

Shutdown.ReadFiles() is missing from the configuration loading in ReadFiles(), even though all other config types (Server, Database, OCM, Metrics, Logging) are included. For consistency, it should be added to the list, similar to Server, Metrics, and Logging which also return nil.

cmd/hyperfleet-api/server/api_server.go (1)

140-155: Remove database close from Start() method; it conflicts with graceful shutdown orchestration.

The Start() method closes the database connection after Serve() returns (lines 152-154). During graceful shutdown, apiServer.Stop(ctx) is called from cmd.go:109, which triggers HTTP server shutdown and causes Serve() to return. This immediately closes the database. Subsequently, cmd.go:124 calls environments.Environment().Teardown(), which closes the database again. This results in a double-close of the SessionFactory during every graceful shutdown, violating the code's own comment that this must be called "exactly once during an app's lifetime."

Move database cleanup responsibility entirely to the shutdown orchestrator in cmd.go (already calling Teardown() at line 124) and remove the redundant close from the Start() method.

🧹 Nitpick comments (1)
pkg/config/shutdown.go (1)

33-41: Silent failure on invalid SHUTDOWN_TIMEOUT value.

If the environment variable contains an invalid duration string, the parse error is silently ignored and the default/flag value is retained. This could lead to unexpected behavior if an operator misconfigures the timeout (e.g., SHUTDOWN_TIMEOUT=20 without a unit).

Consider logging a warning when parsing fails so operators are aware of the misconfiguration.

♻️ Suggested improvement
+import (
+	"context"
+	"os"
+	"time"
+
+	"github.com/openshift-hyperfleet/hyperfleet-api/pkg/logger"
+	"github.com/spf13/pflag"
+)

 func (s *ShutdownConfig) BindEnv() {
 	if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" {
 		if d, err := time.ParseDuration(val); err == nil {
 			s.Timeout = d
+		} else {
+			logger.Warn(context.Background(), "Invalid SHUTDOWN_TIMEOUT value, using default", "value", val, "error", err)
 		}
 	}
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7a96ea and 28df995.

📒 Files selected for processing (10)
  • AGENTS.md
  • cmd/hyperfleet-api/servecmd/cmd.go
  • cmd/hyperfleet-api/server/api_server.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • cmd/hyperfleet-api/server/server.go
  • docs/deployment.md
  • pkg/config/config.go
  • pkg/config/shutdown.go
  • pkg/config/shutdown_test.go
  • test/helper.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/config/shutdown_test.go
  • cmd/hyperfleet-api/servecmd/cmd.go
  • docs/deployment.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID

Files:

  • cmd/hyperfleet-api/server/server.go
  • cmd/hyperfleet-api/server/api_server.go
  • test/helper.go
  • pkg/config/shutdown.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/config/config.go
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags

Applied to files:

  • cmd/hyperfleet-api/server/server.go
  • cmd/hyperfleet-api/server/api_server.go
  • test/helper.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • AGENTS.md
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.

Applied to files:

  • cmd/hyperfleet-api/server/server.go
  • cmd/hyperfleet-api/server/api_server.go
  • test/helper.go
  • pkg/config/shutdown.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • pkg/config/config.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes

Applied to files:

  • test/helper.go
  • cmd/hyperfleet-api/server/metrics_server.go
  • AGENTS.md
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Prometheus metrics are exposed at GET /metrics endpoint for monitoring; configure metric collection via --label-metrics-inclusion-duration

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : HTTP handlers must follow the pattern: parse request → call service/DAO → handle errors → send response with appropriate HTTP status codes

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept resource creation/update/delete requests via REST endpoints, validate, and persist to PostgreSQL

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,dao}/**/*.go : Errors must use structured ServiceError type from pkg/errors/ with HttpCode, Code, and Reason fields; send via errors.SendError(w, r, err)

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Serve resource data to Sentinel via GET /{resourceType} endpoints with pagination and filtering support

Applied to files:

  • cmd/hyperfleet-api/server/metrics_server.go
🧬 Code graph analysis (4)
cmd/hyperfleet-api/server/api_server.go (1)
pkg/telemetry/otel.go (1)
  • Shutdown (67-72)
test/helper.go (5)
pkg/db/session.go (1)
  • SessionFactory (12-20)
pkg/config/config.go (1)
  • ApplicationConfig (15-22)
cmd/hyperfleet-api/environments/types.go (1)
  • ApplicationConfig (32-34)
cmd/hyperfleet-api/server/server.go (1)
  • Server (13-18)
test/factories/factory.go (1)
  • Factories (9-10)
cmd/hyperfleet-api/server/metrics_server.go (3)
pkg/health/handlers.go (1)
  • NewHandler (19-23)
cmd/hyperfleet-api/environments/types.go (1)
  • Database (36-38)
pkg/db/session.go (1)
  • SessionFactory (12-20)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
  • ShutdownConfig (15-17)
  • NewShutdownConfig (19-23)
🔇 Additional comments (15)
test/helper.go (2)

56-68: LGTM!

The Helper struct cleanly removes the HealthCheckServer field, aligning with the PR's consolidation of health endpoints onto the metrics server.


157-179: LGTM!

The Stop calls now correctly pass context.Background() to match the updated Server interface signature. Using context.Background() is appropriate for test teardown where there's no parent context or timeout requirement.

AGENTS.md (3)

347-352: LGTM!

Documentation accurately reflects the consolidation of health endpoints onto the metrics server and the new graceful shutdown configuration. The --shutdown-timeout flag with SHUTDOWN_TIMEOUT environment variable is clearly documented.


690-691: LGTM!

Server configuration section correctly updated to show all three endpoints (/metrics, /healthz, /readyz) served on port 8080.


775-778: LGTM!

Health endpoints documentation is clear and accurate:

  • /healthz for liveness (process alive check)
  • /readyz for readiness (database connection check)

This aligns with Kubernetes probe conventions and the implementation in pkg/health/handlers.go.

pkg/config/shutdown.go (2)

10-13: LGTM!

The 20-second default shutdown timeout is a reasonable choice that provides adequate time for connection draining while not extending shutdown unnecessarily.


15-23: LGTM!

Clean struct definition and constructor following the established config pattern in the codebase.

pkg/config/config.go (3)

15-22: LGTM!

The Shutdown field is correctly added to ApplicationConfig, replacing the removed HealthCheck configuration. The struct maintains consistency with other config fields.


24-33: LGTM!

Constructor properly initializes Shutdown via NewShutdownConfig(), ensuring the default timeout is set.


35-43: LGTM!

The --shutdown-timeout flag is correctly wired through c.Shutdown.AddFlags(flagset).

cmd/hyperfleet-api/server/metrics_server.go (2)

25-29: LGTM!

Health endpoints are correctly integrated:

  • /healthz for liveness probe (process alive)
  • /readyz for readiness probe (database connectivity)
  • Handler properly initialized with SessionFactory for database health checks
  • HTTP methods correctly restricted to GET

This consolidates health checks onto the metrics server as intended. Based on learnings, this replaces the previous /healthcheck endpoint on port 8083.


77-79: LGTM!

The Stop method correctly accepts a context and passes it to http.Server.Shutdown(ctx), enabling graceful shutdown with timeout control from the caller.

cmd/hyperfleet-api/server/server.go (2)

13-18: LGTM! Context-aware Stop signature enables proper graceful shutdown.

Adding context.Context to the Stop method allows callers to control shutdown timeouts and cancellation, which is the correct pattern for graceful shutdown in Go.


27-34: Correct handling of server closed error.

The check function appropriately filters out http.ErrServerClosed, which is expected during graceful shutdown and should not cause an exit.

cmd/hyperfleet-api/server/api_server.go (1)

157-158: LGTM! Context is correctly propagated to HTTP server shutdown.

The implementation properly passes the context to httpServer.Shutdown(ctx), allowing the graceful shutdown timeout configured via SHUTDOWN_TIMEOUT to be respected. This is consistent with similar patterns in pkg/telemetry/otel.go:Shutdown.

initialDelaySeconds: 30
periodSeconds: 10
path: /healthz
port: metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest separating health endpoints from the metrics port and adopting a clear three-port layout:

  1. Port 8000: API server
  2. Port 8080: Metrics server (/metrics only, for Prometheus)
  3. Port 8081: Health server (/healthz, /readyz, for kubelet)

Benefit:

  • Single responsibility: Each port serves a well-defined purpose, which makes the runtime model easier to reason about and maintain.
  • Failure isolation: Metrics collection issues (e.g., slow scrapes or handler contention) won’t impact health probes, and probe failures won’t interfere with observability.
  • Network policy clarity: Clean separation allows tighter and more explicit network policies (Prometheus → metrics, kubelet → health), reducing accidental coupling.

The current two-port setup is functional, but the three-port architecture provides clearer separation of concerns with minimal additional complexity and better long-term maintainability.

}
} else {
checks["database"] = "ok"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, there is a potential risk:

Network partition between application and database:

  1. CheckConnection() attempts TCP connection to database
  2. Network is unreachable (e.g., security group change, AZ failure)
  3. TCP connection blocks until system timeout (~75-120s on Linux)
  4. Kubelet's readiness probe times out after 3s and closes HTTP connection
  5. But the handler goroutine continues running for 120s waiting for TCP timeout
  6. Next probe (10s later) creates another leaked goroutine
  7. Accumulates ~12 leaked goroutines per 2 minutes during network issues

Impact

  • Resource leak: Goroutines accumulate, potentially causing memory pressure
  • Connection pool exhaustion: Each leaked goroutine may hold a DB connection attempt
  • Pod marked NotReady: Service degradation when all pods affected

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants