-
Notifications
You must be signed in to change notification settings - Fork 8
HYPERFLEET-453, HYPERFLEET-454 - Implement health endpoints and graceful shutdown #34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Consolidate health endpoints on metrics server (port 8080) following HyperFleet health-endpoints.md standard: - Add /healthz (liveness) and /readyz (readiness) endpoints - Remove separate healthcheck server (port 8083) - Integrate readiness state with graceful shutdown - Update Kubernetes probes in deployment.yaml - Update documentation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
WalkthroughThis PR removes the standalone health-check server and its config, consolidates health endpoints into the metrics server on the metrics port (:8080), and replaces /healthcheck with two probes: GET /healthz (liveness) and GET /readyz (readiness). It adds a readiness state manager and health handlers (including optional DB checks), introduces a Shutdown configuration (shutdown-timeout, env SHUTDOWN_TIMEOUT), updates servers/interfaces to accept context-based Stop(ctx), and implements a coordinated graceful shutdown sequence. Kubernetes manifests, charts, tests, and documentation are updated to reflect these changes. Sequence Diagram(s)sequenceDiagram
participant Client
participant MetricsServer
participant ReadinessState
participant Database
participant App
App->>ReadinessState: SetNotReady()
App->>MetricsServer: Start(:8080)
App->>ReadinessState: SetReady()
Client->>MetricsServer: GET /healthz
MetricsServer->>MetricsServer: LivenessHandler -> 200
MetricsServer-->>Client: 200
Client->>MetricsServer: GET /readyz
MetricsServer->>ReadinessState: IsReady()
MetricsServer->>Database: Optional DB check
MetricsServer-->>Client: 200 or 503 (with checks)
Client->>App: SIGTERM
App->>ReadinessState: SetShuttingDown()
App->>MetricsServer: Stop(ctx) (graceful)
App->>Database: Teardown
App-->>Client: Exit or forced after timeout
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/helper.go (1)
109-116: Metrics server not stopped during teardown.
startMetricsServer()is called at line 116, butstopMetricsServeris not added to theteardownsslice. This means the metrics server will continue running after test teardown, potentially causing resource leaks or port conflicts in subsequent test runs.🐛 Proposed fix
helper.teardowns = []func() error{ helper.CleanDB, jwkMockTeardown, helper.stopAPIServer, + helper.stopMetricsServer, helper.teardownEnv, } helper.startAPIServer() helper.startMetricsServer()
🤖 Fix all issues with AI agents
In @cmd/hyperfleet-api/servecmd/cmd.go:
- Around line 116-121: The telemetry shutdown uses context.Background() which
ignores the overall shutdown timeout; change the call to use shutdownCtx (the
cancellation context that enforces the shutdown deadline) so
telemetry.Shutdown(shutdownCtx, tp) is invoked (keep the tp nil check) and
retain the existing logging (logger.WithError(ctx, err).Error(...)) so errors
are reported if the shutdown is cancelled or fails.
🧹 Nitpick comments (8)
pkg/config/shutdown.go (1)
33-41: Consider logging a warning for invalidSHUTDOWN_TIMEOUTvalues.When
time.ParseDurationfails, the invalid value is silently ignored. This could make debugging configuration issues harder if an operator sets an invalid value likeSHUTDOWN_TIMEOUT=20(missing unit suffix).♻️ Optional improvement to add a warning log
+import ( + "log/slog" + "os" + "time" + + "github.com/spf13/pflag" +) // BindEnv binds environment variables to config values // Environment variables take precedence over flags func (s *ShutdownConfig) BindEnv() { if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" { if d, err := time.ParseDuration(val); err == nil { s.Timeout = d + } else { + slog.Warn("invalid SHUTDOWN_TIMEOUT value, using default", + "value", val, + "default", s.Timeout, + "error", err) } } }pkg/config/shutdown_test.go (1)
10-20: Minor redundancy in default value assertions.Lines 13-15 and 17-19 both verify the same default timeout value. The second check (explicit
20*time.Second) is redundant sinceDefaultShutdownTimeoutis defined as20 * time.Second.♻️ Optional: Remove redundant assertion
func TestNewShutdownConfig_Defaults(t *testing.T) { cfg := NewShutdownConfig() if cfg.Timeout != DefaultShutdownTimeout { t.Errorf("expected default timeout %v, got %v", DefaultShutdownTimeout, cfg.Timeout) } - - if cfg.Timeout != 20*time.Second { - t.Errorf("expected timeout 20s, got %v", cfg.Timeout) - } }pkg/health/readiness_test.go (1)
93-100: Consider adding test isolation for the singleton test.This test accesses the global singleton but doesn't reset its state. While it only performs read operations, if the singleton is modified by previous tests (e.g., in parallel execution), it could still pass but leave the global state in an unexpected condition. Consider calling
Reset()at the end for consistency with other tests.♻️ Suggested improvement
func TestGetReadinessState_ReturnsSingleton(t *testing.T) { state1 := GetReadinessState() state2 := GetReadinessState() if state1 != state2 { t.Error("expected GetReadinessState() to return the same instance") } + + // Reset state for other tests + state1.Reset() }pkg/health/handlers_test.go (1)
134-168: Consider adding a test for database connection failure.The
TestReadinessHandler_IncludesCheckstest uses anilsessionFactory, which causes the handler to reportdatabase: "ok"via the fallback path. Consider adding a test with a mockSessionFactorythat returns an error fromCheckConnection()to verify thedatabase: "error"path and ensure 503 is returned when the database is unreachable.pkg/health/handlers.go (2)
10-11: Unused type definition.
CheckFuncis defined but not used anywhere in this file or the tests. If this is intended for future extensibility (e.g., registering custom health checks), consider adding a TODO comment or removing it until needed.
67-77: Consider reporting "skipped" instead of "ok" when sessionFactory is nil.When
sessionFactoryis nil, the handler reportsdatabase: "ok", which could mask a misconfiguration where the database dependency wasn't properly injected. Reporting"skipped"or"not_configured"would provide clearer diagnostics.♻️ Suggested improvement
// Check database connection if h.sessionFactory != nil { if err := h.sessionFactory.CheckConnection(); err != nil { checks["database"] = "error" allOK = false } else { checks["database"] = "ok" } } else { - checks["database"] = "ok" + checks["database"] = "skipped" }pkg/health/readiness.go (2)
8-12: Consider renamingshuttingtoshuttingDownfor clarity.The field name
shuttingis less descriptive than the methodIsShuttingDown()andSetShuttingDown(). Aligning the field name would improve code readability.Suggested improvement
type ReadinessState struct { mu sync.RWMutex ready bool - shutting bool + shuttingDown bool }Then update references in methods accordingly.
62-68: Consider adding documentation about production usage.The
Reset()method is correctly implemented for testing purposes. Consider strengthening the documentation to warn against calling this in production code, as resetting the shutdown state mid-flight could cause unexpected behavior.Suggested documentation improvement
-// Reset resets the readiness state (useful for testing) +// Reset resets the readiness state to initial values. +// WARNING: This method is intended for testing only. Do not call in production +// as it may cause inconsistent state during graceful shutdown. func (r *ReadinessState) Reset() {
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
AGENTS.mdREADME.mdcharts/templates/deployment.yamlcharts/templates/service.yamlcharts/values.yamlcmd/hyperfleet-api/servecmd/cmd.gocmd/hyperfleet-api/server/healthcheck_server.gocmd/hyperfleet-api/server/logging/request_logging_middleware.gocmd/hyperfleet-api/server/metrics_server.godocs/deployment.mddocs/development.mddocs/hyperfleet-api.httppkg/config/config.gopkg/config/health_check.gopkg/config/shutdown.gopkg/config/shutdown_test.gopkg/health/handlers.gopkg/health/handlers_test.gopkg/health/readiness.gopkg/health/readiness_test.gotest/helper.go
💤 Files with no reviewable changes (5)
- charts/values.yaml
- pkg/config/health_check.go
- cmd/hyperfleet-api/server/logging/request_logging_middleware.go
- charts/templates/service.yaml
- cmd/hyperfleet-api/server/healthcheck_server.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID
Files:
pkg/health/handlers_test.gopkg/config/shutdown.gopkg/health/readiness.gocmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.gopkg/health/readiness_test.gotest/helper.gopkg/health/handlers.gopkg/config/config.gopkg/config/shutdown_test.go
🧠 Learnings (17)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags
Applied to files:
docs/development.mddocs/deployment.mdcmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.gocharts/templates/deployment.yamlREADME.mdtest/helper.goAGENTS.mddocs/hyperfleet-api.http
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
Applied to files:
docs/development.mdpkg/health/handlers_test.godocs/deployment.mdpkg/health/readiness.gocmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.gocharts/templates/deployment.yamlpkg/health/readiness_test.goREADME.mdtest/helper.gopkg/health/handlers.goAGENTS.mddocs/hyperfleet-api.http
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : HTTP handlers must follow the pattern: parse request → call service/DAO → handle errors → send response with appropriate HTTP status codes
Applied to files:
pkg/health/handlers_test.gocmd/hyperfleet-api/server/metrics_server.gopkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Prometheus metrics are exposed at GET /metrics endpoint for monitoring; configure metric collection via --label-metrics-inclusion-duration
Applied to files:
pkg/health/handlers_test.gocmd/hyperfleet-api/server/metrics_server.gopkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept resource creation/update/delete requests via REST endpoints, validate, and persist to PostgreSQL
Applied to files:
pkg/health/handlers_test.gocmd/hyperfleet-api/server/metrics_server.gopkg/health/handlers.go
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.
Applied to files:
pkg/health/handlers_test.gopkg/config/shutdown.gopkg/health/readiness.gocmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.gopkg/health/readiness_test.gotest/helper.gopkg/health/handlers.gopkg/config/config.gopkg/config/shutdown_test.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/services/**/*.go : Status aggregation: Phase is 'Ready' if all adapters report Ready=True, 'Failed' if any adapter reports Ready=False, 'NotReady' otherwise
Applied to files:
pkg/health/readiness.gopkg/health/readiness_test.go
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.
Applied to files:
cmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment
Applied to files:
cmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations
Applied to files:
cmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,dao}/**/*.go : Errors must use structured ServiceError type from pkg/errors/ with HttpCode, Code, and Reason fields; send via errors.SendError(w, r, err)
Applied to files:
cmd/hyperfleet-api/server/metrics_server.gopkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Serve resource data to Sentinel via GET /{resourceType} endpoints with pagination and filtering support
Applied to files:
cmd/hyperfleet-api/server/metrics_server.gopkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/environments/**/*.go : Environment-based configuration: use OCM_ENV variable to select development, unit_testing, integration_testing, or production configuration
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,db}/**/*.go : Use context values for passing request-scoped data (session, operation ID, user info) rather than function parameters or global state
Applied to files:
pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Database transactions are automatically managed by TransactionMiddleware for all HTTP requests; store session in context and auto-commit on success, rollback on error
Applied to files:
pkg/health/handlers.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept status updates from adapters via POST /{resourceType}/{id}/statuses and store in adapter_statuses table
Applied to files:
pkg/health/handlers.go
🧬 Code graph analysis (6)
cmd/hyperfleet-api/server/metrics_server.go (3)
pkg/health/handlers.go (1)
NewHandler(19-23)cmd/hyperfleet-api/environments/types.go (1)
Database(36-38)pkg/db/session.go (1)
SessionFactory(12-20)
pkg/health/readiness_test.go (1)
pkg/health/readiness.go (2)
ReadinessState(8-12)GetReadinessState(21-23)
test/helper.go (5)
pkg/db/session.go (1)
SessionFactory(12-20)pkg/config/config.go (1)
ApplicationConfig(15-22)cmd/hyperfleet-api/environments/types.go (1)
ApplicationConfig(32-34)cmd/hyperfleet-api/server/server.go (1)
Server(13-18)test/factories/factory.go (1)
Factories(9-10)
pkg/health/handlers.go (2)
pkg/db/session.go (1)
SessionFactory(12-20)pkg/health/readiness.go (1)
GetReadinessState(21-23)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
ShutdownConfig(15-17)NewShutdownConfig(19-23)
pkg/config/shutdown_test.go (1)
pkg/config/shutdown.go (2)
NewShutdownConfig(19-23)DefaultShutdownTimeout(12-12)
🔇 Additional comments (37)
test/helper.go (2)
56-68: LGTM!The
Helperstruct field formatting is clean. The removal ofHealthCheckServerand related fields aligns with the PR objective to consolidate health endpoints onto the metrics server.
164-179: LGTM!The
startMetricsServerandstopMetricsServermethods follow the same pattern as the API server methods, which is consistent and maintainable.AGENTS.md (3)
347-352: LGTM!Documentation accurately reflects the consolidation of health endpoints onto the metrics server and the new graceful shutdown configuration. The
shutdown-timeoutflag with environment variable binding (SHUTDOWN_TIMEOUT) is properly documented.
690-690: LGTM!Port documentation correctly updated to reflect the consolidated health and metrics endpoints on port 8080. Based on learnings, this updates the previous port 8083 health check configuration.
775-778: LGTM!Health endpoints documentation correctly describes
/healthzfor liveness and/readyzfor readiness probes. This aligns with Kubernetes conventions and the PR objectives.pkg/config/shutdown.go (2)
10-23: LGTM!The default shutdown timeout of 20 seconds is a reasonable value that allows for connection draining while not delaying shutdown too long. The struct and constructor follow the established patterns in the codebase.
25-27: LGTM!Flag registration follows the same pattern as other config types in the codebase.
pkg/config/config.go (4)
15-22: LGTM!The
ApplicationConfigstruct correctly replacesHealthCheckwithShutdown, aligning with the PR's consolidation of health endpoints into the metrics server and addition of graceful shutdown configuration.
24-33: LGTM!Constructor properly initializes all config components including the new
ShutdownConfig.
35-43: LGTM!
AddFlagscorrectly includesShutdown.AddFlags, ensuring the--shutdown-timeoutflag is registered.
45-55: Intentional omission:Shutdown.ReadFilesnot called.Note that
Shutdown.ReadFiles()is not included in thereadFilesslice. This is acceptable sinceShutdownConfig.ReadFiles()currently returnsnil(no file-based configuration). If file-based shutdown configuration is added later, remember to wire it here.pkg/config/shutdown_test.go (3)
22-52: LGTM!Flag registration and parsing tests are thorough, verifying both the default value and custom value parsing via command-line arguments.
54-99: LGTM!Excellent table-driven test coverage for
BindEnv. The test cases cover valid durations (seconds and minutes), invalid duration strings, and empty values. Usingt.Setenvensures proper test isolation.
101-108: LGTM!Simple but necessary test ensuring
ReadFilesdoesn't return an error, which is important for interface compliance even though it's currently a no-op.docs/development.md (1)
98-100: Documentation correctly updated to reflect consolidated health endpoints.The URLs now accurately point to the new
/healthzand/readyzendpoints on port 8080, consistent with the code changes that consolidate health checks onto the metrics server.README.md (1)
83-85: README correctly documents the new health probe endpoints.The liveness and readiness probe URLs are accurate and consistent with the implementation in
metrics_server.goand other documentation files.charts/templates/deployment.yaml (1)
79-94: Kubernetes probes correctly configured for consolidated health endpoints.The liveness (
/healthz) and readiness (/readyz) probes are properly configured to use themetricsport. The timing parameters are reasonable:
- Liveness:
initialDelaySeconds=15,periodSeconds=20— avoids premature pod restarts during startup- Readiness:
initialDelaySeconds=5,periodSeconds=10— enables quick traffic routing once readyThis configuration aligns with the PR's goal of consolidating health and metrics endpoints on port 8080.
docs/deployment.md (2)
81-84: Environment variable documentation accurately reflects new configuration.The
METRICS_PORTdescription correctly notes it now serves both metrics and health endpoints. The newSHUTDOWN_TIMEOUTvariable is properly documented with its default value of20s.
277-279: Health check documentation clearly describes probe behavior.The descriptions accurately differentiate liveness (process alive check) from readiness (traffic acceptance check with 503 during startup/shutdown), consistent with the implementation.
cmd/hyperfleet-api/server/metrics_server.go (1)
25-28: Health endpoints correctly integrated into metrics server.Both
LivenessHandlerandReadinessHandlermethods exist on theHandlertype with proper signatures for use as HTTP handlers. The handler is properly instantiated with theSessionFactorydependency, and both/healthz(liveness) and/readyz(readiness) endpoints are correctly registered with GET method restrictions. TheReadinessHandlerincludes database connectivity checks viasessionFactory.CheckConnection(), ensuring the endpoint accurately reflects system readiness status.docs/hyperfleet-api.http (1)
195-200: LGTM!The health endpoint documentation correctly reflects the consolidation to the metrics port (8080) with standard Kubernetes probe naming conventions (
/healthzfor liveness,/readyzfor readiness). This aligns with the PR objective and replaces the previous/healthcheckendpoint. Based on learnings, the health check was previously on port 8083 and is now properly consolidated.pkg/health/readiness_test.go (3)
7-17: LGTM!Initial state test correctly validates the zero-value behavior of
ReadinessState.
19-57: LGTM!State transition tests (
SetReady,SetNotReady,SetShuttingDown) comprehensively verify the state machine behavior, including the important invariant thatIsReady()returns false when shutting down.
59-73: LGTM!
Reset()test properly verifies both flags are cleared.pkg/health/handlers_test.go (4)
10-33: LGTM!Liveness handler test correctly validates HTTP 200 response with JSON body containing
status: "ok".
35-66: LGTM!Readiness handler test properly validates the happy path with state reset and cleanup.
68-132: LGTM!Error path tests comprehensively cover both "not ready" and "shutting down" scenarios, including the shutdown message verification.
170-209: LGTM!Content-Type header tests for both handlers ensure proper JSON response formatting.
cmd/hyperfleet-api/servecmd/cmd.go (3)
47-54: LGTM!Shutdown configuration binding and logging are properly implemented, providing visibility into the configured timeout.
83-90: LGTM!The readiness signaling and signal handling are well-implemented. Using
signal.NotifyContextis the idiomatic Go approach for graceful shutdown, and marking the application ready after servers start ensures probes reflect actual availability.
128-135: LGTM!The shutdown completion/timeout handling with select is correct. The forced exit on timeout ensures the process doesn't hang indefinitely.
pkg/health/handlers.go (2)
32-43: LGTM!Liveness handler correctly implements a minimal process-alive check that always returns 200 OK. This is appropriate for Kubernetes liveness probes, which should only fail if the process is unhealthy and needs to be restarted.
79-103: LGTM!The readiness check logic correctly aggregates multiple health checks and returns appropriate status codes. The check ordering (shutdown → database → ready flag) ensures early exit for shutdown scenarios.
pkg/health/readiness.go (4)
14-23: LGTM!The singleton pattern is appropriate for application-wide readiness state. Package-level initialization is thread-safe in Go, and the
Reset()method addresses testability concerns.
25-37: LGTM!Thread-safe setters with proper mutex locking. The implementation is correct and straightforward.
39-46: LGTM!Atomically setting both
shuttingandreadyunder a single lock ensures consistent state during shutdown transitions. Well documented.
48-60: LGTM!Correct use of
RLockfor concurrent read access. The!r.shuttingcheck inIsReady()provides defensive protection against incorrect state transitions, ensuring readiness is never reported during shutdown even ifSetReady()were mistakenly called afterSetShuttingDown().
5acbbe9 to
f7a96ea
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @cmd/hyperfleet-api/servecmd/cmd.go:
- Around line 101-126: The Stop calls are using unbounded contexts — update the
shutdown flow so apiServer.Stop and metricsServer.Stop accept a context (or at
least use the existing shutdownCtx) and use it when calling
http.Server.Shutdown; specifically change the Stop signatures to Stop(ctx
context.Context) (or add an overload) and invoke apiServer.Stop(shutdownCtx) and
metricsServer.Stop(shutdownCtx) from this goroutine (or, if you cannot change
signatures, have each Stop create a derived context with the same timeout as
shutdownCtx and use that for http.Server.Shutdown) so that the per-server
shutdown honors the timeout the caller created (shutdownCtx) just like
telemetry.Shutdown(shutdownCtx, tp).
🧹 Nitpick comments (1)
pkg/config/shutdown.go (1)
33-41: Silent error on invalidSHUTDOWN_TIMEOUTmay confuse operators.When
time.ParseDurationfails, the error is silently ignored, and the timeout falls back to the default. This could mask misconfiguration (e.g.,SHUTDOWN_TIMEOUT=20without a unit).Consider logging a warning when parsing fails so operators are aware their configuration was not applied.
🔧 Suggested improvement
func (s *ShutdownConfig) BindEnv() { if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" { if d, err := time.ParseDuration(val); err == nil { s.Timeout = d + } else { + // Log warning - requires importing logger or returning error + // Consider: log.Printf("warning: invalid SHUTDOWN_TIMEOUT %q, using default: %v", val, err) } } }Alternatively, change the signature to
BindEnv() errorto propagate the error to the caller, consistent with how other config validation might be handled.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
AGENTS.mdcmd/hyperfleet-api/servecmd/cmd.godocs/deployment.mdpkg/config/config.gopkg/config/shutdown.gopkg/config/shutdown_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/config/shutdown_test.go
- AGENTS.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID
Files:
cmd/hyperfleet-api/servecmd/cmd.gopkg/config/shutdown.gopkg/config/config.go
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.godocs/deployment.md
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.godocs/deployment.md
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/environments/**/*.go : Environment-based configuration: use OCM_ENV variable to select development, unit_testing, integration_testing, or production configuration
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.go
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.
Applied to files:
cmd/hyperfleet-api/servecmd/cmd.gopkg/config/shutdown.gopkg/config/config.go
🧬 Code graph analysis (2)
cmd/hyperfleet-api/servecmd/cmd.go (4)
cmd/hyperfleet-api/environments/framework.go (1)
Environment(55-57)pkg/telemetry/otel.go (1)
Shutdown(67-72)pkg/logger/logger.go (4)
With(276-281)Info(253-255)WithError(284-289)Error(265-267)pkg/health/readiness.go (1)
GetReadinessState(21-23)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
ShutdownConfig(15-17)NewShutdownConfig(19-23)
🔇 Additional comments (11)
pkg/config/shutdown.go (2)
1-13: LGTM!Package setup and constant definition are clean and well-documented. The 20-second default is a reasonable timeout for graceful shutdown.
15-27: LGTM!The struct definition, constructor, and flag wiring follow the established patterns used by other config types in this package (e.g.,
ServerConfig,MetricsConfig).pkg/config/config.go (3)
15-22: LGTM!The
Shutdownfield is properly added toApplicationConfig, replacing the removedHealthCheckfield. The JSON tag follows the snake_case convention used by other fields.
24-43: LGTM!Constructor and flag wiring are correctly updated to include
ShutdownConfiginitialization and flag binding, consistent with other config subsystems.
45-64:Shutdown.ReadFilesintentionally omitted fromReadFiles.
ShutdownConfig.ReadFiles()currently returnsnilwith no file-based configuration. Omitting it from this list is reasonable, but if file-based shutdown config is added later, remember to include it here.cmd/hyperfleet-api/servecmd/cmd.go (4)
16-16: LGTM!Health package import added for readiness state management.
47-54: LGTM!Environment binding and structured logging of the shutdown timeout follow the codebase conventions. Based on learnings, this correctly uses
logger.With(ctx, ...)for structured logging.
83-90: LGTM!The readiness state is correctly set after servers start, and
signal.NotifyContextis the idiomatic Go pattern for signal handling with context cancellation.
128-135: LGTM!The select pattern correctly handles both graceful completion and timeout scenarios. Using
os.Exit(1)on timeout ensures the process terminates even if shutdown hangs.docs/deployment.md (2)
81-84: LGTM!Documentation correctly reflects the updated port configuration and new
SHUTDOWN_TIMEOUTenvironment variable with its default value.
277-279: LGTM!Health check documentation accurately describes the consolidated endpoints on port 8080, matching the implementation. The distinction between liveness (
/healthz) and readiness (/readyz) behavior is clearly explained. Based on learnings, this updates the previous documentation that referenced/healthcheckon a separate port.
Add configurable graceful shutdown with SHUTDOWN_TIMEOUT environment variable (default 20s). Use signal.NotifyContext for SIGTERM/SIGINT handling and implement proper shutdown sequence: mark not ready, drain HTTP connections, cleanup OpenTelemetry, close database.
f7a96ea to
28df995
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
test/helper.go (1)
109-116: Metrics server not included in teardowns.The metrics server is started at line 116 but
stopMetricsServeris not added to theteardownsslice. This leaves the metrics server goroutine running and port 8080 bound after tests complete.🔧 Suggested fix
helper.teardowns = []func() error{ helper.CleanDB, jwkMockTeardown, helper.stopAPIServer, + helper.stopMetricsServer, helper.teardownEnv, }pkg/config/config.go (1)
45-64: AddShutdown.ReadFiles()to theApplicationConfig.ReadFiles()method for consistency.
Shutdown.ReadFiles()is missing from the configuration loading inReadFiles(), even though all other config types (Server, Database, OCM, Metrics, Logging) are included. For consistency, it should be added to the list, similar to Server, Metrics, and Logging which also returnnil.cmd/hyperfleet-api/server/api_server.go (1)
140-155: Remove database close from Start() method; it conflicts with graceful shutdown orchestration.The
Start()method closes the database connection afterServe()returns (lines 152-154). During graceful shutdown,apiServer.Stop(ctx)is called fromcmd.go:109, which triggers HTTP server shutdown and causesServe()to return. This immediately closes the database. Subsequently,cmd.go:124callsenvironments.Environment().Teardown(), which closes the database again. This results in a double-close of the SessionFactory during every graceful shutdown, violating the code's own comment that this must be called "exactly once during an app's lifetime."Move database cleanup responsibility entirely to the shutdown orchestrator in
cmd.go(already callingTeardown()at line 124) and remove the redundant close from theStart()method.
🧹 Nitpick comments (1)
pkg/config/shutdown.go (1)
33-41: Silent failure on invalidSHUTDOWN_TIMEOUTvalue.If the environment variable contains an invalid duration string, the parse error is silently ignored and the default/flag value is retained. This could lead to unexpected behavior if an operator misconfigures the timeout (e.g.,
SHUTDOWN_TIMEOUT=20without a unit).Consider logging a warning when parsing fails so operators are aware of the misconfiguration.
♻️ Suggested improvement
+import ( + "context" + "os" + "time" + + "github.com/openshift-hyperfleet/hyperfleet-api/pkg/logger" + "github.com/spf13/pflag" +) func (s *ShutdownConfig) BindEnv() { if val := os.Getenv("SHUTDOWN_TIMEOUT"); val != "" { if d, err := time.ParseDuration(val); err == nil { s.Timeout = d + } else { + logger.Warn(context.Background(), "Invalid SHUTDOWN_TIMEOUT value, using default", "value", val, "error", err) } } }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
AGENTS.mdcmd/hyperfleet-api/servecmd/cmd.gocmd/hyperfleet-api/server/api_server.gocmd/hyperfleet-api/server/metrics_server.gocmd/hyperfleet-api/server/server.godocs/deployment.mdpkg/config/config.gopkg/config/shutdown.gopkg/config/shutdown_test.gotest/helper.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/config/shutdown_test.go
- cmd/hyperfleet-api/servecmd/cmd.go
- docs/deployment.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use Go 1.24.9 with FIPS-compliant crypto enabled (CGO_ENABLED=1, GOEXPERIMENT=boringcrypto)
Use structured logging via logger.NewOCMLogger(ctx) with context-aware log fields including operation ID, account ID, and transaction ID
Files:
cmd/hyperfleet-api/server/server.gocmd/hyperfleet-api/server/api_server.gotest/helper.gopkg/config/shutdown.gocmd/hyperfleet-api/server/metrics_server.gopkg/config/config.go
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/serve/**/*.go : API server runs on port 8000, metrics endpoint on 8080, health check on 8083; these are configurable via CLI flags
Applied to files:
cmd/hyperfleet-api/server/server.gocmd/hyperfleet-api/server/api_server.gotest/helper.gocmd/hyperfleet-api/server/metrics_server.goAGENTS.md
📚 Learning: 2026-01-13T04:16:08.743Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/middleware/masking.go:28-36
Timestamp: 2026-01-13T04:16:08.743Z
Learning: In the openshift-hyperfleet/hyperfleet-api repository, NewOCMLogger(ctx) is deprecated. Switch to the slog-based logger and use logger.With(ctx, ...) to produce structured, context-aware logs. Replace occurrences of NewOCMLogger(ctx) with logger.With(ctx, ...). Update imports to use the new logger and adjust any tests to expect structured logging behavior.
Applied to files:
cmd/hyperfleet-api/server/server.gocmd/hyperfleet-api/server/api_server.gotest/helper.gopkg/config/shutdown.gocmd/hyperfleet-api/server/metrics_server.gopkg/config/config.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Health check endpoint at GET /healthcheck returns 200 OK when database is accessible; used by Kubernetes liveness/readiness probes
Applied to files:
test/helper.gocmd/hyperfleet-api/server/metrics_server.goAGENTS.md
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Prometheus metrics are exposed at GET /metrics endpoint for monitoring; configure metric collection via --label-metrics-inclusion-duration
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T09:40:10.138Z
Learnt from: yasun1
Repo: openshift-hyperfleet/hyperfleet-api PR: 31
File: pkg/errors/errors.go:9-9
Timestamp: 2026-01-08T09:40:10.138Z
Learning: In openshift-hyperfleet/hyperfleet-api, pkg/api/openapi is a generated directory (excluded in .gitignore) created by `make generate` using openapi-generator-cli via Docker; linters that run before code generation will report false-positive import errors for this package.
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : HTTP handlers must follow the pattern: parse request → call service/DAO → handle errors → send response with appropriate HTTP status codes
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Accept resource creation/update/delete requests via REST endpoints, validate, and persist to PostgreSQL
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/api/**/*.go : Embed OpenAPI specification at compile time using Go 1.16+ //go:embed directive for self-contained binary deployment
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to cmd/hyperfleet/migrate/**/*.go : Use GORM AutoMigrate at startup via ./bin/hyperfleet-api migrate command; migrations are idempotent and track applied migrations
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/{handlers,services,dao}/**/*.go : Errors must use structured ServiceError type from pkg/errors/ with HttpCode, Code, and Reason fields; send via errors.SendError(w, r, err)
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
📚 Learning: 2026-01-08T01:23:20.647Z
Learnt from: CR
Repo: openshift-hyperfleet/hyperfleet-api PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-08T01:23:20.647Z
Learning: Applies to pkg/handlers/**/*.go : Serve resource data to Sentinel via GET /{resourceType} endpoints with pagination and filtering support
Applied to files:
cmd/hyperfleet-api/server/metrics_server.go
🧬 Code graph analysis (4)
cmd/hyperfleet-api/server/api_server.go (1)
pkg/telemetry/otel.go (1)
Shutdown(67-72)
test/helper.go (5)
pkg/db/session.go (1)
SessionFactory(12-20)pkg/config/config.go (1)
ApplicationConfig(15-22)cmd/hyperfleet-api/environments/types.go (1)
ApplicationConfig(32-34)cmd/hyperfleet-api/server/server.go (1)
Server(13-18)test/factories/factory.go (1)
Factories(9-10)
cmd/hyperfleet-api/server/metrics_server.go (3)
pkg/health/handlers.go (1)
NewHandler(19-23)cmd/hyperfleet-api/environments/types.go (1)
Database(36-38)pkg/db/session.go (1)
SessionFactory(12-20)
pkg/config/config.go (1)
pkg/config/shutdown.go (2)
ShutdownConfig(15-17)NewShutdownConfig(19-23)
🔇 Additional comments (15)
test/helper.go (2)
56-68: LGTM!The
Helperstruct cleanly removes theHealthCheckServerfield, aligning with the PR's consolidation of health endpoints onto the metrics server.
157-179: LGTM!The
Stopcalls now correctly passcontext.Background()to match the updatedServerinterface signature. Usingcontext.Background()is appropriate for test teardown where there's no parent context or timeout requirement.AGENTS.md (3)
347-352: LGTM!Documentation accurately reflects the consolidation of health endpoints onto the metrics server and the new graceful shutdown configuration. The
--shutdown-timeoutflag withSHUTDOWN_TIMEOUTenvironment variable is clearly documented.
690-691: LGTM!Server configuration section correctly updated to show all three endpoints (
/metrics,/healthz,/readyz) served on port 8080.
775-778: LGTM!Health endpoints documentation is clear and accurate:
/healthzfor liveness (process alive check)/readyzfor readiness (database connection check)This aligns with Kubernetes probe conventions and the implementation in
pkg/health/handlers.go.pkg/config/shutdown.go (2)
10-13: LGTM!The 20-second default shutdown timeout is a reasonable choice that provides adequate time for connection draining while not extending shutdown unnecessarily.
15-23: LGTM!Clean struct definition and constructor following the established config pattern in the codebase.
pkg/config/config.go (3)
15-22: LGTM!The
Shutdownfield is correctly added toApplicationConfig, replacing the removedHealthCheckconfiguration. The struct maintains consistency with other config fields.
24-33: LGTM!Constructor properly initializes
ShutdownviaNewShutdownConfig(), ensuring the default timeout is set.
35-43: LGTM!The
--shutdown-timeoutflag is correctly wired throughc.Shutdown.AddFlags(flagset).cmd/hyperfleet-api/server/metrics_server.go (2)
25-29: LGTM!Health endpoints are correctly integrated:
/healthzfor liveness probe (process alive)/readyzfor readiness probe (database connectivity)- Handler properly initialized with
SessionFactoryfor database health checks- HTTP methods correctly restricted to GET
This consolidates health checks onto the metrics server as intended. Based on learnings, this replaces the previous
/healthcheckendpoint on port 8083.
77-79: LGTM!The
Stopmethod correctly accepts a context and passes it tohttp.Server.Shutdown(ctx), enabling graceful shutdown with timeout control from the caller.cmd/hyperfleet-api/server/server.go (2)
13-18: LGTM! Context-aware Stop signature enables proper graceful shutdown.Adding
context.Contextto theStopmethod allows callers to control shutdown timeouts and cancellation, which is the correct pattern for graceful shutdown in Go.
27-34: Correct handling of server closed error.The
checkfunction appropriately filters outhttp.ErrServerClosed, which is expected during graceful shutdown and should not cause an exit.cmd/hyperfleet-api/server/api_server.go (1)
157-158: LGTM! Context is correctly propagated to HTTP server shutdown.The implementation properly passes the context to
httpServer.Shutdown(ctx), allowing the graceful shutdown timeout configured viaSHUTDOWN_TIMEOUTto be respected. This is consistent with similar patterns inpkg/telemetry/otel.go:Shutdown.
| initialDelaySeconds: 30 | ||
| periodSeconds: 10 | ||
| path: /healthz | ||
| port: metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest separating health endpoints from the metrics port and adopting a clear three-port layout:
- Port 8000: API server
- Port 8080: Metrics server (/metrics only, for Prometheus)
- Port 8081: Health server (/healthz, /readyz, for kubelet)
Benefit:
- Single responsibility: Each port serves a well-defined purpose, which makes the runtime model easier to reason about and maintain.
- Failure isolation: Metrics collection issues (e.g., slow scrapes or handler contention) won’t impact health probes, and probe failures won’t interfere with observability.
- Network policy clarity: Clean separation allows tighter and more explicit network policies (Prometheus → metrics, kubelet → health), reducing accidental coupling.
The current two-port setup is functional, but the three-port architecture provides clearer separation of concerns with minimal additional complexity and better long-term maintainability.
| } | ||
| } else { | ||
| checks["database"] = "ok" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, there is a potential risk:
Network partition between application and database:
- CheckConnection() attempts TCP connection to database
- Network is unreachable (e.g., security group change, AZ failure)
- TCP connection blocks until system timeout (~75-120s on Linux)
- Kubelet's readiness probe times out after 3s and closes HTTP connection
- But the handler goroutine continues running for 120s waiting for TCP timeout
- Next probe (10s later) creates another leaked goroutine
- Accumulates ~12 leaked goroutines per 2 minutes during network issues
Impact
- Resource leak: Goroutines accumulate, potentially causing memory pressure
- Connection pool exhaustion: Each leaked goroutine may hold a DB connection attempt
- Pod marked NotReady: Service degradation when all pods affected
Summary
/healthz,/readyz) on metrics server port 8080SHUTDOWN_TIMEOUT(default 20s)signal.NotifyContextfor SIGTERM/SIGINT handlingTest plan
/healthzreturns 200 OK on port 8080/readyzreturns 200 OK when ready, 503 during shutdownSIGTERMsignalSHUTDOWN_TIMEOUTenvironment variable is respectedmake testandmake lintSummary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.