1. Summary
This KEP proposes a new, optional policy field within the Deployment workload API to enable automatic Pod deletion and recreation (a "Hard Reset") after the container's native CrashLoopBackOff restarts fail repeatedly on the same node.
The primary goal is to improve the self-healing capacity of Kubernetes for long-running workloads by automatically resolving persistent, node-local failures (e.g., stuck volume mounts, non-transient resource contention) without requiring external monitoring or manual intervention (kubectl delete pod).
2. Motivation
2.1 The Problem with Current Behavior
The default `restartPolicy: Always` relies on the Kubelet performing in-place container restarts with exponential backoff (capped at 5 minutes). While excellent for handling transient application errors, this mechanism fails when the root cause is tied to the Pod's local environment on its current Node:
- **Node-Local Persistent Errors:** The Kubelet is unable to resolve issues like a corrupted `emptyDir` or a host-specific lock.
- **Indefinite Resource Consumption:** The Pod remains in a non-functional state, consuming resources and generating restart events indefinitely, limited only by the 5-minute backoff interval.
- **Requires Manual Intervention:** The only reliable fix is a human operator running `kubectl delete pod <name>`, which forces the Deployment Controller to schedule a replacement Pod on a potentially different, healthy node.
2.2 User Story (Operator Persona)
As a Cluster Operator, I need Kubernetes to automatically perform a "Hard Reset" (Pod delete and reschedule) when a Pod has been stuck in a repeated CrashLoopBackOff cycle for a significant duration, so that I don't have to write custom monitoring and automation to handle obvious node-local resource failures.
3. Goals
- Introduce a new policy field to the `Deployment` API to define the Hard Reset condition.
- Define sensible default values for this policy that trigger a Hard Reset for any Deployment that does not specify the field.
- Ensure the Hard Reset action is to delete the Pod (not the Deployment), forcing a fresh reschedule via the existing Deployment Controller logic.
- Prevent scheduler overload by implementing an internal cooldown between Hard Resets for the same Deployment.
4. Proposal Detail
4.1 New API Field (Alpha)
Introduce a new optional field, `failurePolicy`, to `Deployment.spec.template.spec` (mirroring the `podFailurePolicy` concept in Jobs, but with different actions):

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers: [...]
      # New optional policy field:
      failurePolicy:
        # Action to take when maxRestartsOnNode is exceeded.
        action: DeletePod
        # Hard limit on the number of container restarts on a single node
        # before the action is taken. This counter should be reset after
        # sustained successful container uptime (e.g., 10 minutes).
        maxRestartsOnNode: 7
```
4.2 Proposed Automatic Default Behavior (The Core Enhancement)
To meet the goal of working by default without explicit configuration, the Deployment Controller will observe the Kubelet's Pod Status for all Pods it manages.
1. **Condition Check:** The Controller will check `Pod.status.containerStatuses[*].restartCount`.
2. **Default Threshold:** If the `restartCount` for any container in a Pod exceeds 7 (matching the default `Job.spec.backoffLimit` + 1), AND the Pod has not successfully achieved a "Ready" state within the preceding 10 minutes, the Hard Reset is triggered.
3. **Action:** The Deployment Controller performs a direct DELETE operation on the Pod object.
4. **Result:** The deletion event triggers the Deployment Controller's main reconciliation loop, which immediately notices one fewer ready replica than desired and creates a new, replacement Pod. This replacement will be scheduled by the Kubernetes Scheduler, likely on a different node.
4.3 Cooldown and Throttling (Resilience)
To prevent a cascade of Hard Resets from overwhelming the control plane (the "Thundering Herd" on the Scheduler):
The Deployment Controller will implement a Deployment-wide cooldown of 2 minutes between Hard Resets for a single Deployment.
If a Pod is flagged for Hard Reset during this cooldown period, the action is deferred until the cooldown expires.
5. Test Plan (Alpha)
Unit Tests: Verify that the Deployment Controller correctly increments and resets the failure counter based on Kubelet events.
Integration Tests:
- Deploy a simple application designed to crash instantly (e.g., `exit 1`).
- Verify the container's restart count exceeds the default threshold of 7.
- Verify the Deployment Controller automatically deletes the Pod and creates a new one.
- Verify a Hard Reset is triggered only when the new `failurePolicy` field is omitted (default behavior) or set to the proposed action.
- Verify the 2-minute throttling mechanism correctly limits Hard Resets.
6. Graduation Criteria
Alpha -> Beta
- API field is merged into `Deployment` as Alpha.
- Successful implementation of the Deployment Controller logic.
- Positive feedback from at least two non-owner contributors/end-users.
- Controller-level throttling mechanism is proven stable in tests.
Beta -> GA
- API changes are proven stable for at least two releases.
- Extensive end-to-end testing in large-scale clusters, demonstrating the automatic healing of node-local failures without increasing scheduler burden.
- Agreement with SIG Node on Kubelet interaction and event monitoring.
7. Risks and Mitigation