
Automated Pod Hard Reset Policy for Persistent CrashLoopBackOff #5734

@merihilgor

1. Summary

This KEP proposes a new, optional policy field within the Deployment workload API to enable automatic Pod deletion and recreation (a "Hard Reset") after the container's native CrashLoopBackOff restarts fail repeatedly on the same node.

The primary goal is to improve the self-healing capacity of Kubernetes for long-running workloads by automatically resolving persistent, node-local failures (e.g., stuck volume mounts, non-transient resource contention) without requiring external monitoring or manual intervention (kubectl delete pod).

2. Motivation

2.1 The Problem with Current Behavior

The default restartPolicy: Always relies on the Kubelet performing in-place container restarts with exponential backoff (capped at 5 minutes). While excellent for handling transient application errors, this mechanism fails when the root cause is tied to the Pod's local environment on the current Node:

  • Node-Local Persistent Errors: The Kubelet is unable to resolve issues like a corrupted emptyDir or a host-specific lock.

  • Indefinite Resource Consumption: The Pod remains in a non-functional state, consuming resources and generating restart events indefinitely, throttled only by the 5-minute backoff cap.

  • Requires Manual Intervention: The only reliable fix is a human operator running kubectl delete pod <name>, which forces the Deployment Controller to schedule a replacement Pod on a potentially different, healthy node.

2.2 User Story (Operator Persona)

As a Cluster Operator, I need Kubernetes to automatically perform a "Hard Reset" (Pod delete and reschedule) when a Pod has been stuck in a repeated CrashLoopBackOff cycle for a significant duration, so that I don't have to write custom monitoring and automation to handle obvious node-local resource failures.

3. Goals

  1. Introduce a new policy field to the Deployment API to define the hard reset condition.

  2. Define sensible default values for this policy that trigger a Hard Reset for any Deployment that does not specify the field.

  3. Ensure the Hard Reset action is to delete the Pod (not the Deployment), forcing a fresh reschedule via the existing Deployment Controller logic.

  4. Prevent scheduler overload by implementing an internal cooldown between Hard Resets for the same Deployment.

4. Proposal Detail

4.1 New API Field (Alpha)

Introduce a new optional field, failurePolicy, to the Deployment.spec.template.spec (mirroring the podFailurePolicy concept in Jobs, but with different actions):

YAML
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers: [...]
      # New optional policy field:
      failurePolicy:
        # Action to take when MaxRestartsOnNode is exceeded.
        action: DeletePod
        # Hard limit on the number of container restarts on a single node before action is taken.
        # This counter should be reset on successful container uptime (e.g., 10 minutes).
        maxRestartsOnNode: 7 
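
For illustration only, the corresponding API types might look roughly like the following Go sketch; all type names, field names, and defaults here are provisional, not a settled API shape.

Go
// Provisional sketch of the proposed types; names and defaults are illustrative.
package apps

// FailurePolicyAction is the action taken when the restart threshold is exceeded.
type FailurePolicyAction string

const (
	// DeletePodAction deletes the Pod so it is rescheduled, likely on another node.
	DeletePodAction FailurePolicyAction = "DeletePod"
)

// FailurePolicy is the optional policy embedded in the Pod template spec.
type FailurePolicy struct {
	// Action to take when MaxRestartsOnNode is exceeded. Defaults to DeletePod.
	Action FailurePolicyAction `json:"action,omitempty"`

	// MaxRestartsOnNode is the hard limit on container restarts on a single node
	// before Action is taken; the counter resets after sustained successful
	// uptime (e.g., 10 minutes). Defaults to 7.
	MaxRestartsOnNode *int32 `json:"maxRestartsOnNode,omitempty"`
}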

4.2 Proposed Automatic Default Behavior (The Core Enhancement)

To meet the goal of working by default without explicit configuration, the Deployment Controller will observe the Kubelet-reported Pod status for all Pods it manages; a minimal sketch of the resulting check appears after the list below.

  • Condition Check: The Controller will check the Pod.status.containerStatuses[*].restartCount.

  • Default Threshold: If the restartCount for any container in a Pod exceeds 7 (the default Job.spec.backoffLimit of 6, plus 1), AND the Pod has not achieved a "Ready" state within the preceding 10 minutes, the Hard Reset is triggered.

  • Action: The Deployment Controller performs a direct DELETE operation on the Pod object.

    • Result: The deletion triggers reconciliation of the Deployment's underlying ReplicaSet, which immediately notices one fewer Pod than desired and creates a new, replacement Pod. The replacement is scheduled by the Kubernetes Scheduler, likely onto a different node.
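
The sketch below illustrates this default check and delete action, assuming a client-go clientset; the helper names needsHardReset and hardReset are hypothetical, and the real controller would run this logic inside its informer-driven reconciliation loop rather than as free-standing functions.

Go
package controller

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	defaultMaxRestartsOnNode = 7
	readyGracePeriod         = 10 * time.Minute
)

// needsHardReset reports whether any container has exceeded the restart
// threshold while the Pod has not been Ready recently.
func needsHardReset(pod *corev1.Pod, now time.Time) bool {
	exceeded := false
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount > defaultMaxRestartsOnNode {
			exceeded = true
			break
		}
	}
	if !exceeded {
		return false
	}
	// Skip the reset if the Pod is currently Ready, or its Ready condition
	// last transitioned within the grace period (approximating "was Ready
	// within the preceding 10 minutes").
	for _, cond := range pod.Status.Conditions {
		if cond.Type != corev1.PodReady {
			continue
		}
		if cond.Status == corev1.ConditionTrue ||
			now.Sub(cond.LastTransitionTime.Time) < readyGracePeriod {
			return false
		}
	}
	return true
}

// hardReset deletes the Pod; the Deployment's ReplicaSet then creates a
// replacement, which the Scheduler places on a (likely different) node.
func hardReset(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) error {
	return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}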

4.3 Cooldown and Throttling (Resilience)

To prevent a cascade of Hard Resets from overwhelming the control plane (the "Thundering Herd" on the Scheduler), the following safeguards apply (a throttling sketch follows this list):

  • The Deployment Controller will implement a Deployment-wide cooldown of 2 minutes between Hard Resets for a single Deployment.

  • If a Pod is flagged for Hard Reset during this cooldown period, the action is deferred until the cooldown expires.
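
A minimal sketch of such a per-Deployment cooldown, assuming an in-memory throttle keyed by "namespace/name" (the type and method names are illustrative, not part of the proposal):

Go
package controller

import (
	"sync"
	"time"
)

const hardResetCooldown = 2 * time.Minute

// resetThrottle tracks the time of the last Hard Reset per Deployment
// (keyed by "namespace/name") and enforces the cooldown.
type resetThrottle struct {
	mu        sync.Mutex
	lastReset map[string]time.Time
}

// tryAcquire returns true if a Hard Reset may proceed for the Deployment now;
// otherwise the caller should requeue the Pod until the cooldown expires.
func (t *resetThrottle) tryAcquire(deploymentKey string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastReset[deploymentKey]; ok && now.Sub(last) < hardResetCooldown {
		return false
	}
	if t.lastReset == nil {
		t.lastReset = map[string]time.Time{}
	}
	t.lastReset[deploymentKey] = now
	return true
}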

5. Test Plan (Alpha)

  1. Unit Tests: Verify that the Deployment Controller correctly increments and resets the failure counter based on Kubelet events.

  2. Integration Tests:

    • Deploy a simple application designed to crash instantly (e.g., exit 1); a sketch of such a workload follows this list.

    • Verify the container's restart count reaches 7.

    • Verify the Deployment Controller automatically deletes the Pod and creates a new one.

    • Verify a Hard Reset is triggered when the new failurePolicy field is omitted (default behavior) or explicitly set to the DeletePod action, and is not triggered when a different action is specified.

    • Verify the 2-minute throttling mechanism correctly limits Hard Resets.
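
For reference, the crash-looping workload in the first step could be constructed along these lines; this is an illustrative sketch, and the real test would use the integration test framework's helpers rather than building objects by hand.

Go
package integration

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crashLoopDeployment returns a single-replica Deployment whose container
// exits immediately, driving it into CrashLoopBackOff.
func crashLoopDeployment(ns string) *appsv1.Deployment {
	labels := map[string]string{"app": "crashloop-test"}
	replicas := int32(1)
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "crashloop-test", Namespace: ns},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:    "crasher",
						Image:   "busybox",
						Command: []string{"sh", "-c", "exit 1"},
					}},
				},
			},
		},
	}
}

The test would then poll the Pod's status until restartCount reaches 7, record the Pod's UID, and assert that a Pod with a different UID replaces it within the expected window.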

6. Graduation Criteria

Alpha -> Beta

  • API field is merged into Deployment as Alpha.

  • Successful implementation of the Deployment Controller logic.

  • Positive feedback from at least two non-owner contributors/end-users.

  • Controller-level throttling mechanism is proven stable in tests.

Beta -> GA

  • API changes are proven stable for at least two releases.

  • Extensive end-to-end testing in large-scale clusters, demonstrating the automatic healing of node-local failures without increasing scheduler burden.

  • Agreement with SIG-Node on Kubelet interaction and event monitoring.

7. Risks and Mitigation

Risk | Mitigation
-- | --
Control Plane Overload | Implementation of a 2-minute per-Deployment cooldown for Hard Resets. This prioritizes stability over immediate recovery.
Data Loss | A Hard Reset deletes the Pod, terminating any data in emptyDir. This risk is mitigated by clear documentation that the policy is intended to resolve otherwise unrecoverable failures.
Masking Application Bugs | Hard Resets are logged as a clear event. Operators need to be educated that repeated Hard Resets (e.g., over multiple hours) still indicate a persistent application bug requiring a code fix.
Conflict with Existing Liveness Probes | The policy must only trigger on container termination/crash events, not on Liveness Probe failures (which already result in an in-place restart).
