User Story: In-Cluster Remediation System #27

@orchide

Overview

As a DevOps engineer using OpsCtrl, I want the in-cluster daemon to automatically remediate known issues so that common problems are fixed without manual intervention, reducing MTTR and on-call burden.


Actors

  • Daemon - Runs in client cluster, monitors pods, reports incidents, executes remediations
  • Backend - Stores incidents, runs LLM diagnosis, manages remediation queue
  • User - Receives Slack notifications, can trigger manual remediation

User Stories

1. Automatic Remediation for Known Issues

As a DevOps engineer
I want the daemon to automatically fix common issues (like CrashLoopBackOff due to OOM)
So that I don't get paged for problems with known solutions

Acceptance Criteria:

  • Daemon polls backend for pending remediations every 15-30 seconds
  • Only predefined safe actions execute automatically:
    • restart_pod - Delete pod to trigger restart
    • scale_deployment - Adjust replica count
  • Remediation status tracked: pending → running → succeeded/failed
  • Incident updated with remediation outcome
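The status lifecycle in these criteria can be sketched as a small transition table. This is a minimal sketch, not the backend's actual implementation: the `approved` state is taken from the approval story and the data model, and the names `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical.

```typescript
// Status values as listed in the data model of this issue.
type RemediationStatus = "pending" | "approved" | "running" | "succeeded" | "failed";

// Assumed transitions: auto-executable remediations go pending -> running,
// approval-gated ones go pending -> approved -> running.
const ALLOWED_TRANSITIONS: Record<RemediationStatus, RemediationStatus[]> = {
  pending: ["approved", "running"],
  approved: ["running"],
  running: ["succeeded", "failed"],
  succeeded: [], // terminal
  failed: [],    // terminal; a retry is a new record, not a transition
};

// Returns true when moving from `from` to `to` is permitted by the table.
function canTransition(from: RemediationStatus, to: RemediationStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

Treating `succeeded` and `failed` as terminal keeps each attempt immutable once finished, which fits the separate-record-per-attempt requirement in story 4.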

2. Manual Remediation via Slack

As a DevOps engineer
I want to click "Fix Now" in a Slack notification
So that I can trigger a remediation without logging into any dashboard

Acceptance Criteria:

  • Slack notification includes "Fix Now" button for actionable incidents
  • Button click creates remediation record with status: pending
  • Daemon picks up and executes on next poll
  • Slack thread updated with remediation result

3. Approval Required for Destructive Actions

As a platform engineer
I want destructive actions to require approval
So that LLM-suggested commands don't accidentally break production

Acceptance Criteria:

  • These actions require approval before execution:
    • rollback_deployment
    • update_resource_limits
    • execute_command and custom (any arbitrary or LLM-generated command)
  • Approval can be granted via Slack button or dashboard
  • Remediation stays in pending until approved, then moves to approved
  • Daemon executes a remediation only if it has been approved or its type is auto-executable (auto-approved)
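One way to express the approval gate is a check the daemon (or backend) runs before executing anything. This is a sketch under assumptions: the type names follow the Remediation Types table in the Technical Design, and `requiresApproval` and `isExecutable` are hypothetical helper names.

```typescript
type RemediationType =
  | "restart_pod"
  | "scale_deployment"
  | "rollback_deployment"
  | "update_resource_limits"
  | "execute_command"
  | "custom";

type RemediationStatus = "pending" | "approved" | "running" | "succeeded" | "failed";

// Types that may run without a human approving them first.
const AUTO_EXECUTABLE: RemediationType[] = ["restart_pod", "scale_deployment"];

// Anything not on the auto-execute list needs explicit approval.
function requiresApproval(type: RemediationType): boolean {
  return AUTO_EXECUTABLE.indexOf(type) === -1;
}

// A remediation is executable when it is approved, or when it is still
// pending and its type is auto-executable.
function isExecutable(r: { type: RemediationType; status: RemediationStatus }): boolean {
  if (r.status === "approved") return true;
  return r.status === "pending" && !requiresApproval(r.type);
}
```

Defaulting unknown or new types to "requires approval" (an allowlist rather than a denylist) means a type added later cannot auto-execute by accident.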

4. Remediation History & Audit Trail

As a platform engineer
I want to see all remediation attempts for an incident
So that I can understand what was tried and why it failed

Acceptance Criteria:

  • Each remediation attempt stored as separate record
  • Tracks: command executed, output, exit code, duration, executor
  • Incident shows remediationCount and lastRemediationAt
  • Failed attempts don't block new attempts (up to maxAttempts)
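The counters and the retry chain could be derived from the attempt records roughly as follows. This is a minimal sketch: the `Attempt` shape is a hypothetical subset of the Remediation entity, and it assumes `lastRemediationAt` means the latest `startedAt` among attempts, which the issue does not specify.

```typescript
// Hypothetical subset of the Remediation entity used for history tracking.
interface Attempt {
  id: string;
  attemptNumber: number;
  maxAttempts: number;
  startedAt: Date | null;
  previousAttemptId: string | null;
}

// Derives the incident counters from its attempt records.
// Assumption: lastRemediationAt is the latest startedAt among all attempts.
function summarizeAttempts(
  attempts: Attempt[]
): { remediationCount: number; lastRemediationAt: Date | null } {
  let last: Date | null = null;
  for (const a of attempts) {
    if (a.startedAt !== null && (last === null || a.startedAt.getTime() > last.getTime())) {
      last = a.startedAt;
    }
  }
  return { remediationCount: attempts.length, lastRemediationAt: last };
}

// Builds the next attempt in the retry chain, or returns null once
// maxAttempts has been reached.
function nextAttempt(previous: Attempt, newId: string): Attempt | null {
  if (previous.attemptNumber >= previous.maxAttempts) return null;
  return {
    id: newId,
    attemptNumber: previous.attemptNumber + 1,
    maxAttempts: previous.maxAttempts,
    startedAt: null,
    previousAttemptId: previous.id, // links the retry chain
  };
}
```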

Technical Design

Authentication

Daemon uses existing cluster-scoped refresh tokens (device flow). No new auth mechanism needed.

API Endpoints

GET  /clusters/:clusterId/remediations?status=pending,approved
     → Returns remediations ready for execution

POST /clusters/:clusterId/remediations/:id/claim
     → Marks as 'running', sets startedAt, returns full remediation details

POST /clusters/:clusterId/remediations/:id/complete
     → Body: { status, outputLog, errorMessage, exitCode }
     → Updates remediation, updates incident counters
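A thin daemon-side client for these three endpoints might look like the sketch below. The transport is injected so the sketch does not assume any particular HTTP library; `Http`, `CompletionBody`, and `makeRemediationClient` are hypothetical names, and the nullability of `errorMessage` is an assumption.

```typescript
// Injected transport: the caller supplies auth headers, base URL, and JSON handling.
type Http = (method: "GET" | "POST", path: string, body?: unknown) => Promise<unknown>;

// Body of POST .../complete, with fields as listed in this issue.
interface CompletionBody {
  status: "succeeded" | "failed";
  outputLog: string;
  errorMessage: string | null; // assumed nullable on success
  exitCode: number;
}

function makeRemediationClient(http: Http, clusterId: string) {
  const base = "/clusters/" + encodeURIComponent(clusterId) + "/remediations";
  return {
    // GET remediations that are ready for execution.
    listReady: () => http("GET", base + "?status=pending,approved"),
    // Mark a remediation as running and fetch its full details.
    claim: (id: string) => http("POST", base + "/" + encodeURIComponent(id) + "/claim"),
    // Report the outcome of an executed remediation.
    complete: (id: string, body: CompletionBody) =>
      http("POST", base + "/" + encodeURIComponent(id) + "/complete", body),
  };
}
```

Binding `clusterId` once at construction keeps every request scoped to the daemon's own cluster, in line with the cluster-scope guardrail.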

Remediation Types (Predefined & Safe)

Type                   | Auto-Execute  | Description
-----------------------|---------------|------------------------------------------------
restart_pod            | Yes           | kubectl delete pod <name>
scale_deployment       | Yes           | kubectl scale deployment <name> --replicas=<n>
rollback_deployment    | No (approval) | kubectl rollout undo deployment <name>
update_resource_limits | No (approval) | Patch deployment with new limits
execute_command        | No (approval) | Run arbitrary command
custom                 | No (approval) | LLM-suggested custom fix
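For the three types with a fixed kubectl form, the daemon could build an argument vector rather than a shell string, so parameter values are never interpreted by a shell. This is a sketch, assuming a hypothetical `KubectlParams` shape and adding the standard `-n <namespace>` flag, which the table does not show. Types without a fixed command return `null` here.

```typescript
// Hypothetical parameter shape; the data model stores these in `parameters`.
interface KubectlParams {
  name: string;
  namespace: string;
  replicas?: number;
}

// Returns kubectl arguments (without the leading "kubectl") for types that
// map to a fixed command, or null for types this sketch does not handle.
function buildKubectlArgs(type: string, p: KubectlParams): string[] | null {
  switch (type) {
    case "restart_pod":
      return ["delete", "pod", p.name, "-n", p.namespace];
    case "scale_deployment":
      // Reject missing, negative, or non-integer replica counts.
      if (p.replicas === undefined || p.replicas < 0 || Math.floor(p.replicas) !== p.replicas) {
        throw new Error("scale_deployment needs a non-negative integer replica count");
      }
      return ["scale", "deployment", p.name, "--replicas=" + p.replicas, "-n", p.namespace];
    case "rollback_deployment":
      return ["rollout", "undo", "deployment", p.name, "-n", p.namespace];
    default:
      // update_resource_limits, execute_command, custom: no fixed form.
      return null;
  }
}
```

The resulting array can be passed to a process-spawning call that takes arguments separately, which avoids quoting issues with pod or deployment names.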

Polling Flow

┌─────────────────────────────────────────────────────────────┐
│ Daemon (in client cluster)                                  │
│                                                             │
│  every 15-30s:                                              │
│    1. GET /clusters/:id/remediations?status=pending,approved│
│    2. For each remediation:                                 │
│       a. POST /claim (mark as running)                      │
│       b. Execute command based on type                      │
│       c. POST /complete (report result)                     │
│    3. Sleep                                                 │
└─────────────────────────────────────────────────────────────┘
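A single iteration of this loop can be sketched as below, with the API and the executor injected so the logic is testable without a cluster. The names `Api`, `Executor`, `pollOnce`, and `nextDelayMs` are hypothetical. Two behaviours are assumptions not stated in the issue: a failed claim is skipped (for example, if the remediation was already claimed), and an executor that throws is reported as `failed` with exit code -1.

```typescript
interface Remediation { id: string; type: string; }

interface ExecResult {
  status: "succeeded" | "failed";
  outputLog: string;
  errorMessage: string | null;
  exitCode: number;
}

interface Api {
  listReady(): Promise<Remediation[]>;
  claim(id: string): Promise<Remediation>;
  complete(id: string, result: ExecResult): Promise<void>;
}

type Executor = (r: Remediation) => Promise<ExecResult>;

// Runs steps 1 and 2 of the flow once; returns how many remediations were handled.
async function pollOnce(api: Api, execute: Executor): Promise<number> {
  const ready = await api.listReady();
  let handled = 0;
  for (const item of ready) {
    let claimed: Remediation;
    try {
      claimed = await api.claim(item.id); // a. mark as running
    } catch (err) {
      continue; // assumption: claim failed, so skip this remediation
    }
    let result: ExecResult;
    try {
      result = await execute(claimed); // b. execute based on type
    } catch (err) {
      result = { status: "failed", outputLog: "", errorMessage: String(err), exitCode: -1 };
    }
    await api.complete(claimed.id, result); // c. report result
    handled++;
  }
  return handled;
}

// Step 3: a random delay in the 15-30 second window.
function nextDelayMs(): number {
  return 15000 + Math.floor(Math.random() * 15001);
}
```

Always calling `complete`, even when execution throws, avoids leaving a remediation stuck in `running`.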

Trigger Points

  1. Auto-trigger: When incident created with autoRemediationEnabled: true and remediation type is auto-executable
  2. Slack button: User clicks "Fix Now" → POST /incidents/:id/remediate
  3. Dashboard: User clicks remediate button → POST /incidents/:id/remediate
  4. API: Direct POST to create remediation

Safety Guardrails

  1. Type allowlist: Only predefined types, no arbitrary commands without approval
  2. Max attempts: Default 3 attempts per incident, prevents infinite loops
  3. Cooldown: Minimum time between remediation attempts (e.g., 60s)
  4. Cluster scope: Daemon can only fetch/execute remediations for its own cluster
  5. Audit logging: All remediation actions logged with full context
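Guardrails 2 and 3 could be enforced by a single check before a new remediation is created. This is a sketch, not a prescribed design: `GuardrailInput`, `checkGuardrails`, and the reason strings are hypothetical, and the defaults in the comments (3 attempts, 60 s) come from the list above.

```typescript
interface GuardrailInput {
  attemptsSoFar: number;          // remediation attempts already made for the incident
  maxAttempts: number;            // default 3 per incident
  lastAttemptAtMs: number | null; // epoch ms of the previous attempt, if any
  nowMs: number;
  cooldownMs: number;             // e.g. 60000
}

type GuardrailDecision = { allowed: true } | { allowed: false; reason: string };

// Checks the attempt limit first, then the cooldown window.
function checkGuardrails(g: GuardrailInput): GuardrailDecision {
  if (g.attemptsSoFar >= g.maxAttempts) {
    return { allowed: false, reason: "max_attempts_reached" };
  }
  if (g.lastAttemptAtMs !== null && g.nowMs - g.lastAttemptAtMs < g.cooldownMs) {
    return { allowed: false, reason: "cooldown_active" };
  }
  return { allowed: true };
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.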

Data Model

Remediation Entity (existing)

{
  id: uuid,
  incidentId: uuid,        // Links to incident
  clusterId: uuid,         // For polling scope

  type: RemediationType,   // restart_pod, scale_deployment, etc.
  status: RemediationStatus, // pending, approved, running, succeeded, failed
  source: RemediationSource, // auto, manual, scheduled

  command: string,         // Actual command to execute
  parameters: jsonb,       // Type-specific params (podName, replicas, etc.)

  approvedAt: timestamp,
  approvedBy: uuid,

  startedAt: timestamp,
  completedAt: timestamp,
  durationMs: number,

  executedBy: string,      // daemon-id or user-id
  targetPod: string,
  targetNamespace: string,

  outputLog: text,
  errorMessage: text,
  exitCode: number,

  attemptNumber: number,
  maxAttempts: number,
  previousAttemptId: uuid  // Links retry chain
}

Future Enhancements

  • SSE/WebSocket: Real-time push instead of polling
  • Runbooks: Pre-approved multi-step remediation sequences
  • Dry-run mode: Preview what would happen before executing
  • Rollback: Automatic rollback if remediation makes things worse
  • Metrics: Track remediation success rates per type/cluster

Dependencies

  • Incident entity with autoRemediationEnabled field ✅
  • Remediation entity ✅
  • Slack integration (next task)
  • Daemon polling implementation (client-side)
