## Overview
As a DevOps engineer using OpsCtrl, I want the in-cluster daemon to automatically remediate known issues so that common problems are fixed without manual intervention, reducing MTTR and on-call burden.
## Actors
- **Daemon**: Runs in the client cluster, monitors pods, reports incidents, executes remediations
- **Backend**: Stores incidents, runs LLM diagnosis, manages the remediation queue
- **User**: Receives Slack notifications, can trigger manual remediation
## User Stories

### 1. Automatic Remediation for Known Issues
As a DevOps engineer
I want the daemon to automatically fix common issues (like CrashLoopBackOff due to OOM)
So that I don't get paged for problems with known solutions
Acceptance Criteria:
- Daemon polls backend for pending remediations every 15-30 seconds
- Only predefined safe actions execute automatically:
  - `restart_pod`: Delete pod to trigger restart
  - `scale_deployment`: Adjust replica count
- Remediation status tracked: `pending` → `running` → `succeeded`/`failed`
- Incident updated with remediation outcome
### 2. Manual Remediation via Slack
As a DevOps engineer
I want to click "Fix Now" in a Slack notification
So that I can trigger a remediation without logging into any dashboard
Acceptance Criteria:
- Slack notification includes "Fix Now" button for actionable incidents
- Button click creates remediation record with `status: pending`
- Daemon picks up and executes on next poll
- Slack thread updated with remediation result
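The "Fix Now" flow can be sketched as a Slack Block Kit message builder. This is illustrative only: the `action_id` (`fix_now`) and the use of `value` to carry the incident ID are assumptions, not the actual OpsCtrl schema.

```python
def build_incident_message(incident_id: str, summary: str) -> dict:
    """Build a Slack message with a "Fix Now" button for an actionable incident.

    The interaction handler that receives the button click would create a
    remediation record with status: pending (hypothetical wiring).
    """
    return {
        "blocks": [
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f":rotating_light: {summary}"},
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Fix Now"},
                        "style": "primary",
                        "action_id": "fix_now",   # assumed handler identifier
                        "value": incident_id,     # echoed back in the interaction payload
                    }
                ],
            },
        ]
    }
```

The incident ID rides along in the button's `value`, so the interaction handler can create the remediation without any extra lookup.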
### 3. Approval Required for Destructive Actions
As a platform engineer
I want destructive actions to require approval
So that LLM-suggested commands don't accidentally break production
Acceptance Criteria:
- These actions require approval before execution:
  - `rollback_deployment`
  - `update_resource_limits`
  - `custom_command` (any LLM-generated command)
- Approval can be granted via Slack button or dashboard
- Remediation stays in `pending` until approved, then moves to `approved`
- Daemon only executes `approved` or auto-approved remediations
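The approval gate above amounts to a small state machine. A minimal sketch, using the type and status names from this spec (the function names themselves are illustrative):

```python
# Types the daemon may execute without human approval (per the spec's table).
AUTO_EXECUTABLE = {"restart_pod", "scale_deployment"}
# Destructive types that must be approved first.
APPROVAL_REQUIRED = {"rollback_deployment", "update_resource_limits", "custom_command"}


def initial_status(remediation_type: str) -> str:
    """Auto-executable types are created ready to run; destructive ones wait."""
    return "approved" if remediation_type in AUTO_EXECUTABLE else "pending"


def approve(status: str) -> str:
    """Grant approval (via Slack button or dashboard): pending → approved."""
    if status != "pending":
        raise ValueError(f"cannot approve remediation in status {status!r}")
    return "approved"


def daemon_may_execute(status: str) -> bool:
    """The daemon only executes approved (including auto-approved) remediations."""
    return status == "approved"
```

Modeling auto-approval as "created directly in `approved`" keeps the daemon's execution check to a single status comparison.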
### 4. Remediation History & Audit Trail
As a platform engineer
I want to see all remediation attempts for an incident
So that I can understand what was tried and why it failed
Acceptance Criteria:
- Each remediation attempt stored as separate record
- Tracks: command executed, output, exit code, duration, executor
- Incident shows `remediationCount` and `lastRemediationAt`
- Failed attempts don't block new attempts (up to `maxAttempts`)
## Technical Design

### Authentication
Daemon uses existing cluster-scoped refresh tokens (device flow). No new auth mechanism needed.
### API Endpoints

```
GET /clusters/:clusterId/remediations?status=pending,approved
  → Returns remediations ready for execution

POST /clusters/:clusterId/remediations/:id/claim
  → Marks as 'running', sets startedAt, returns full remediation details

POST /clusters/:clusterId/remediations/:id/complete
  → Body: { status, outputLog, errorMessage, exitCode }
  → Updates remediation, updates incident counters
```
### Remediation Types (Predefined & Safe)

| Type | Auto-Execute | Description |
|---|---|---|
| `restart_pod` | Yes | `kubectl delete pod <name>` |
| `scale_deployment` | Yes | `kubectl scale deployment <name> --replicas=<n>` |
| `rollback_deployment` | No (approval) | `kubectl rollout undo deployment <name>` |
| `update_resource_limits` | No (approval) | Patch deployment with new limits |
| `execute_command` | No (approval) | Run arbitrary command |
| `custom` | No (approval) | LLM-suggested custom fix |
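The predefined types map to concrete `kubectl` argv lists. A minimal sketch: the parameter keys (`podName`, `deployment`, `replicas`, `namespace`) are assumed names for the entity's jsonb `parameters` field, not its confirmed schema:

```python
def build_command(rtype: str, params: dict) -> list[str]:
    """Map a predefined remediation type to a kubectl argv list."""
    ns = ["-n", params["namespace"]] if "namespace" in params else []
    if rtype == "restart_pod":
        # Deleting the pod lets its controller recreate it (the "restart").
        return ["kubectl", "delete", "pod", params["podName"], *ns]
    if rtype == "scale_deployment":
        return ["kubectl", "scale", "deployment", params["deployment"],
                f"--replicas={params['replicas']}", *ns]
    if rtype == "rollback_deployment":
        return ["kubectl", "rollout", "undo", "deployment",
                params["deployment"], *ns]
    # execute_command / custom carry their own LLM-generated command string
    # and never reach this builder without approval.
    raise ValueError(f"no predefined command for type {rtype!r}")
```

Returning an argv list (rather than a shell string) for the predefined types avoids shell-injection surprises when parameters come from stored records.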
### Polling Flow

```
┌─────────────────────────────────────────────────────────────┐
│ Daemon (in client cluster)                                  │
│                                                             │
│ every 15-30s:                                               │
│   1. GET /clusters/:id/remediations?status=pending,approved │
│   2. For each remediation:                                  │
│      a. POST /claim (mark as running)                       │
│      b. Execute command based on type                       │
│      c. POST /complete (report result)                      │
│   3. Sleep                                                  │
└─────────────────────────────────────────────────────────────┘
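The loop above can be sketched as follows. `api` is a hypothetical client exposing `list_ready`/`claim`/`complete` (the actual daemon's client interface is not specified here), and command execution shells out via `subprocess`:

```python
import random
import subprocess
import time


def poll_interval_s() -> float:
    """Randomized 15-30s interval so daemons don't poll in lockstep."""
    return random.uniform(15.0, 30.0)


def run_once(api, cluster_id: str) -> None:
    """One iteration: fetch ready remediations, claim, execute, report."""
    for rem in api.list_ready(cluster_id):              # step 1
        detail = api.claim(cluster_id, rem["id"])       # step 2a: mark running
        start = time.monotonic()
        proc = subprocess.run(detail["command"], shell=True,
                              capture_output=True, text=True)  # step 2b
        api.complete(cluster_id, rem["id"], {           # step 2c: report result
            "status": "succeeded" if proc.returncode == 0 else "failed",
            "outputLog": proc.stdout + proc.stderr,
            "exitCode": proc.returncode,
            "durationMs": int((time.monotonic() - start) * 1000),
        })


def run_forever(api, cluster_id: str) -> None:
    while True:
        run_once(api, cluster_id)
        time.sleep(poll_interval_s())                   # step 3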
### Trigger Points
- Auto-trigger: When incident created with `autoRemediationEnabled: true` and remediation type is auto-executable
- Slack button: User clicks "Fix Now" → POST /incidents/:id/remediate
- Dashboard: User clicks remediate button → POST /incidents/:id/remediate
- API: Direct POST to create remediation
### Safety Guardrails
- Type allowlist: Only predefined types, no arbitrary commands without approval
- Max attempts: Default 3 attempts per incident, prevents infinite loops
- Cooldown: Minimum time between remediation attempts (e.g., 60s)
- Cluster scope: Daemon can only fetch/execute remediations for its own cluster
- Audit logging: All remediation actions logged with full context
### Data Model

#### Remediation Entity (existing)

```
{
  id: uuid,
  incidentId: uuid,          // Links to incident
  clusterId: uuid,           // For polling scope
  type: RemediationType,     // restart_pod, scale_deployment, etc.
  status: RemediationStatus, // pending, approved, running, succeeded, failed
  source: RemediationSource, // auto, manual, scheduled
  command: string,           // Actual command to execute
  parameters: jsonb,         // Type-specific params (podName, replicas, etc.)
  approvedAt: timestamp,
  approvedBy: uuid,
  startedAt: timestamp,
  completedAt: timestamp,
  durationMs: number,
  executedBy: string,        // daemon-id or user-id
  targetPod: string,
  targetNamespace: string,
  outputLog: text,
  errorMessage: text,
  exitCode: number,
  attemptNumber: number,
  maxAttempts: number,
  previousAttemptId: uuid    // Links retry chain
}
```

## Future Enhancements
- SSE/WebSocket: Real-time push instead of polling
- Runbooks: Pre-approved multi-step remediation sequences
- Dry-run mode: Preview what would happen before executing
- Rollback: Automatic rollback if remediation makes things worse
- Metrics: Track remediation success rates per type/cluster
## Dependencies
- Incident entity with `autoRemediationEnabled` field ✅
- Remediation entity ✅
- Slack integration (next task)
- Daemon polling implementation (client-side)