# User Story: In-Cluster Remediation System

## Overview

As a DevOps engineer using OpsCtrl, I want the in-cluster daemon to automatically remediate known issues so that common problems are fixed without manual intervention, reducing MTTR and on-call burden.

---

## Actors

- **Daemon** - Runs in client cluster, monitors pods, reports incidents, executes remediations
- **Backend** - Stores incidents, runs LLM diagnosis, manages remediation queue
- **User** - Receives Slack notifications, can trigger manual remediation

---

## User Stories

### 1. Automatic Remediation for Known Issues

**As a** DevOps engineer
**I want** the daemon to automatically fix common issues (like CrashLoopBackOff due to OOM)
**So that** I don't get paged for problems with known solutions

**Acceptance Criteria:**
- Daemon polls backend for pending remediations every 15-30 seconds
- Only predefined safe actions execute automatically:
  - `restart_pod` - Delete pod to trigger restart
  - `scale_deployment` - Adjust replica count
- Remediation status tracked: `pending` → `running` → `succeeded`/`failed`
- Incident updated with remediation outcome

---

### 2. Manual Remediation via Slack

**As a** DevOps engineer
**I want** to click "Fix Now" in a Slack notification
**So that** I can trigger a remediation without logging into any dashboard

**Acceptance Criteria:**
- Slack notification includes "Fix Now" button for actionable incidents
- Button click creates remediation record with `status: pending`
- Daemon picks up and executes on next poll
- Slack thread updated with remediation result

---

### 3. Approval Required for Destructive Actions

**As a** platform engineer
**I want** destructive actions to require approval
**So that** LLM-suggested commands don't accidentally break production

**Acceptance Criteria:**
- These actions require approval before execution:
  - `rollback_deployment`
  - `update_resource_limits`
  - `execute_command` (arbitrary command)
  - `custom` (any LLM-suggested fix)
- Approval can be granted via Slack button or dashboard
- Remediation stays in `pending` until approved, then moves to `approved`
- Daemon only executes `approved` or auto-approved remediations
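The approval rules above can be sketched as a small status check. This is a minimal illustration, not the backend's actual logic: the transition table and the `requiresApproval` flag are assumptions, and only the status names come from this document.

```typescript
// Status names as listed in the Data Model section.
type RemediationStatus = "pending" | "approved" | "running" | "succeeded" | "failed";

// Hypothetical transition table. The document only names the happy path,
// so the exact set of allowed moves here is an assumption.
const ALLOWED_TRANSITIONS: Record<RemediationStatus, RemediationStatus[]> = {
  pending: ["approved", "running"], // straight to "running" only for auto-approved types
  approved: ["running"],
  running: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
};

function canTransition(from: RemediationStatus, to: RemediationStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

// `requiresApproval` is assumed to be derived from the remediation type
// (for example, true for rollback_deployment).
function isExecutable(status: RemediationStatus, requiresApproval: boolean): boolean {
  if (status === "approved") return true;
  return status === "pending" && !requiresApproval;
}
```

Keeping the check in one place means the daemon and the backend can apply the same rule before a remediation is claimed.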

---

### 4. Remediation History & Audit Trail

**As a** platform engineer
**I want** to see all remediation attempts for an incident
**So that** I can understand what was tried and why it failed

**Acceptance Criteria:**
- Each remediation attempt stored as separate record
- Tracks: command executed, output, exit code, duration, executor
- Incident shows `remediationCount` and `lastRemediationAt`
- Failed attempts don't block new attempts (up to `maxAttempts`)
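As a rough sketch of how the incident counters could be derived from the per-attempt records: `AttemptRecord`, `summarizeAttempts`, and the `canRetry` field are hypothetical names, while `remediationCount`, `lastRemediationAt`, and `maxAttempts` come from the criteria above.

```typescript
// Hypothetical, trimmed-down view of one stored remediation attempt.
interface AttemptRecord {
  attemptNumber: number;
  status: "succeeded" | "failed";
  completedAt: Date;
}

interface IncidentRemediationSummary {
  remediationCount: number;
  lastRemediationAt: Date | null;
  canRetry: boolean; // assumption: a new attempt is allowed while count < maxAttempts
}

function summarizeAttempts(
  attempts: AttemptRecord[],
  maxAttempts: number
): IncidentRemediationSummary {
  let last: Date | null = null;
  for (const attempt of attempts) {
    // Track the most recent completion time across all attempts.
    if (last === null || attempt.completedAt.getTime() > last.getTime()) {
      last = attempt.completedAt;
    }
  }
  return {
    remediationCount: attempts.length,
    lastRemediationAt: last,
    canRetry: attempts.length < maxAttempts,
  };
}
```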

---

## Technical Design

### Authentication

Daemon uses existing cluster-scoped refresh tokens (device flow). No new auth mechanism needed.

### API Endpoints

```
GET  /clusters/:clusterId/remediations?status=pending,approved
     → Returns remediations ready for execution (approved, or pending with an auto-executable type)

POST /clusters/:clusterId/remediations/:id/claim
     → Marks as 'running', sets startedAt, returns full remediation details

POST /clusters/:clusterId/remediations/:id/complete
     → Body: { status, outputLog, errorMessage, exitCode }
     → Updates remediation, updates incident counters
```
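For illustration, the three endpoints could be described on the daemon side as plain request specs. The paths and body fields come from the listing above; `RequestSpec`, the builder function names, and the field types in `CompletePayload` are assumptions.

```typescript
// Body of the /complete call; field names come from the endpoint listing,
// the exact types are assumed.
interface CompletePayload {
  status: "succeeded" | "failed";
  outputLog: string;
  errorMessage: string | null;
  exitCode: number;
}

// Hypothetical transport-agnostic description of one HTTP call.
interface RequestSpec {
  method: "GET" | "POST";
  path: string;
  body?: CompletePayload;
}

function remediationsBase(clusterId: string): string {
  return "/clusters/" + encodeURIComponent(clusterId) + "/remediations";
}

function listReadyRequest(clusterId: string): RequestSpec {
  return { method: "GET", path: remediationsBase(clusterId) + "?status=pending,approved" };
}

function claimRequest(clusterId: string, id: string): RequestSpec {
  return {
    method: "POST",
    path: remediationsBase(clusterId) + "/" + encodeURIComponent(id) + "/claim",
  };
}

function completeRequest(clusterId: string, id: string, payload: CompletePayload): RequestSpec {
  return {
    method: "POST",
    path: remediationsBase(clusterId) + "/" + encodeURIComponent(id) + "/complete",
    body: payload,
  };
}
```

Building the request separately from sending it keeps the sketch independent of any particular HTTP client and makes the paths easy to unit test.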

### Remediation Types (Predefined & Safe)

| Type | Auto-Execute | Description |
|------|--------------|-------------|
| `restart_pod` | Yes | `kubectl delete pod <name>` |
| `scale_deployment` | Yes | `kubectl scale deployment <name> --replicas=<n>` |
| `rollback_deployment` | No (approval) | `kubectl rollout undo deployment <name>` |
| `update_resource_limits` | No (approval) | Patch deployment with new limits |
| `execute_command` | No (approval) | Run arbitrary command |
| `custom` | No (approval) | LLM-suggested custom fix |
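One way to encode the table in the daemon is a lookup for the auto-execute flag plus a builder for the three fixed `kubectl` commands. This is a sketch under assumptions: `TargetParams` and `buildKubectlArgs` are hypothetical names, and the `-n <namespace>` flag is added here because the data model carries `targetNamespace`; the table itself does not show it.

```typescript
type RemediationType =
  | "restart_pod"
  | "scale_deployment"
  | "rollback_deployment"
  | "update_resource_limits"
  | "execute_command"
  | "custom";

// Mirrors the Auto-Execute column of the table above.
const AUTO_EXECUTE: Record<RemediationType, boolean> = {
  restart_pod: true,
  scale_deployment: true,
  rollback_deployment: false,
  update_resource_limits: false,
  execute_command: false,
  custom: false,
};

// Hypothetical parameter shape for the types that map to a fixed command.
interface TargetParams {
  name: string;
  namespace: string;
  replicas?: number;
}

// Returns kubectl arguments as an array (no shell string), or null for
// types that have no fixed command in the table.
function buildKubectlArgs(type: RemediationType, p: TargetParams): string[] | null {
  switch (type) {
    case "restart_pod":
      return ["delete", "pod", p.name, "-n", p.namespace];
    case "scale_deployment": {
      const n = p.replicas;
      if (n === undefined || n < 0 || Math.floor(n) !== n) {
        throw new Error("scale_deployment needs a non-negative integer replicas value");
      }
      return ["scale", "deployment", p.name, "--replicas=" + n, "-n", p.namespace];
    }
    case "rollback_deployment":
      return ["rollout", "undo", "deployment", p.name, "-n", p.namespace];
    default:
      return null;
  }
}
```

Passing an argument array rather than a concatenated string avoids shell interpolation of pod or deployment names.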

### Polling Flow

```
┌─────────────────────────────────────────────────────────────┐
│ Daemon (in client cluster)                                  │
│                                                             │
│  every 15-30s:                                              │
│    1. GET /clusters/:id/remediations?status=pending,approved│
│    2. For each remediation:                                 │
│       a. POST /claim (mark as running)                      │
│       b. Execute command based on type                      │
│       c. POST /complete (report result)                     │
│    3. Sleep                                                 │
└─────────────────────────────────────────────────────────────┘
```
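A single pass of the loop above might look like the following sketch. `RemediationApi`, `Executor`, and `runPollCycle` are hypothetical names, the backend is assumed to return only items the daemon is allowed to run, and a thrown executor error is reported as `failed` with exit code 1 by assumption.

```typescript
interface Remediation {
  id: string;
  type: string;
  status: "pending" | "approved";
}

interface ExecutionResult {
  exitCode: number;
  outputLog: string;
}

// Hypothetical wrapper around the three endpoints (list, claim, complete).
interface RemediationApi {
  listReady(): Promise<Remediation[]>;
  claim(id: string): Promise<Remediation>;
  complete(
    id: string,
    result: {
      status: "succeeded" | "failed";
      outputLog: string;
      errorMessage: string | null;
      exitCode: number;
    }
  ): Promise<void>;
}

type Executor = (remediation: Remediation) => Promise<ExecutionResult>;

// Runs steps 1 and 2 of the flow once; the caller sleeps between cycles.
async function runPollCycle(api: RemediationApi, execute: Executor): Promise<number> {
  const ready = await api.listReady();
  let handled = 0;
  for (const item of ready) {
    const claimed = await api.claim(item.id); // backend marks it as running
    let result: ExecutionResult;
    let errorMessage: string | null = null;
    try {
      result = await execute(claimed);
    } catch (err) {
      // Assumption: an executor crash is reported as a failed attempt.
      errorMessage = err instanceof Error ? err.message : String(err);
      result = { exitCode: 1, outputLog: "" };
    }
    await api.complete(claimed.id, {
      status: result.exitCode === 0 && errorMessage === null ? "succeeded" : "failed",
      outputLog: result.outputLog,
      errorMessage: errorMessage,
      exitCode: result.exitCode,
    });
    handled++;
  }
  return handled;
}
```

Remediations are processed one at a time here so that a slow command cannot overlap with the next claim; whether the daemon should run them concurrently is left open.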

### Trigger Points

1. **Auto-trigger**: When incident created with `autoRemediationEnabled: true` and remediation type is auto-executable
2. **Slack button**: User clicks "Fix Now" → POST /incidents/:id/remediate
3. **Dashboard**: User clicks remediate button → POST /incidents/:id/remediate
4. **API**: Direct POST to create remediation

### Safety Guardrails

1. **Type allowlist**: Only predefined types, no arbitrary commands without approval
2. **Max attempts**: Default 3 attempts per incident, prevents infinite loops
3. **Cooldown**: Minimum time between remediation attempts (e.g., 60s)
4. **Cluster scope**: Daemon can only fetch/execute remediations for its own cluster
5. **Audit logging**: All remediation actions logged with full context
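Guardrails 2 to 4 lend themselves to a single pre-execution check. The sketch below assumes the defaults mentioned above (3 attempts, 60s cooldown); `AttemptContext`, `GuardrailDecision`, and `checkGuardrails` are hypothetical names.

```typescript
interface GuardrailConfig {
  maxAttempts: number;
  cooldownMs: number;
}

// Defaults taken from the list above: 3 attempts, 60s cooldown.
const DEFAULT_GUARDRAILS: GuardrailConfig = { maxAttempts: 3, cooldownMs: 60000 };

// Hypothetical inputs for one decision.
interface AttemptContext {
  daemonClusterId: string;
  remediationClusterId: string;
  attemptsSoFar: number;
  lastAttemptAt: Date | null;
}

interface GuardrailDecision {
  allowed: boolean;
  reason: string | null; // populated only when the attempt is blocked
}

function checkGuardrails(
  ctx: AttemptContext,
  now: Date,
  config: GuardrailConfig = DEFAULT_GUARDRAILS
): GuardrailDecision {
  // Cluster scope: a daemon only acts on its own cluster's remediations.
  if (ctx.daemonClusterId !== ctx.remediationClusterId) {
    return { allowed: false, reason: "cluster scope mismatch" };
  }
  // Max attempts: stop once the per-incident limit is reached.
  if (ctx.attemptsSoFar >= config.maxAttempts) {
    return { allowed: false, reason: "max attempts reached" };
  }
  // Cooldown: require a minimum gap since the previous attempt.
  if (
    ctx.lastAttemptAt !== null &&
    now.getTime() - ctx.lastAttemptAt.getTime() < config.cooldownMs
  ) {
    return { allowed: false, reason: "cooldown active" };
  }
  return { allowed: true, reason: null };
}
```

Returning a reason alongside the decision gives the audit log something concrete to record when an attempt is skipped.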

---

## Data Model

### Remediation Entity (existing)

```typescript
{
  id: uuid,
  incidentId: uuid,        // Links to incident
  clusterId: uuid,         // For polling scope

  type: RemediationType,   // restart_pod, scale_deployment, etc.
  status: RemediationStatus, // pending, approved, running, succeeded, failed
  source: RemediationSource, // auto, manual, scheduled

  command: string,         // Actual command to execute
  parameters: jsonb,       // Type-specific params (podName, replicas, etc.)

  approvedAt: timestamp,
  approvedBy: uuid,

  startedAt: timestamp,
  completedAt: timestamp,
  durationMs: number,

  executedBy: string,      // daemon-id or user-id
  targetPod: string,
  targetNamespace: string,

  outputLog: text,
  errorMessage: text,
  exitCode: number,

  attemptNumber: number,
  maxAttempts: number,
  previousAttemptId: uuid  // Links retry chain
}
```

---

## Future Enhancements

- **SSE/WebSocket**: Real-time push instead of polling
- **Runbooks**: Pre-approved multi-step remediation sequences
- **Dry-run mode**: Preview what would happen before executing
- **Rollback**: Automatic rollback if remediation makes things worse
- **Metrics**: Track remediation success rates per type/cluster

---

## Dependencies

- Incident entity with `autoRemediationEnabled` field ✅
- Remediation entity ✅
- Slack integration (next task)
- Daemon polling implementation (client-side)

