Kubernetes pod monitoring daemon with automated failure detection, diagnosis, and Slack alerting.
OpsCtrl Daemon watches your Kubernetes cluster for pod failures and automatically diagnoses root causes using LLM-powered analysis. When issues occur, it sends detailed alerts with remediation suggestions directly to Slack.
Key Features:
- Real-time detection of CrashLoopBackOff, OOMKill, ImagePull failures, and more
- AI-powered root cause analysis with actionable fix suggestions
- Slack integration for instant incident notifications
- Read-only operation - never execs into containers or accesses secrets
- Lightweight single-replica deployment
- Kubernetes 1.21+
- Helm 3.x
- An OpsCtrl account (sign up free)
# 1. Add the OpsCtrl Helm repository
helm repo add opsctrl https://charts.opsctrl.dev
helm repo update
# 2. Create the namespace
kubectl create namespace opsctrl
# 3. Install OpsCtrl Daemon
helm install opsctrl-daemon opsctrl/opsctrl-daemon \
--namespace opsctrl \
--set clusterRegistration.clusterName="my-cluster" \
--set clusterRegistration.userEmail="you@example.com" \
--set monitoring.watchNamespaces="default"

# Check the pod is running
kubectl get pods -n opsctrl
# View logs to confirm monitoring started
kubectl logs -n opsctrl -l app.kubernetes.io/name=opsctrl-daemon -f

You should see:
Starting opsctrl-daemon...
Cluster registration is required before starting monitoring...
Cluster registered successfully: <cluster-id>
Monitoring started for 1 namespaces
Monitor multiple namespaces
helm install opsctrl-daemon opsctrl/opsctrl-daemon \
--namespace opsctrl \
--set clusterRegistration.clusterName="my-cluster" \
--set clusterRegistration.userEmail="you@example.com" \
--set monitoring.watchNamespaces="default\,staging\,production" \
--set secrets.existingSecret="opsctrl-secrets"

Note: Escape commas with \, in --set or use a values file instead.
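When installs are scripted, the escaping rule in the note above can be applied mechanically. The helper below is a minimal sketch, not part of OpsCtrl: `helmSetArg` is a hypothetical name, and it only assumes that Helm's `--set` treats an unescaped comma as a separator.

```typescript
// Hypothetical helper (not part of OpsCtrl): joins list items with an
// escaped comma so Helm's --set parser keeps them as one value.
function helmSetArg(key: string, values: string[]): string {
  // "\\," in source is the two characters backslash + comma in the output.
  return `--set ${key}="${values.join("\\,")}"`;
}

console.log(helmSetArg("monitoring.watchNamespaces", ["default", "staging", "production"]));
```

For longer lists, a values file avoids the escaping entirely.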
Using a values file
Create my-values.yaml:
clusterRegistration:
  clusterName: "production-cluster"
  userEmail: "platform-team@company.com"
monitoring:
  watchNamespaces: "default,staging,production"
  excludeNamespaces: "kube-system,kube-public"
  minRestartThreshold: 3
secrets:
  existingSecret: "opsctrl-secrets"

Install with:
helm install opsctrl-daemon opsctrl/opsctrl-daemon \
--namespace opsctrl \
-f my-values.yaml

Upgrade an existing installation
helm repo update
helm upgrade opsctrl-daemon opsctrl/opsctrl-daemon \
--namespace opsctrl \
--reuse-values

Uninstall
helm uninstall opsctrl-daemon --namespace opsctrl
kubectl delete namespace opsctrl

Alternative installation with a plain manifest (without Helm):
kubectl apply -f https://raw.githubusercontent.com/Hillyon-Labs/opsctrl_daemon/main/k8s-deployment.yaml

See values.yaml for all configuration options.
| Variable | Description | Required | Default |
|---|---|---|---|
| WATCH_NAMESPACES | Comma-separated namespaces to monitor | Yes | - |
| OPSCTRL_BACKEND_URL | Backend API URL | Yes | - |
| CLUSTER_NAME | Unique cluster identifier | No | - |
| WEBHOOK_URL | Slack webhook URL for alerts | No | - |
| MIN_RESTART_THRESHOLD | Container restarts before alerting | No | 3 |
| LOG_LEVEL | Logging verbosity (error, warn, info, debug) | No | info |
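To make the required fields and defaults in the table concrete, here is a minimal sketch of how such variables could be read and validated. The variable names and defaults come from the table; `loadConfig`, `DaemonConfig`, and the validation rules are illustrative assumptions, not the daemon's actual code.

```typescript
// Illustrative only: names of the function and types are assumptions.
type LogLevel = "error" | "warn" | "info" | "debug";

interface DaemonConfig {
  watchNamespaces: string[];
  backendUrl: string;
  clusterName?: string;
  webhookUrl?: string;
  minRestartThreshold: number;
  logLevel: LogLevel;
}

const LOG_LEVELS: readonly string[] = ["error", "warn", "info", "debug"];

function loadConfig(env: Record<string, string | undefined>): DaemonConfig {
  // Required: comma-separated list, ignoring blanks and surrounding spaces.
  const watchNamespaces = (env.WATCH_NAMESPACES ?? "")
    .split(",")
    .map((ns) => ns.trim())
    .filter((ns) => ns.length > 0);
  if (watchNamespaces.length === 0) {
    throw new Error("WATCH_NAMESPACES is required");
  }

  // Required: backend API URL.
  const backendUrl = env.OPSCTRL_BACKEND_URL;
  if (!backendUrl) {
    throw new Error("OPSCTRL_BACKEND_URL is required");
  }

  // Optional with default 3.
  const rawThreshold = env.MIN_RESTART_THRESHOLD ?? "3";
  const minRestartThreshold = Number(rawThreshold);
  if (!Number.isInteger(minRestartThreshold) || minRestartThreshold < 0) {
    throw new Error(`MIN_RESTART_THRESHOLD must be a non-negative integer, got "${rawThreshold}"`);
  }

  // Optional with default "info".
  const logLevel = env.LOG_LEVEL ?? "info";
  if (LOG_LEVELS.indexOf(logLevel) === -1) {
    throw new Error(`LOG_LEVEL must be one of ${LOG_LEVELS.join(", ")}`);
  }

  return {
    watchNamespaces,
    backendUrl,
    clusterName: env.CLUSTER_NAME,
    webhookUrl: env.WEBHOOK_URL,
    minRestartThreshold,
    logLevel: logLevel as LogLevel,
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so that a missing required variable fails fast.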
| Failure Type | Description |
|---|---|
| CrashLoopBackOff | Container repeatedly crashing |
| OOMKilled | Out of memory termination |
| ImagePullBackOff | Failed to pull container image |
| Pending | Pod stuck in pending state |
| Failed | Pod entered failed phase |
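The failure types above correspond to fields that Kubernetes reports in a pod's status: `phase`, and per container the `state.waiting.reason` and `state.terminated.reason` / `lastState.terminated.reason` values. The sketch below shows one way to map a status to a failure type. It is an assumption about how detection could work, not the daemon's implementation; `classifyFailure` and the `*Like` types are hypothetical.

```typescript
type FailureType = "CrashLoopBackOff" | "OOMKilled" | "ImagePullBackOff" | "Pending" | "Failed";

// Minimal subset of the Kubernetes PodStatus shape needed for this sketch.
interface ContainerStatusLike {
  restartCount: number;
  state?: { waiting?: { reason?: string }; terminated?: { reason?: string } };
  lastState?: { terminated?: { reason?: string } };
}

interface PodStatusLike {
  phase?: string;
  containerStatuses?: ContainerStatusLike[];
}

function classifyFailure(status: PodStatusLike): FailureType | null {
  for (const cs of status.containerStatuses ?? []) {
    const terminated = cs.state?.terminated?.reason ?? cs.lastState?.terminated?.reason;
    const waiting = cs.state?.waiting?.reason;
    // Check OOMKilled first: an OOM-killed container that keeps restarting
    // also reports CrashLoopBackOff as its waiting reason.
    if (terminated === "OOMKilled") return "OOMKilled";
    if (waiting === "CrashLoopBackOff") return "CrashLoopBackOff";
    if (waiting === "ImagePullBackOff") return "ImagePullBackOff";
  }
  if (status.phase === "Failed") return "Failed";
  // Simplification: a real detector would also require the pod to have been
  // Pending for some minimum time before treating it as stuck.
  if (status.phase === "Pending") return "Pending";
  return null;
}

const example: PodStatusLike = {
  phase: "Running",
  containerStatuses: [
    {
      restartCount: 5,
      state: { waiting: { reason: "CrashLoopBackOff" } },
      lastState: { terminated: { reason: "OOMKilled" } },
    },
  ],
};
console.log(classifyFailure(example));
```

Combining this with the restart threshold (for example, only alerting when `restartCount` reaches `MIN_RESTART_THRESHOLD`) would be a natural next step.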
kubectl logs -n monitoring -l app.kubernetes.io/name=opsctrl-daemon -f

kubectl port-forward -n monitoring svc/opsctrl-daemon 3000:3000
curl http://localhost:3000/health

When a failure is detected, you'll receive alerts like:
CrashLoopBackOff in orders-api (production)
Root Cause: Readiness probe failing on /healthz - connection timeout after 1s
Suggested Fix:
kubectl patch deployment orders-api -n production \
--type='json' \
-p='[{"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":5}]'
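An alert like the one above can be delivered through a Slack incoming webhook, which accepts a JSON body with a `text` field. The following is a rough sketch of building that body. The `Diagnosis` shape and `buildSlackPayload` are hypothetical and do not reflect the actual message format OpsCtrl sends.

```typescript
// Hypothetical diagnosis shape; field names are illustrative assumptions.
interface Diagnosis {
  failureType: string;
  workload: string;
  namespace: string;
  rootCause: string;
  suggestedFix: string;
}

// Build the JSON body for a Slack incoming webhook.
function buildSlackPayload(d: Diagnosis): { text: string } {
  // Slack renders text between triple backticks as a code block.
  const codeFence = "`".repeat(3);
  const lines = [
    `*${d.failureType} in ${d.workload} (${d.namespace})*`,
    `*Root Cause:* ${d.rootCause}`,
    "*Suggested Fix:*",
    `${codeFence}\n${d.suggestedFix}\n${codeFence}`,
  ];
  return { text: lines.join("\n") };
}

const payload = buildSlackPayload({
  failureType: "CrashLoopBackOff",
  workload: "orders-api",
  namespace: "production",
  rootCause: "Readiness probe failing on /healthz - connection timeout after 1s",
  suggestedFix: "kubectl patch deployment orders-api -n production ...",
});
console.log(JSON.stringify(payload, null, 2));
```

Sending it is then a single HTTP POST of `JSON.stringify(payload)` with `Content-Type: application/json` to the URL configured in `WEBHOOK_URL`.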
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│  │    Pod A    │    │    Pod B    │    │    Pod C    │      │
│  └─────────────┘    └─────────────┘    └─────────────┘      │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            ▼                                │
│                   ┌─────────────────┐                       │
│                   │ OpsCtrl Daemon  │ ◄── Watch API         │
│                   └────────┬────────┘                       │
└────────────────────────────┼────────────────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │ OpsCtrl Backend │ ◄── LLM Analysis
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │      Slack      │
                    └─────────────────┘
- Node.js 20+
- npm
- Access to a Kubernetes cluster (local or remote)
# Clone the repository
git clone https://github.com/Hillyon-Labs/opsctrl_daemon.git
cd opsctrl_daemon
# Install dependencies
npm install
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# Run in development mode
npm run dev

# Run tests
npm test
# Run tests with coverage
npm run test:coverage

# Build TypeScript
npm run build
# Build Docker image
docker build -t opsctrl/daemon:local .

The daemon operates in read-only mode and requires minimal permissions:
| Resource | Verbs | Purpose |
|---|---|---|
| pods | get, list, watch | Monitor pod status |
| namespaces | get, list, watch | Namespace filtering |
| events | get, list, watch, create | Failure detection |
| leases | get, list, watch, create, update, patch | Leader election |
The daemon does not:
- Execute into containers
- Access Secrets or ConfigMaps
- Modify any workloads
- Send container logs externally
Contributions are welcome! Please read our Contributing Guide before submitting a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- GitHub Issues - Bug reports and feature requests
- Documentation - Full documentation
This project is licensed under the MIT License - see the LICENSE file for details.
Built with care by Hillyon Labs