Purpose

SentinelMesh is a production-grade distributed uptime monitoring system built to detect downtime with high precision and a low false-positive rate. It solves the problem of unreliable monitoring by implementing a multi-stage verification process, ensuring that a single network blip doesn't wake an engineer at 3 AM.
Unlike simple cron-based scripts, SentinelMesh is designed as a distributed system that prioritizes reliability, data integrity, and alert hygiene. It provides a clean, fast dashboard for observability and a robust backend for execution.
Problem Space
- Reliability: Monitoring systems must be more reliable than the systems they monitor.
- Noise: Eliminating "flapping" and false positives via verification.
- Scale: Handling high-throughput check execution horizontally.
- Backend as Source of Truth: The frontend is strictly a view layer. All uptime calculations, incident inferences, and state transitions happen on the backend. This ensures consistency across API and UI.
- Idempotency by Default: All scheduling and alerting operations are idempotent. A worker crashing halfway through a job or a scheduler restarting will not result in duplicate alerts or corrupted state.
- Async over Sync: Heavy lifting (HTTP checks, notifications) is offloaded to background queues (Redis/BullMQ). The API remains responsive by never blocking on external I/O (see the sketch after this list).
- Append-Only Check Data: Check results are immutable facts. We never update a past check; we only append new ones. This provides an audit trail and simplifies concurrency.
- Separation of Concerns: The Scheduler schedules, the Worker executes, and the Verifier decides. This decoupling allows us to scale ingestion (Workers) independently of decision logic (Verifier).
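As a rough illustration of the Async over Sync principle, a route can enqueue a check and return immediately; the route path, queue name, and payload shape below are assumptions for the sketch, not the project's actual code.

```typescript
import Fastify from "fastify";
import { Queue } from "bullmq";

// Illustrative names; the real route and queue may differ.
const app = Fastify();
const checkQueue = new Queue("checks", {
  connection: { host: "localhost", port: 6379 },
});

app.post("/monitors/:id/run-now", async (request, reply) => {
  const { id } = request.params as { id: string };

  // Enqueue the check and return at once; a Worker performs the HTTP call.
  await checkQueue.add("check_monitor", { monitorId: id });

  // 202 Accepted: the work is queued, not done.
  return reply.code(202).send({ queued: true });
});

await app.listen({ port: 3001 });
```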
The system is composed of four distinct components:
- API Service (Fastify): Handles CRUD operations for monitors, serves the dashboard data, and manages authentication. It creates "Monitor" records but does not execute checks.
- Scheduler: A lightweight, reliable clock that enqueues jobs into Redis based on each monitor's interval. It does not execute code; it only dispatches a generic "Check Job" intent.
- Worker Nodes: Stateless consumers that pull jobs from Redis. They perform the actual HTTP/TCP requests to target URLs, capture the latency/status, and write raw "Check Results" to the database. They use the Fetch API for standardized networking.
- Verifier / Alerting Engine: An event-driven component that analyzes the stream of incoming "Check Results". It manages the state machine (UP -> DOWN) and triggers side effects (Alerts) only when state transitions occur.
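A minimal sketch of the transition logic described above; the `MonitorState` type, failure threshold, and function shape are illustrative assumptions rather than the actual Verifier code.

```typescript
type MonitorState = "UP" | "DOWN";

interface CheckOutcome {
  ok: boolean; // e.g., HTTP 2xx within the timeout
}

// Side effects (alerts) fire only when `transition` is true,
// never on a "still down" result.
function nextState(
  current: MonitorState,
  consecutiveFailures: number,
  result: CheckOutcome,
  threshold = 3, // consecutive failures required to confirm downtime
): { state: MonitorState; failures: number; transition: boolean } {
  if (result.ok) {
    return { state: "UP", failures: 0, transition: current === "DOWN" };
  }
  const failures = consecutiveFailures + 1;
  const confirmed = failures >= threshold;
  return {
    state: confirmed ? "DOWN" : current,
    failures,
    transition: confirmed && current === "UP",
  };
}
```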
- Monitors (Postgres): Configuration data (URL, Interval). Owned by User/API.
- Check Results (Postgres/Timescale): Append-only immutable log of every execution. High volume. Written by Workers. Read by Verifier/Dashboard.
- Incidents (Postgres): Represents a period of downtime. Stateful (OPEN -> RESOLVED). Owned by Verifier.
- Alerts (Postgres): Idempotent side-effect log. Ensures we don't send the same email twice.
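Sketched as TypeScript interfaces, the four record types might look like this; the field names are assumptions inferred from the descriptions above, not the actual schema.

```typescript
interface Monitor {
  id: string;
  url: string;
  intervalSeconds: number; // 10s / 60s / 5m cadence
  nextCheckAt: Date;       // indexed; used by the Scheduler
}

interface CheckResult {
  monitorId: string;
  status: number;    // e.g., 200
  latencyMs: number; // e.g., 45
  checkedAt: Date;   // append-only: rows are never updated
}

interface Incident {
  id: string;
  monitorId: string;
  status: "OPEN" | "RESOLVED";
  openedAt: Date;
  resolvedAt?: Date;
}

interface Alert {
  incidentId: string;
  channel: "email" | "slack";
  sentAt: Date; // used for deduplication
}
```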
Flow of Data:
Monitor Config -> Scheduler -> Worker Execution -> Check Result (Immutable) -> Verifier -> Incident (Mutable State) -> Alert
- Job Dispatch: The Scheduler wakes up, identifies monitors due for execution, and pushes a `check_monitor` job to the Redis Queue with a unique `jobId` (MonitorID + Timestamp).
- Execution: An available Worker pops the job and performs a network request to `example.com` (see the worker sketch after this list).
- Result Ingestion: The Worker saves the result (e.g., Status: 200, Latency: 45ms) to the `CheckResult` table.
- Verification: The Verifier observes this new result.
  - If the monitor is UP and the result is 200: do nothing.
  - If the monitor is UP and the result is 500: trigger "Verification Mode" (e.g., retry from a different region) or immediately create an Incident.
- State Transition: If downtime is confirmed, an `Incident` record is created (Status: OPEN).
- Notification: The Alerting Engine sees the new Incident and enqueues a notification job (Email/Slack).
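A condensed sketch of the Execution and Result Ingestion steps, assuming a `pg` connection pool, the `checks` queue name, and a `check_results` table; the timeout and column names are illustrative.

```typescript
import { Worker } from "bullmq";
import { Pool } from "pg";

const pool = new Pool(); // reads PG* environment variables

// Stateless consumer: pulls check_monitor jobs from Redis.
const worker = new Worker(
  "checks",
  async (job) => {
    const { monitorId, url } = job.data as { monitorId: string; url: string };
    const started = Date.now();
    let status = 0; // 0 = network-level failure (timeout, DNS, refused)

    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
      status = res.status;
    } catch {
      // Leave status = 0; the Verifier treats it as a failed check.
    }

    // Append-only: every execution inserts a new immutable row.
    await pool.query(
      `INSERT INTO check_results (monitor_id, status, latency_ms, checked_at)
       VALUES ($1, $2, $3, now())`,
      [monitorId, status, Date.now() - started],
    );
  },
  { connection: { host: "localhost", port: 6379 } },
);
```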
- Jobs: We use BullMQ (Redis) for a robust distributed priority queue.
- Idempotency Keys: Every job is deduped using the `monitorId` and the time slot. If the scheduler restarts, it attempts to re-queue the same slot, but Redis rejects the duplicate ID (sketched below).
- Per-Monitor Intervals: Unlike a global loop, every monitor has its own cadence (10s, 60s, 5m). We use a `next_check_at` timestamp index to efficiently query due monitors.
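One scheduler tick could look like the sketch below. The deterministic `jobId` is what makes a restart-time re-queue a no-op; the `monitors` table layout is an assumption carried over from earlier sketches.

```typescript
import { Queue } from "bullmq";
import { Pool } from "pg";

const pool = new Pool();
const checkQueue = new Queue("checks", {
  connection: { host: "localhost", port: 6379 },
});

async function tick(): Promise<void> {
  // Find due monitors via the next_check_at index and bump them atomically.
  const { rows } = await pool.query(
    `UPDATE monitors
     SET next_check_at = now() + (interval_seconds * interval '1 second')
     WHERE next_check_at <= now()
     RETURNING id, url, interval_seconds`,
  );

  for (const m of rows) {
    // Time slot: current time rounded down to the monitor's interval.
    const slot = Math.floor(Date.now() / 1000 / m.interval_seconds);

    // Deterministic jobId: re-adding the same (monitor, slot) is rejected.
    await checkQueue.add(
      "check_monitor",
      { monitorId: m.id, url: m.url },
      { jobId: `${m.id}:${slot}` },
    );
  }
}
```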
- Worker Crashes: If a worker dies mid-request, the job's lock expires and BullMQ marks it as stalled; the job is eventually picked up by another worker.
- Database Outage: Workers function in "local buffer" mode or fail fast. The Queue acts as a buffer—jobs pile up in Redis until the DB recovers.
- False Positives: We implement retry logic before declaring an incident. A single timeout does not trigger a page; 2-3 consecutive failures do.
- State-Based Alerting: We only alert on transitions (UP -> DOWN, DOWN -> UP). We never alert on "Still Down". This radically reduces noise.
- Deduplication: Before sending an email, we check the `Alerts` table. If an alert for `Incident #123` was sent < 5 minutes ago, we suppress the new one (see the sketch after this list).
- Grace Period: New monitors start in a "Pending" state to prevent alerts during initial configuration.
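A sketch of that suppression check against the `alerts` table; the 5-minute window comes from the text above, while the `sendEmail` stub and column names are illustrative.

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Stand-in for the real email/Slack transport.
async function sendEmail(incidentId: string): Promise<void> {
  console.log(`alert sent for incident ${incidentId}`);
}

async function alertOnce(incidentId: string): Promise<void> {
  // Suppress if an alert for this incident went out in the last 5 minutes.
  const recent = await pool.query(
    `SELECT 1 FROM alerts
     WHERE incident_id = $1 AND sent_at > now() - interval '5 minutes'`,
    [incidentId],
  );
  if (recent.rowCount && recent.rowCount > 0) return;

  // Log the side effect first: a crash after sending can no longer
  // produce a duplicate within the window. Trade-off: a crash between
  // the INSERT and the send drops this one alert.
  await pool.query(
    `INSERT INTO alerts (incident_id, channel, sent_at)
     VALUES ($1, 'email', now())`,
    [incidentId],
  );
  await sendEmail(incidentId);
}
```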
| Decision | Why we did it | Trade-off |
|---|---|---|
| Polling vs Evented (Frontend) | Simple to implement, works well with React Query. | Slightly delayed updates (up to 10s) vs real-time sockets. |
| Postgres for Time Series | Simplifies stack (one DB). Good enough for <100k monitors. | Storage cost higher than dedicated TSDB (Prometheus/ClickHouse) at massive scale. |
| In-Memory Auth | Strong security: no client-side token persistence. | User must log in on every refresh (acceptable for a security-first MVP). |
| Single Region Checks | Simplicity. | Cannot detect regional outages or verify global availability. |
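The "Polling vs Evented" row maps to a small React Query hook; the endpoint URL and hook name below are assumptions.

```typescript
import { useQuery } from "@tanstack/react-query";

// Poll the backend every 10s instead of holding a socket open.
export function useMonitors() {
  return useQuery({
    queryKey: ["monitors"],
    queryFn: async () => {
      const res = await fetch("http://localhost:3001/monitors");
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.json();
    },
    refetchInterval: 10_000, // matches the "up to 10s" delay in the table
  });
}
```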
- Scheduler: Can be sharded by `Monitor ID` (Monitors 1-1000 -> Scheduler A, etc.) to handle millions of monitors.
- Workers: Horizontally scalable. Just add more containers/pods consuming the Redis queue.
- Database:
  - Read Replicas: For dashboard queries.
  - Partitioning: The `CheckResults` table should be partitioned by time (e.g., monthly tables) to maintain query performance as history grows (see the migration sketch after this list).
  - Archival: Move old check results to cold storage (S3/Parquet) after 30 days.
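A migration-style sketch of that time partitioning, run through `pg`; table and column names follow the assumptions used in earlier sketches, and partitions would be created ahead of time by a maintenance job.

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Declarative range partitioning on the check timestamp.
async function migrate(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS check_results (
      monitor_id  uuid        NOT NULL,
      status      int         NOT NULL,
      latency_ms  int         NOT NULL,
      checked_at  timestamptz NOT NULL
    ) PARTITION BY RANGE (checked_at);

    -- One partition per month keeps indexes small and drops cheap.
    CREATE TABLE IF NOT EXISTS check_results_2025_01
      PARTITION OF check_results
      FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
  `);
}
```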
- JWT in Memory: We deliberately avoid `localStorage` to prevent XSS attacks from leaking tokens.
- Read-Only Frontend: The frontend has zero business logic. It simply renders what the backend sends.
- Input Validation: Strict `zod` schemas on both frontend forms and backend API endpoints (see the sketch after this list).
- No Secrets: All API keys and connection strings are strictly environment variables, never committed.
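A sketch of a `zod` schema shared between the Fastify API and the Next.js forms; the field names and limits are illustrative.

```typescript
import { z } from "zod";

// Shared between the API and the web app (e.g., via a core package).
export const MonitorInput = z.object({
  url: z.string().url(),
  intervalSeconds: z.number().int().min(10).max(3600),
});

export type MonitorInput = z.infer<typeof MonitorInput>;

// Backend usage: parse throws on bad input, which maps to a 400 response.
export function parseMonitorInput(body: unknown): MonitorInput {
  return MonitorInput.parse(body);
}
```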
- Structured Logging: JSON logs with `monitorId` and `jobId` context allow us to trace a specific check through the entire pipeline.
- Correlation IDs: Every request has a `request-id` passed from Nginx -> API -> DB.
- Health Checks: A `/health` endpoint exposes internal component status (Redis connection, DB lag) for k8s liveness probes.
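A sketch of that `/health` endpoint, assuming `ioredis` and `pg` clients; the response shape is an assumption.

```typescript
import Fastify from "fastify";
import Redis from "ioredis";
import { Pool } from "pg";

const app = Fastify({ logger: true }); // pino JSON logs out of the box
const redis = new Redis();
const pool = new Pool();

app.get("/health", async (_req, reply) => {
  // Probe each dependency independently so the payload shows what failed.
  const [redisOk, dbOk] = await Promise.all([
    redis.ping().then(() => true).catch(() => false),
    pool.query("SELECT 1").then(() => true).catch(() => false),
  ]);

  const healthy = redisOk && dbOk;
  return reply.code(healthy ? 200 : 503).send({ redis: redisOk, db: dbOk });
});

await app.listen({ port: 3001 });
```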
- Distributed Systems Implementation: Shows ability to reason about queues, workers, and concurrency.
- Full Stack Proficiency: From Next.js UI polish to Node.js backend architecture.
- Operational Maturity: Focus on reliability, idempotency, and "day 2" operations over feature breadth.
- Clean Architecture: Separation of concerns and type safety (TypeScript) across the stack.
Prerequisites
- Node.js 20+
- Docker & Docker Compose (for Postgres/Redis)
- pnpm
Setup
- Start Infrastructure:
  ```bash
  docker-compose up -d
  ```
- Environment Variables:
  ```bash
  cp .env.example .env
  ```
- Install & Build:
  ```bash
  pnpm install
  pnpm build
  ```
  (This builds core packages, api, and web)
- Run Development:
  ```bash
  pnpm dev
  ```
  - Frontend: http://localhost:3000
  - Backend: http://localhost:3001