Web-based dashboard and controller for managing SkyPilot experiments with declarative desired state management, automatic reconciliation, and persistent job history tracking.
- Declarative State Management: Set desired state (RUNNING/STOPPED/TERMINATED), system automatically reconciles
- Job History Tracking: Complete history of all job executions per experiment
- Dynamic Flag Configuration: Type-safe configuration using flags with autocomplete
- Automatic Reconciliation: Background reconciler ensures current state matches desired state
- SQLite Persistence: All state survives restarts
- Real-time Updates: Web UI polls and displays current status
# Install dependencies
cd packages/skydeck
uv pip install -e .
# Run the dashboard
uv run python -m skydeck.run
# Access at http://localhost:8000

- Data Models (models.py): Job and Experiment Pydantic models
- Database Layer (database.py): SQLite async operations
- Flag Schema (flag_schema.py): Dynamic flag inference from Pydantic models
- State Manager (state_manager.py): Cache of current SkyPilot state
- Desired State Manager (desired_state.py): CRUD for experiments
- Background Poller (poller.py): Polls SkyPilot every 30s
- Reconciler (reconciler.py): Brings current state → desired state
- FastAPI Backend (app.py): REST API and static file serving
- Web UI (static/): Single-page dashboard
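How a background poller and reconciler can run side by side on one event loop is sketched below with plain asyncio. This is an illustration only, not the code in poller.py or reconciler.py: `run_periodically`, `poll_once`, `reconcile_once`, and the short demo intervals are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_periodically(
    task: Callable[[], Awaitable[None]],
    interval_s: float,
    stop: asyncio.Event,
) -> None:
    """Call `task` every `interval_s` seconds until `stop` is set."""
    while not stop.is_set():
        try:
            await task()
        except Exception as exc:  # keep the loop alive if one cycle fails
            print(f"background task failed: {exc!r}")
        try:
            # Sleep for the interval, but wake up early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass


async def demo() -> List[str]:
    """Run a fake poller and reconciler briefly and return what they did."""
    events: List[str] = []
    stop = asyncio.Event()

    async def poll_once() -> None:  # hypothetical stand-in for a SkyPilot refresh
        events.append("poll")

    async def reconcile_once() -> None:  # hypothetical stand-in for reconciliation
        events.append("reconcile")

    tasks = [
        asyncio.create_task(run_periodically(poll_once, 0.01, stop)),
        asyncio.create_task(run_periodically(reconcile_once, 0.02, stop)),
    ]
    await asyncio.sleep(0.05)
    stop.set()
    await asyncio.gather(*tasks)
    return events
```

Waiting on the stop event instead of calling `asyncio.sleep` directly lets both loops shut down promptly rather than finishing a full interval first.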
Experiment: Configuration template that spawns jobs
- Has desired_state (what you want) and current_state (actual status)
- Configuration stored as flags: Dict[str, Union[str, int, float, bool]]
- Only one running job per experiment at a time
Job: Single SkyPilot job execution
- Linked to parent experiment
- Full execution history: timestamps, status, logs, exit code
- Status: INIT, PENDING, RUNNING, SUCCEEDED, FAILED, CANCELLED
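The two concepts above can be sketched as plain data types. The project uses Pydantic models; this sketch swaps in standard-library dataclasses so it runs without dependencies. Field names other than `id`, `name`, `flags`, `desired_state`, `status`, and `exit_code` are assumptions, as are the default desired state, the `ACTIVE_STATUSES` set, and the `active_job` helper.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional, Union

FlagValue = Union[str, int, float, bool]


class DesiredState(str, Enum):
    RUNNING = "RUNNING"
    STOPPED = "STOPPED"
    TERMINATED = "TERMINATED"


class JobStatus(str, Enum):
    INIT = "INIT"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Assumption: a job in any of these statuses occupies the experiment's single slot.
ACTIVE_STATUSES = {JobStatus.INIT, JobStatus.PENDING, JobStatus.RUNNING}


@dataclass
class Job:
    id: str
    experiment_id: str  # link to the parent experiment
    status: JobStatus = JobStatus.INIT
    started_at: Optional[datetime] = None  # hypothetical field name
    ended_at: Optional[datetime] = None  # hypothetical field name
    exit_code: Optional[int] = None


@dataclass
class Experiment:
    id: str
    name: str
    flags: Dict[str, FlagValue] = field(default_factory=dict)
    desired_state: DesiredState = DesiredState.STOPPED  # default is an assumption


def active_job(jobs: List[Job]) -> Optional[Job]:
    """Return the experiment's single active job, or None if there is none."""
    active = [job for job in jobs if job.status in ACTIVE_STATUSES]
    if len(active) > 1:
        raise ValueError("more than one active job for a single experiment")
    return active[0] if active else None
```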
User sets desired_state=RUNNING
↓
Reconciler detects mismatch
↓
Calls sky.launch() with experiment flags
↓
Creates new Job record
↓
Poller updates job status from SkyPilot
↓
Experiment current_state updated
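The mismatch check at the heart of this flow can be written as a small pure function. The sketch below is one possible policy, not the logic in reconciler.py: the `Action` enum and `decide_action` are hypothetical names, and treating a SUCCEEDED job as already satisfying a RUNNING desired state (so a finished run is not relaunched forever) is an assumption.

```python
from enum import Enum
from typing import Optional


class Action(str, Enum):
    LAUNCH = "launch"
    CANCEL = "cancel"
    NONE = "none"


# Job statuses that count as "a job is currently occupying the experiment".
ACTIVE = {"INIT", "PENDING", "RUNNING"}


def decide_action(desired_state: str, current_job_status: Optional[str]) -> Action:
    """Pick the step that moves an experiment toward its desired state.

    `current_job_status` is the status of the experiment's latest job,
    or None if it has never launched one.
    """
    has_active_job = current_job_status in ACTIVE
    if desired_state == "RUNNING":
        # Assumption: a job that already succeeded needs no relaunch.
        if has_active_job or current_job_status == "SUCCEEDED":
            return Action.NONE
        return Action.LAUNCH
    # STOPPED and TERMINATED both mean no job should be running.
    return Action.CANCEL if has_active_job else Action.NONE
```

Keeping the decision free of side effects makes it easy to test; the caller would then map `Action.LAUNCH` to the launch call and job-record creation shown in the flow.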
experiment = Experiment(
    id="ppo_4layer",
    name="PPO 4 Layers",
    flags={
        "trainer.losses.ppo.enabled": True,
        "policy_architecture.core_resnet_layers": 4,
    },
    base_command="lt",
    run_name="daveey.ppo_4layer",
    nodes=4,
    gpus=4,
    desired_state=DesiredState.RUNNING,
)
await db.save_experiment(experiment)

for layers in [1, 4, 16, 64]:
    experiment = Experiment(
        id=f"ppo_{layers}layer",
        name=f"PPO {layers} Layers",
        flags={
            "trainer.losses.ppo.enabled": True,
            "policy_architecture.core_resnet_layers": layers,
        },
        nodes=4,
        gpus=4,
        desired_state=DesiredState.RUNNING,
    )
    await db.save_experiment(experiment)
# Reconciler automatically launches all experiments!

jobs = await db.get_jobs_for_experiment("ppo_4layer", limit=10)
for job in jobs:
    print(f"{job.id}: {job.status} (exit={job.exit_code})")

- GET /api/experiments - List all experiments
- POST /api/experiments - Create new experiment
- GET /api/experiments/{id} - Get experiment details
- DELETE /api/experiments/{id} - Delete experiment
- POST /api/experiments/{id}/state - Update desired state
- GET /api/experiments/{id}/status - Full status with current job
- GET /api/experiments/{id}/jobs - Job history
- POST /api/experiments/{id}/flags - Update flags

- POST /api/jobs/{id}/cancel - Cancel running job

- GET /api/health - System health status
- GET /api/flag-schemas - Flag metadata for autocomplete
- POST /api/refresh - Force SkyPilot state refresh
- POST /api/reconcile - Force reconciliation
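As a rough illustration of calling the state endpoint from a script, the sketch below uses only the standard library. The request body shape (`{"desired_state": ...}`) and the assumption that the response is JSON are guesses, not documented behavior; check app.py for the actual schema. `build_state_request` and `set_desired_state` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the run instructions


def build_state_request(experiment_id: str, desired_state: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/experiments/{id}/state."""
    # Assumption: the endpoint accepts a JSON object with a "desired_state" key.
    payload = json.dumps({"desired_state": desired_state}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/experiments/{experiment_id}/state",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_desired_state(experiment_id: str, desired_state: str) -> dict:
    """Send the request and decode the response, assumed here to be JSON."""
    request = build_state_request(experiment_id, desired_state)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the dashboard running, `set_desired_state("ppo_4layer", "STOPPED")` would ask the reconciler to stop that experiment on its next pass, provided the payload assumption holds.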
Environment variables:
- SKYDECK_DB_PATH: Database file path (default: skydeck.db)
- SKYDECK_POLL_INTERVAL: Poll interval in seconds (default: 30)
- SKYDECK_RECONCILE_INTERVAL: Reconcile interval in seconds (default: 60)
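Reading these variables with their documented defaults might look like the following. This is a minimal sketch: the `Settings` class and `load_settings` function are hypothetical and are not claimed to match how skydeck.run parses its configuration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    db_path: str
    poll_interval_s: int
    reconcile_interval_s: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read SkyDeck settings from the environment, falling back to the defaults."""
    return Settings(
        db_path=env.get("SKYDECK_DB_PATH", "skydeck.db"),
        poll_interval_s=int(env.get("SKYDECK_POLL_INTERVAL", "30")),
        reconcile_interval_s=int(env.get("SKYDECK_RECONCILE_INTERVAL", "60")),
    )
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.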
Command line options:
python -m skydeck.run --host 0.0.0.0 --port 8000 --db-path /path/to/db.sqlite

# Install dev dependencies
uv pip install -e ".[dev]"
# Run tests
pytest
# Run with auto-reload
uv run uvicorn skydeck.app:app --reload

- Experiments are templates, jobs are instances - Like classes vs objects
- Flags are first-class - Dynamically typed, inferred from Pydantic
- SQLite for everything - Survives restarts, single dependency
- Automatic reconciliation - Set desired state, system does the rest
- Job history tracking - Never lose experiment outcomes
- One running job per experiment - Enforced by reconciler
MIT