A streaming service data aggregator that scrapes, validates, and stores movie and TV show information from platforms like JustWatch. Uses LLM-powered extraction (via Ollama) and TMDB for data validation and enrichment.
Run the application directly on your machine with hot-reload:
# 1. Copy environment file
cp .env.example .env
# 2. Install dependencies
make install
# 3. Install Playwright browsers (for scraping)
make playwright-install
# 4. Start PostgreSQL
make db-up
# 5. Run migrations
make upgrade
# 6. Start the API server (Terminal 1)
make dev
# 7. Start workers (Terminal 2)
make worker
# 8. Start scheduler (Terminal 3, optional)
make scheduler

Run everything in Docker with hot-reload:
# 1. Copy environment file
cp .env.example .env
# 2. Start all services (db + api + worker + scheduler)
make docker-dev
# View logs
make docker-dev-logs
# Stop services
make docker-dev-down

The API will be available at http://localhost:8000
API docs: http://localhost:8000/docs
| Command | Description |
|---|---|
| `make install` | Install dependencies with uv |
| `make dev` | Run FastAPI with hot-reload |
| `make worker` | Start background job workers |
| `make scheduler` | Start job scheduler (APScheduler) |
| `make db-up` | Start PostgreSQL container |
| `make db-down` | Stop PostgreSQL container |
| `make up` | Full startup (db + migrations + dev) |
| Command | Description |
|---|---|
| `make lint` | Run ruff linter with auto-fix |
| `make format` | Format code with ruff |
| `make typecheck` | Run ty type checker |
| `make check` | Run all checks (format + lint + typecheck) |
| `make hooks-install` | Install pre-commit hooks |
| `make hooks-uninstall` | Uninstall pre-commit hooks |
| Command | Description |
|---|---|
| `make test` | Run tests with pytest |
| `make test-cov` | Run tests with coverage report |
| Command | Description |
|---|---|
| `make migrate msg="description"` | Create new Alembic migration |
| `make upgrade` | Apply pending migrations |
| `make downgrade` | Rollback last migration |
| Command | Description |
|---|---|
| `make docker-build` | Build Docker images with latest tag |
| `make docker-build-tag` | Build with timestamp tag (YYYYMMDD-HHMMSS) |
| `make docker-up` | Start all containers (production mode) |
| `make docker-down` | Stop all containers |
| `make docker-logs` | Follow logs from all containers |
| `make docker-logs-api` | Follow API container logs |
| `make docker-logs-worker` | Follow worker container logs |
| `make docker-logs-scheduler` | Follow scheduler container logs |
| `make docker-dev` | Start dev environment with hot-reload |
| `make docker-dev-down` | Stop dev environment |
| `make docker-dev-logs` | Follow dev environment logs |
| `make logs-up` | Start logging stack (Loki + Grafana) |
| `make logs-up-dev` | Start logging stack for dev environment |
| `make logs-ui` | Open Grafana UI in browser |
| `make docker-deploy TAG=xxx` | Deploy with specific image tag |
| `make docker-recreate` | Force recreate containers |
| `make docker-prune` | Remove dangling Docker images |
streamvault/
├── app/
│ ├── core/ # Configuration and database setup
│ ├── models/ # SQLAlchemy models
│ ├── schemas/ # Pydantic schemas
│ ├── routers/ # API endpoints
│ ├── services/ # Business logic
│ ├── workers/ # Background job workers
│ └── migrations/ # Alembic migrations
├── tests/ # Test suite
├── docker-compose.yml # PostgreSQL service
├── Makefile # Development commands
└── pyproject.toml # Project configuration
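The layers map onto a conventional FastAPI layout: routers call into services, which work with the SQLAlchemy models. As a rough, hypothetical sketch of how a route in `routers/` might be wired to a service (module, function, and dependency names here are assumptions, not the actual code):

```python
# Hypothetical router/service wiring for this layout -- actual names in app/ may differ.
from fastapi import APIRouter, Depends
from sqlalchemy.ext.asyncio import AsyncSession

from app.core.database import get_session          # assumed session dependency
from app.services import scraped_shows as service  # assumed service module

router = APIRouter(prefix="/shows", tags=["shows"])

@router.get("/scraped")
async def list_scraped_shows(
    skip: int = 0,
    limit: int = 20,
    session: AsyncSession = Depends(get_session),
):
    # Routers stay thin: pagination params go straight to the service layer,
    # which owns the queries against the models in app/models/.
    return await service.list_scraped(session, skip=skip, limit=limit)
```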
- FastAPI - Web framework
- Pydantic - Data validation
- SQLAlchemy - Async ORM
- Alembic - Database migrations
- PostgreSQL - Database
- uv - Package manager
- ruff - Linter and formatter
- ty - Type checker
- pytest - Testing framework
- LangChain - LLM orchestration
- Ollama - Local LLM runtime
- Playwright - Browser automation
StreamVault requires an external Ollama instance for LLM-powered data extraction. Ollama runs separately on the host machine (not in Docker).
Install Ollama from ollama.com and pull the required model:
ollama pull qwen3:30b

Set the Ollama endpoint in your .env file:
# Local Ollama (production)
OLLAMA_HOST=http://localhost:11434
# Remote Ollama (development)
OLLAMA_HOST=http://10.0.0.139:11434
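For reference, a minimal sketch of what the LLM extraction call might look like through LangChain's Ollama integration (the prompt and the exact wiring inside app/services/ are assumptions):

```python
# Hypothetical sketch of the LLM extraction call -- not the project's actual code.
import os
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model=os.getenv("OLLAMA_MODEL", "qwen3:30b"),
    base_url=os.getenv("OLLAMA_HOST", "http://localhost:11434"),
)

# Ask the model to pull structured fields out of a scraped page snippet.
response = llm.invoke(
    "Extract the title, release year, and genres from this listing as JSON:\n"
    "<scraped HTML snippet here>"
)
print(response.content)
```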
All configuration is done via environment variables in .env:

| Variable | Default | Description |
|---|---|---|
| `POSTGRES_USER` | `postgres` | PostgreSQL username |
| `POSTGRES_PASSWORD` | `postgres` | PostgreSQL password |
| `POSTGRES_DB` | `streamvault` | Database name |
| `POSTGRES_HOST` | `localhost` | Database host |
| `POSTGRES_PORT` | `5432` | Database port |
| `APP_OLLAMA_HOST` | `http://localhost:11434` | Ollama API endpoint (use `http://host.docker.internal:11434` for Docker) |
| `OLLAMA_MODEL` | `qwen3:30b` | Default model for extraction |
| `TMDB_API_KEY` | - | TMDB API key (required for TMDB routes) |
| `QUEUE_WORKERS` | `2` | Number of worker tasks per process |
| `QUEUE_POLL_INTERVAL` | `1.0` | Seconds between queue polls |
| `SHARED_DIR` | `/app/data/shared` | Shared storage directory |
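Given the Pydantic-based stack, these variables are presumably loaded into a settings object at startup. A minimal, hypothetical sketch with pydantic-settings (field names mirror the table above; the real module under app/core/ may differ):

```python
# Hypothetical settings sketch using pydantic-settings -- the real config module may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    postgres_user: str = "postgres"
    postgres_password: str = "postgres"
    postgres_db: str = "streamvault"
    postgres_host: str = "localhost"
    postgres_port: int = 5432
    ollama_model: str = "qwen3:30b"
    queue_workers: int = 2
    queue_poll_interval: float = 1.0
    shared_dir: str = "/app/data/shared"

settings = Settings()
print(settings.postgres_host)  # reads POSTGRES_HOST from the environment or .env
```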
| Method | Endpoint | Description |
|---|---|---|
| POST | `/scraped/popular` | Scrape popular shows from a URL |
| POST | `/scraped/top-ten` | Scrape top 10 movies and series |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/shows/scraped` | Get paginated list of scraped shows |
| GET | `/shows/scraped/top-ten` | Get top 10 movies and series from latest batch |
| GET | `/shows/scraped/{id}` | Get a single scraped show by ID |
| Method | Endpoint | Description |
|---|---|---|
| POST | `/jobs` | Enqueue a new background job |
| GET | `/jobs` | List jobs (with optional status filter) |
| GET | `/jobs/{id}` | Get job status and result |
| POST | `/jobs/{id}/retry` | Retry a failed job |
| Method | Endpoint | Description |
|---|---|---|
| GET | `/tmdb/search/movies` | Search TMDB for movies by query |
| GET | `/tmdb/search/tv` | Search TMDB for TV series by query |
Scrape popular shows:
curl -X POST http://localhost:8000/scraped/popular \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.justwatch.com/us/movies",
"origin": "justwatch",
"max_items": 10,
"download_tile_images": false,
"download_cast_images": false,
"download_background_images": false
}'

Scrape top 10:
curl -X POST http://localhost:8000/scraped/top-ten \
-H "Content-Type: application/json" \
-d '{
"origin": "justwatch"
}'

Get paginated scraped shows:

curl "http://localhost:8000/shows/scraped?skip=0&limit=20"

Get top 10 movies and series:

curl http://localhost:8000/shows/scraped/top-ten

Get a single show by ID:

curl http://localhost:8000/shows/scraped/1

Enqueue a scrape job:
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{
"job_type": "scrape_top_ten",
"payload": {"origin": "justwatch"}
}'

Enqueue a popular scrape job:
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{
"job_type": "scrape_popular",
"payload": {
"origin": "justwatch",
"url": "https://www.justwatch.com/us/movies"
}
}'

Get job status:

curl http://localhost:8000/jobs/1

List pending jobs:

curl "http://localhost:8000/jobs?status=pending"

Retry a failed job:

curl -X POST http://localhost:8000/jobs/1/retry

Search for movies:

curl "http://localhost:8000/tmdb/search/movies?query=inception&page=1"

Search for TV series:

curl "http://localhost:8000/tmdb/search/tv?query=breaking%20bad&page=1"

Search with details:

curl "http://localhost:8000/tmdb/search/movies?query=inception&include_details=true"

The application includes a PostgreSQL-based job queue for executing long-running tasks in the background.
┌─────────────────┐ ┌─────────────────────┐
│ FastAPI API │ │ Worker Process │
│ │ │ │
│ POST /jobs ────┼──▶ │ ┌───────────────┐ │
│ GET /jobs/{id} │ │ │ Worker 1 │ │
│ │ │ │ Worker 2 │ │
└────────┬────────┘ │ │ ... │ │
│ │ └───────────────┘ │
▼ │ │ │
┌─────────┐ │ ▼ │
│ Postgres│◀────────┼── Poll & Process │
│ jobs │ │ │
└─────────┘ └─────────────────────┘
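The diagram's "Poll & Process" step is typically implemented, for a PostgreSQL-backed queue like this, by claiming one pending row per poll with FOR UPDATE SKIP LOCKED so concurrent workers never grab the same job. A hedged sketch of that pattern (the jobs table and column names are assumptions about the schema):

```python
# Hypothetical worker polling loop -- table and column names are assumptions.
import asyncio
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine("postgresql+asyncpg://postgres:postgres@localhost/streamvault")

CLAIM_SQL = text("""
    UPDATE jobs
       SET status = 'running'
     WHERE id = (
           SELECT id FROM jobs
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED)
    RETURNING id, job_type, payload
""")

async def poll_once() -> bool:
    """Claim one pending job and report it; return True if a job was found."""
    async with engine.begin() as conn:
        row = (await conn.execute(CLAIM_SQL)).first()
    if row is None:
        return False
    print(f"claimed job {row.id} ({row.job_type})")
    return True

async def main(poll_interval: float = 1.0) -> None:
    # Sleep only when the queue is empty, mirroring QUEUE_POLL_INTERVAL.
    while True:
        if not await poll_once():
            await asyncio.sleep(poll_interval)

# asyncio.run(main())
```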
Workers run as a separate process from the API:
# Terminal 1: Start API
make dev
# Terminal 2: Start workers (default: 2 workers)
make worker
# Or with custom worker count
QUEUE_WORKERS=4 make worker

| Job Type | Description |
|---|---|
| `scrape_top_ten` | Scrape top 10 movies and series |
| `scrape_popular` | Scrape popular shows from a URL |
| `validate_and_store` | Validate scraped data against TMDB and store |
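As an illustration of what the validate_and_store step might do, here is a rough sketch of a title lookup against TMDB's public search endpoint (the matching heuristic and function name are assumptions; the real logic lives in app/services/):

```python
# Hypothetical TMDB title lookup -- the real validation logic may differ.
import os
import httpx

TMDB_API_KEY = os.environ["TMDB_API_KEY"]

def find_tmdb_match(title: str, year: int | None = None) -> dict | None:
    """Return the best TMDB search result for a scraped title, or None."""
    params = {"api_key": TMDB_API_KEY, "query": title}
    if year is not None:
        params["year"] = year  # narrow movie results to a release year
    resp = httpx.get("https://api.themoviedb.org/3/search/movie", params=params)
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # Naive heuristic: prefer exact title matches, then take the most popular result.
    exact = [r for r in results if r.get("title", "").lower() == title.lower()]
    candidates = exact or results
    return max(candidates, key=lambda r: r.get("popularity", 0), default=None)

# Example: find_tmdb_match("Inception", 2010)
```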
| Variable | Default | Description |
|---|---|---|
| `QUEUE_WORKERS` | `2` | Number of worker tasks per process |
| `QUEUE_POLL_INTERVAL` | `1.0` | Seconds between queue polls |
The scheduler automatically enqueues jobs at scheduled times using APScheduler. It runs as a separate process and ensures scraping and validation tasks run twice daily.
Jobs are staggered to avoid conflicts and ensure scraping completes before validation:
| Time (Run 1) | Time (Run 2) | Job |
|---|---|---|
| 06:00 | 15:00 | Scrape top 10 |
| 06:30 | 15:30 | Scrape popular movies |
| 07:00 | 16:00 | Scrape popular series |
| 07:30 | 16:30 | Validate top 10 |
| 08:00 | 17:00 | Validate popular |
# Native development
make scheduler
# Docker development (included in docker-dev)
make docker-dev
# View scheduler logs
make docker-logs-scheduler

┌─────────────────────────────────────────────────────────────┐
│ Scheduler Process │
├─────────────────────────────────────────────────────────────┤
│ APScheduler (AsyncIOScheduler) │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Cron Triggers │ │
│ │ • 06:00/15:00 → scrape_top_ten │ │
│ │ • 06:30/15:30 → scrape_popular (movies) │ │
│ │ • 07:00/16:00 → scrape_popular (series) │ │
│ │ • 07:30/16:30 → validate_and_store (top_shows) │ │
│ │ • 08:00/17:00 → validate_and_store (popular_shows) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Enqueue Jobs │ │
│ │ to PostgreSQL │ │
│ └────────┬────────┘ │
└───────────────────────────┼─────────────────────────────────┘
│
▼
┌─────────────┐
│ Workers │ (process jobs)
└─────────────┘
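A minimal sketch of how these cron triggers could be registered with APScheduler; the enqueue helper below is a stand-in for whatever the real scheduler calls to insert a pending row into the jobs table:

```python
# Hypothetical scheduler wiring -- the enqueue helper and payload keys are assumptions.
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger

async def enqueue(job_type: str, payload: dict) -> None:
    # Stand-in: the real implementation inserts a 'pending' row into the jobs table.
    print(f"enqueue {job_type} {payload}")

async def main() -> None:
    scheduler = AsyncIOScheduler()
    # Two daily runs, staggered so scraping finishes before validation.
    scheduler.add_job(enqueue, CronTrigger(hour="6,15", minute=0),
                      args=["scrape_top_ten", {"origin": "justwatch"}])
    scheduler.add_job(enqueue, CronTrigger(hour="6,15", minute=30),
                      args=["scrape_popular", {"origin": "justwatch",
                                               "url": "https://www.justwatch.com/us/movies"}])
    scheduler.add_job(enqueue, CronTrigger(hour="7,16", minute=30),
                      args=["validate_and_store", {"source": "top_shows"}])
    # ...remaining triggers (popular series, popular validation) follow the same pattern.
    scheduler.start()
    await asyncio.Event().wait()  # keep the event loop alive

# asyncio.run(main())
```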
Images are tagged with timestamps in the format YYYYMMDD-HHMMSS:
# Build with auto-generated timestamp tag
make docker-build-tag
# Output: Built images with tag: 20251209-103045
# Build with latest tag (default)
make docker-build

# 1. Copy and configure environment
cp .env.example .env
# Edit .env with production values (strong passwords, real API keys, etc.)
# 2. Build production images
make docker-build-tag
# 3. Start all services
make docker-up
# View logs
make docker-logs

# Deploy using a specific tag
make docker-deploy TAG=20251209-103045

When you've built new images and want to update running containers:
# Option 1: Force recreate all services (recommended)
make docker-recreate
# Option 2: Recreate specific services only
docker compose up -d --force-recreate api worker scheduler
# Option 3: Full restart
make docker-down
make docker-up

# Remove dangling/unused images
make docker-prune
# More aggressive cleanup (removes all unused images)
docker image prune -a

┌──────────────────────────────────────────────────────────────────────┐
│ Docker Compose Stack │
├───────────────────────────────────────────────────────────────────────┤
│ ┌───────────────┐ ┌─────────────────┐ ┌───────────────┐ ┌──────────┐ │
│ │streamvault-api│ │streamvault-worker│ │streamvault- │ │streamvault│ │
│ │ (FastAPI) │ │ (Job Workers) │ │ scheduler │ │ -db │ │
│ │ Port: 8000 │ │ │ │ (APScheduler) │ │(PostgreSQL)│ │
│ └───────┬───────┘ └────────┬─────────┘ └───────┬───────┘ └─────┬────┘ │
│ │ │ │ │ │
│ └──────────────────┴───────────────────┴───────────────┘ │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ ./data/postgres │ (DB persistence) │
│ │ ./data/shared │ (Images/files) │
│ └───────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
| Container | Description | Network Mode |
|---|---|---|
| `streamvault-api` | FastAPI application server | host |
| `streamvault-worker` | Background job workers | host |
| `streamvault-scheduler` | APScheduler job scheduler | host |
| `streamvault-db` | PostgreSQL 16 database | bridge (port 5432) |
| Volume | Path | Purpose |
|---|---|---|
| `./data/postgres` | `/var/lib/postgresql/data` | Database files |
| `./data/shared` | `/app/data/shared` | Downloaded images, scraped files |
Create a .env file with production values:
# Database (use strong passwords in production)
POSTGRES_USER=streamvault
POSTGRES_PASSWORD=<strong-password>
POSTGRES_DB=streamvault
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
# Ollama LLM
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3:30b
# External APIs
TMDB_API_KEY=<your-tmdb-api-key>
# Worker Configuration
QUEUE_WORKERS=4
QUEUE_POLL_INTERVAL=1.0
# Storage
SHARED_DIR=/app/data/shared

# Stop all containers
make docker-down
# Stop and remove volumes (WARNING: deletes data)
docker compose down -v

The project includes a centralized logging stack using Loki and Grafana for log aggregation, persistence, and visualization.
┌─────────────────────────────────────────────────────────────────┐
│ Logging Stack │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Promtail │───▶│ Loki │◀───│ Grafana │ │
│ │ (collector) │ │ (storage) │ │ (UI) │ │
│ └──────┬──────┘ │ Port: 3100 │ │ Port: 3001 │ │
│ │ └─────────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Docker Container Logs │ │
│ │ streamvault-api | streamvault-worker | postgres │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
# Start logging services (without restarting existing containers)
make logs-up-dev # For development
make logs-up # For production
# Open Grafana UI
make logs-ui     # Opens http://localhost:3001

- Open Grafana at http://localhost:3001
- Go to Explore (compass icon in sidebar)
- Select Loki as the datasource
- Use LogQL queries to filter logs
| Query | Description |
|---|---|
| `{container_name="streamvault-api-dev"}` | All API logs |
| `{container_name="streamvault-worker-dev"}` | All worker logs |
| `{container_name=~"streamvault.*"}` | All StreamVault logs |
| `{service="api"} \|= "ERROR"` | API errors |
| `{service="worker"} \|= "job"` | Worker job-related logs |
| `{container_name=~"streamvault.*"} \|~ "(?i)error"` | Case-insensitive error search |
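Grafana is the intended UI, but Loki also exposes an HTTP query API, so the same LogQL expressions can be run programmatically. A small sketch against the query_range endpoint on the port shown above (result parsing kept minimal):

```python
# Query Loki's HTTP API directly with a LogQL expression.
import httpx

LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"

def tail_errors(limit: int = 20) -> None:
    params = {
        "query": '{container_name=~"streamvault.*"} |~ "(?i)error"',
        "limit": limit,   # max log lines to return
        "since": "1h",    # look back one hour
    }
    resp = httpx.get(LOKI_URL, params=params)
    resp.raise_for_status()
    # Each stream carries a list of [timestamp_ns, line] pairs.
    for stream in resp.json()["data"]["result"]:
        for timestamp_ns, line in stream["values"]:
            print(timestamp_ns, line)

# tail_errors()
```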
Logs are retained for 14 days by default. This can be configured in docker/loki/loki-config.yml:
limits_config:
  retention_period: 336h  # 14 days

| Volume | Path | Purpose |
|---|---|---|
| `./data/loki` | `/loki` | Log storage and indexes |
| `./data/grafana` | `/var/lib/grafana` | Grafana dashboards and settings |
