StreamVault

A streaming service data aggregator that scrapes, validates, and stores movie and TV show information from platforms like JustWatch. Uses LLM-powered extraction (via Ollama) and TMDB for data validation and enrichment.

Prerequisites

  • Python 3.13.5+
  • uv - Package manager
  • Docker - For PostgreSQL

Local Development

Option A: Native Development (Recommended for active development)

Run the application directly on your machine with hot-reload:

# 1. Copy environment file
cp .env.example .env

# 2. Install dependencies
make install

# 3. Install Playwright browsers (for scraping)
make playwright-install

# 4. Start PostgreSQL
make db-up

# 5. Run migrations
make upgrade

# 6. Start the API server (Terminal 1)
make dev

# 7. Start workers (Terminal 2)
make worker

# 8. Start scheduler (Terminal 3, optional)
make scheduler

Option B: Docker Development (Full containerized setup)

Run everything in Docker with hot-reload:

# 1. Copy environment file
cp .env.example .env

# 2. Start all services (db + api + worker + scheduler)
make docker-dev

# View logs
make docker-dev-logs

# Stop services
make docker-dev-down

The API will be available at http://localhost:8000

API docs: http://localhost:8000/docs
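
To verify the server is up without opening a browser, here is a quick check from Python (a minimal sketch; it assumes the httpx package is available in your environment):

import httpx

# FastAPI serves the interactive docs at /docs and the schema at /openapi.json
response = httpx.get("http://localhost:8000/openapi.json")
print(response.status_code)  # 200 means the API is running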

Development Commands

| Command | Description |
| --- | --- |
| make install | Install dependencies with uv |
| make dev | Run FastAPI with hot-reload |
| make worker | Start background job workers |
| make scheduler | Start job scheduler (APScheduler) |
| make db-up | Start PostgreSQL container |
| make db-down | Stop PostgreSQL container |
| make up | Full startup (db + migrations + dev) |

Code Quality

| Command | Description |
| --- | --- |
| make lint | Run ruff linter with auto-fix |
| make format | Format code with ruff |
| make typecheck | Run ty type checker |
| make check | Run all checks (format + lint + typecheck) |
| make hooks-install | Install pre-commit hooks |
| make hooks-uninstall | Uninstall pre-commit hooks |

Testing

| Command | Description |
| --- | --- |
| make test | Run tests with pytest |
| make test-cov | Run tests with coverage report |

Database

| Command | Description |
| --- | --- |
| make migrate msg="description" | Create new Alembic migration |
| make upgrade | Apply pending migrations |
| make downgrade | Rollback last migration |

Docker

| Command | Description |
| --- | --- |
| make docker-build | Build Docker images with latest tag |
| make docker-build-tag | Build with timestamp tag (YYYYMMDD-HHMMSS) |
| make docker-up | Start all containers (production mode) |
| make docker-down | Stop all containers |
| make docker-logs | Follow logs from all containers |
| make docker-logs-api | Follow API container logs |
| make docker-logs-worker | Follow worker container logs |
| make docker-logs-scheduler | Follow scheduler container logs |
| make docker-dev | Start dev environment with hot-reload |
| make docker-dev-down | Stop dev environment |
| make docker-dev-logs | Follow dev environment logs |
| make logs-up | Start logging stack (Loki + Grafana) |
| make logs-up-dev | Start logging stack for dev environment |
| make logs-ui | Open Grafana UI in browser |
| make docker-deploy TAG=xxx | Deploy with specific image tag |
| make docker-recreate | Force recreate containers |
| make docker-prune | Remove dangling Docker images |

Project Structure

streamvault/
├── app/
│   ├── core/           # Configuration and database setup
│   ├── models/         # SQLAlchemy models
│   ├── schemas/        # Pydantic schemas
│   ├── routers/        # API endpoints
│   ├── services/       # Business logic
│   ├── workers/        # Background job workers
│   └── migrations/     # Alembic migrations
├── tests/              # Test suite
├── docker-compose.yml  # PostgreSQL service
├── Makefile            # Development commands
└── pyproject.toml      # Project configuration

Tech Stack

  • FastAPI - Web framework
  • Pydantic - Data validation
  • SQLAlchemy - Async ORM
  • Alembic - Database migrations
  • PostgreSQL - Database
  • uv - Package manager
  • ruff - Linter and formatter
  • ty - Type checker
  • pytest - Testing framework
  • LangChain - LLM orchestration
  • Ollama - Local LLM runtime
  • Playwright - Browser automation

Ollama Setup

StreamVault requires an external Ollama instance for LLM-powered data extraction. Ollama runs separately on the host machine (not in Docker).

Installation

Install Ollama from ollama.com and pull the required model:

ollama pull qwen3:30b

Configuration

Set the Ollama endpoint in your .env file:

# Local Ollama (production)
OLLAMA_HOST=http://localhost:11434

# Remote Ollama (development)
OLLAMA_HOST=http://10.0.0.139:11434
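
Because the stack uses LangChain for orchestration, connecting to Ollama from Python typically looks like the sketch below (illustrative only; the prompt and environment handling are assumptions, not the project's actual extraction code):

import os

from langchain_ollama import ChatOllama

# Use the endpoint and model configured in .env, falling back to the defaults above
llm = ChatOllama(
    model=os.getenv("OLLAMA_MODEL", "qwen3:30b"),
    base_url=os.getenv("OLLAMA_HOST", "http://localhost:11434"),
    temperature=0,  # deterministic output suits structured extraction
)

# Simple connectivity check
print(llm.invoke("Reply with the single word: ok").content)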

Environment Variables

All configuration is done via environment variables in .env:

| Variable | Default | Description |
| --- | --- | --- |
| POSTGRES_USER | postgres | PostgreSQL username |
| POSTGRES_PASSWORD | postgres | PostgreSQL password |
| POSTGRES_DB | streamvault | Database name |
| POSTGRES_HOST | localhost | Database host |
| POSTGRES_PORT | 5432 | Database port |
| APP_OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint (use http://host.docker.internal:11434 for Docker) |
| OLLAMA_MODEL | qwen3:30b | Default model for extraction |
| TMDB_API_KEY | - | TMDB API key (required for TMDB routes) |
| QUEUE_WORKERS | 2 | Number of worker tasks per process |
| QUEUE_POLL_INTERVAL | 1.0 | Seconds between queue polls |
| SHARED_DIR | /app/data/shared | Shared storage directory |
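
These variables are typically loaded into a typed settings object at startup; a minimal pydantic-settings sketch is shown below (the field names mirror the table, but the project's actual app/core/config.py may be organized differently):

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Values are read from the environment or .env; defaults mirror the table above
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    postgres_user: str = "postgres"
    postgres_password: str = "postgres"
    postgres_db: str = "streamvault"
    postgres_host: str = "localhost"
    postgres_port: int = 5432
    ollama_model: str = "qwen3:30b"
    tmdb_api_key: str | None = None
    queue_workers: int = 2
    queue_poll_interval: float = 1.0
    shared_dir: str = "/app/data/shared"

settings = Settings()
print(settings.postgres_host)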

API Endpoints

Scrape Routes (/scraped)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /scraped/popular | Scrape popular shows from a URL |
| POST | /scraped/top-ten | Scrape top 10 movies and series |

Shows Routes (/shows)

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /shows/scraped | Get paginated list of scraped shows |
| GET | /shows/scraped/top-ten | Get top 10 movies and series from latest batch |
| GET | /shows/scraped/{id} | Get a single scraped show by ID |

Jobs Routes (/jobs)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /jobs | Enqueue a new background job |
| GET | /jobs | List jobs (with optional status filter) |
| GET | /jobs/{id} | Get job status and result |
| POST | /jobs/{id}/retry | Retry a failed job |

TMDB Routes (/tmdb)

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /tmdb/search/movies | Search TMDB for movies by query |
| GET | /tmdb/search/tv | Search TMDB for TV series by query |

API Examples

Scrape Endpoints

Scrape popular shows:

curl -X POST http://localhost:8000/scraped/popular \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.justwatch.com/us/movies",
    "origin": "justwatch",
    "max_items": 10,
    "download_tile_images": false,
    "download_cast_images": false,
    "download_background_images": false
  }'

Scrape top 10:

curl -X POST http://localhost:8000/scraped/top-ten \
  -H "Content-Type: application/json" \
  -d '{
    "origin": "justwatch"
  }'

Shows Endpoints

Get paginated scraped shows:

curl "http://localhost:8000/shows/scraped?skip=0&limit=20"

Get top 10 movies and series:

curl http://localhost:8000/shows/scraped/top-ten

Get a single show by ID:

curl http://localhost:8000/shows/scraped/1

Jobs Endpoints

Enqueue a scrape job:

curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "job_type": "scrape_top_ten",
    "payload": {"origin": "justwatch"}
  }'

Enqueue a popular scrape job:

curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "job_type": "scrape_popular",
    "payload": {
      "origin": "justwatch",
      "url": "https://www.justwatch.com/us/movies"
    }
  }'

Get job status:

curl http://localhost:8000/jobs/1

List pending jobs:

curl "http://localhost:8000/jobs?status=pending"

Retry a failed job:

curl -X POST http://localhost:8000/jobs/1/retry
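
The same flow works from Python. A small sketch using httpx that enqueues a job and polls until it finishes (the response field names and status values used here are assumptions based on the endpoint descriptions above):

import time

import httpx

BASE = "http://localhost:8000"

# Enqueue a top-ten scrape job
job = httpx.post(
    f"{BASE}/jobs",
    json={"job_type": "scrape_top_ten", "payload": {"origin": "justwatch"}},
).json()

# Poll until the job leaves the in-progress states
while True:
    job = httpx.get(f"{BASE}/jobs/{job['id']}").json()
    if job["status"] not in ("pending", "running"):
        break
    time.sleep(2)

print(job["status"], job.get("result"))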

TMDB Endpoints

Search for movies:

curl "http://localhost:8000/tmdb/search/movies?query=inception&page=1"

Search for TV series:

curl "http://localhost:8000/tmdb/search/tv?query=breaking%20bad&page=1"

Search with details:

curl "http://localhost:8000/tmdb/search/movies?query=inception&include_details=true"

Background Job Queue

The application includes a PostgreSQL-backed job queue for executing long-running tasks in the background.

Architecture

┌─────────────────┐     ┌─────────────────────┐
│   FastAPI API   │     │   Worker Process    │
│                 │     │                     │
│  POST /jobs ────┼──▶  │  ┌───────────────┐  │
│  GET /jobs/{id} │     │  │ Worker 1      │  │
│                 │     │  │ Worker 2      │  │
└────────┬────────┘     │  │ ...           │  │
         │              │  └───────────────┘  │
         ▼              │         │           │
    ┌─────────┐         │         ▼           │
    │ Postgres│◀────────┼── Poll & Process    │
    │  jobs   │         │                     │
    └─────────┘         └─────────────────────┘
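
Queues built on Postgres usually claim work with SELECT ... FOR UPDATE SKIP LOCKED so that concurrent workers never grab the same row. A minimal sketch of that claim step (illustrative; the table and column names are assumptions, not the project's actual schema):

import asyncio

from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://postgres:postgres@localhost:5432/streamvault"
)

async def claim_next_job():
    # Atomically claim the oldest pending job; SKIP LOCKED makes other
    # workers pass over rows that are already locked.
    async with engine.begin() as conn:
        result = await conn.execute(text("""
            UPDATE jobs
               SET status = 'running'
             WHERE id = (SELECT id FROM jobs
                          WHERE status = 'pending'
                          ORDER BY id
                          LIMIT 1
                            FOR UPDATE SKIP LOCKED)
            RETURNING id, job_type, payload
        """))
        return result.first()

print(asyncio.run(claim_next_job()))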

Running Workers

Workers run as a separate process from the API:

# Terminal 1: Start API
make dev

# Terminal 2: Start workers (default: 2 workers)
make worker

# Or with custom worker count
QUEUE_WORKERS=4 make worker

Job Types

| Job Type | Description |
| --- | --- |
| scrape_top_ten | Scrape top 10 movies and series |
| scrape_popular | Scrape popular shows from a URL |
| validate_and_store | Validate scraped data against TMDB and store |

Queue Configuration

| Variable | Default | Description |
| --- | --- | --- |
| QUEUE_WORKERS | 2 | Number of worker tasks per process |
| QUEUE_POLL_INTERVAL | 1.0 | Seconds between queue polls |

Job Scheduler

The scheduler automatically enqueues jobs at scheduled times using APScheduler. It runs as a separate process and ensures scraping and validation tasks run twice daily.

Schedule

Jobs are staggered to avoid conflicts and ensure scraping completes before validation:

| Time (Run 1) | Time (Run 2) | Job |
| --- | --- | --- |
| 06:00 | 15:00 | Scrape top 10 |
| 06:30 | 15:30 | Scrape popular movies |
| 07:00 | 16:00 | Scrape popular series |
| 07:30 | 16:30 | Validate top 10 |
| 08:00 | 17:00 | Validate popular |

Running the Scheduler

# Native development
make scheduler

# Docker development (included in docker-dev)
make docker-dev

# View scheduler logs
make docker-logs-scheduler

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Scheduler Process                       │
├─────────────────────────────────────────────────────────────┤
│  APScheduler (AsyncIOScheduler)                              │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  Cron Triggers                                       │    │
│  │  • 06:00/15:00 → scrape_top_ten                     │    │
│  │  • 06:30/15:30 → scrape_popular (movies)            │    │
│  │  • 07:00/16:00 → scrape_popular (series)            │    │
│  │  • 07:30/16:30 → validate_and_store (top_shows)     │    │
│  │  • 08:00/17:00 → validate_and_store (popular_shows) │    │
│  └─────────────────────────────────────────────────────┘    │
│                           │                                  │
│                           ▼                                  │
│                  ┌─────────────────┐                        │
│                  │  Enqueue Jobs   │                        │
│                  │  to PostgreSQL  │                        │
│                  └────────┬────────┘                        │
└───────────────────────────┼─────────────────────────────────┘
                            │
                            ▼
                    ┌─────────────┐
                    │  Workers    │ (process jobs)
                    └─────────────┘
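
The wiring behind this is a handful of cron triggers on an AsyncIOScheduler. A simplified sketch follows (the enqueue helper and payloads are placeholders, not the project's actual scheduler module):

import asyncio

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger

async def enqueue(job_type: str, payload: dict) -> None:
    # Placeholder: the real service inserts a row into the PostgreSQL jobs queue
    print(f"enqueue {job_type} {payload}")

async def main() -> None:
    scheduler = AsyncIOScheduler()

    # Two runs per day, staggered so scraping finishes before validation
    scheduler.add_job(enqueue, CronTrigger(hour="6,15", minute=0),
                      args=["scrape_top_ten", {"origin": "justwatch"}])
    scheduler.add_job(enqueue, CronTrigger(hour="7,16", minute=30),
                      args=["validate_and_store", {"target": "top_shows"}])

    scheduler.start()
    await asyncio.Event().wait()  # keep the process alive for the triggers

asyncio.run(main())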

Production Deployment

Building and Tagging Images

Images are tagged with timestamps in the format YYYYMMDD-HHMMSS:

# Build with auto-generated timestamp tag
make docker-build-tag
# Output: Built images with tag: 20251209-103045

# Build with latest tag (default)
make docker-build

Starting Services

# 1. Copy and configure environment
cp .env.example .env
# Edit .env with production values (strong passwords, real API keys, etc.)

# 2. Build production images
make docker-build-tag

# 3. Start all services
make docker-up

# View logs
make docker-logs

Deploying a Specific Version

# Deploy using a specific tag
make docker-deploy TAG=20251209-103045

Updating Running Containers

When you've built new images and want to update running containers:

# Option 1: Force recreate all services (recommended)
make docker-recreate

# Option 2: Recreate specific services only
docker compose up -d --force-recreate api worker scheduler

# Option 3: Full restart
make docker-down
make docker-up

Cleaning Up Old Images

# Remove dangling/unused images
make docker-prune

# More aggressive cleanup (removes all unused images)
docker image prune -a

Production Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                        Docker Compose Stack                            │
├───────────────────────────────────────────────────────────────────────┤
│  ┌───────────────┐ ┌─────────────────┐ ┌───────────────┐ ┌──────────┐ │
│  │streamvault-api│ │streamvault-worker│ │streamvault-   │ │streamvault│ │
│  │   (FastAPI)   │ │  (Job Workers)   │ │  scheduler    │ │    -db   │ │
│  │  Port: 8000   │ │                  │ │ (APScheduler) │ │(PostgreSQL)│ │
│  └───────┬───────┘ └────────┬─────────┘ └───────┬───────┘ └─────┬────┘ │
│          │                  │                   │               │      │
│          └──────────────────┴───────────────────┴───────────────┘      │
│                                     │                                   │
│                           ┌─────────▼─────────┐                        │
│                           │  ./data/postgres  │ (DB persistence)       │
│                           │  ./data/shared    │ (Images/files)         │
│                           └───────────────────┘                        │
└───────────────────────────────────────────────────────────────────────┘

Container Details

| Container | Description | Network Mode |
| --- | --- | --- |
| streamvault-api | FastAPI application server | host |
| streamvault-worker | Background job workers | host |
| streamvault-scheduler | APScheduler job scheduler | host |
| streamvault-db | PostgreSQL 16 database | bridge (port 5432) |

Data Persistence

| Volume | Path | Purpose |
| --- | --- | --- |
| ./data/postgres | /var/lib/postgresql/data | Database files |
| ./data/shared | /app/data/shared | Downloaded images, scraped files |

Production Environment Variables

Create a .env file with production values:

# Database (use strong passwords in production)
POSTGRES_USER=streamvault
POSTGRES_PASSWORD=<strong-password>
POSTGRES_DB=streamvault
POSTGRES_HOST=localhost
POSTGRES_PORT=5432

# Ollama LLM
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=qwen3:30b

# External APIs
TMDB_API_KEY=<your-tmdb-api-key>

# Worker Configuration
QUEUE_WORKERS=4
QUEUE_POLL_INTERVAL=1.0

# Storage
SHARED_DIR=/app/data/shared

Stopping Services

# Stop all containers
make docker-down

# Stop and remove volumes (WARNING: deletes data)
docker compose down -v

Centralized Logging

The project includes a centralized logging stack using Loki and Grafana for log aggregation, persistence, and visualization.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Logging Stack                                │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Promtail  │───▶│    Loki     │◀───│      Grafana        │  │
│  │ (collector) │    │  (storage)  │    │       (UI)          │  │
│  └──────┬──────┘    │ Port: 3100  │    │    Port: 3001       │  │
│         │           └─────────────┘    └─────────────────────┘  │
│         │                                                        │
│         ▼                                                        │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │              Docker Container Logs                       │    │
│  │   streamvault-api  |  streamvault-worker  |  postgres   │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Starting the Logging Stack

# Start logging services (without restarting existing containers)
make logs-up-dev    # For development
make logs-up        # For production

# Open Grafana UI
make logs-ui        # Opens http://localhost:3001

Accessing Logs

  1. Open Grafana at http://localhost:3001
  2. Go to Explore (compass icon in sidebar)
  3. Select Loki as the datasource
  4. Use LogQL queries to filter logs

LogQL Query Examples

| Query | Description |
| --- | --- |
| {container_name="streamvault-api-dev"} | All API logs |
| {container_name="streamvault-worker-dev"} | All worker logs |
| {container_name=~"streamvault.*"} | All StreamVault logs |
| {service="api"} \|= "ERROR" | API errors |
| {service="worker"} \|= "job" | Worker job-related logs |
| {container_name=~"streamvault.*"} \|~ "(?i)error" | Case-insensitive error search |

Log Retention

Logs are retained for 14 days by default. This can be configured in docker/loki/loki-config.yml:

limits_config:
  retention_period: 336h # 14 days

Data Persistence

| Volume | Path | Purpose |
| --- | --- | --- |
| ./data/loki | /loki | Log storage and indexes |
| ./data/grafana | /var/lib/grafana | Grafana dashboards and settings |
