PM-Orderbook-Data-Collection

Real-time L2 orderbook data ingestion for Kalshi and Polymarket. Captures every tick, compresses 20x, enables timestamp-accurate backtesting.

See Developer Environment for setup details.


How It Works

WebSocket (1ms) → Normalize (1ms) → Kafka (5ms) → Compress (async) → Parquet (disk) → Query API (50ms)

1. Ingest

WebSocket connectors subscribe to the Kalshi and Polymarket orderbook feeds. Every orderbook change (insert/update/delete) is captured with a nanosecond timestamp.
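
A minimal sketch of what a connector loop could look like, assuming the third-party websockets library; the URL and subscription payload are placeholders, not the real Kalshi/Polymarket endpoints:

import asyncio
import json
import time

import websockets  # third-party: pip install websockets


async def ingest(url: str, subscribe_msg: dict, out_queue: asyncio.Queue) -> None:
    """Connect to an exchange WebSocket feed and tag every message
    with a local receive timestamp in nanoseconds."""
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps(subscribe_msg))
        async for raw in ws:
            out_queue.put_nowait({
                "recv_ts_ns": time.time_ns(),  # nanosecond receive timestamp
                "raw": raw,                    # untouched exchange payload
            })


# Hypothetical usage; endpoint and channel name are illustrative:
# asyncio.run(ingest("wss://example.exchange/ws",
#                    {"cmd": "subscribe", "channel": "orderbook_delta"},
#                    asyncio.Queue()))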

2. Normalize

Exchange-specific formats are converted into a unified schema. Prices and quantities are validated, sequence gaps are detected, and each event is assigned an event ID.
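
The unified schema is not spelled out in this README; the dataclass below is an illustrative guess at the fields implied above (event ID, price/quantity, exchange sequence number, nanosecond timestamp). The price range check assumes prediction-market prices in [0, 1]:

from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    INSERT = "insert"
    UPDATE = "update"
    DELETE = "delete"


@dataclass(frozen=True)
class OrderbookEvent:
    event_id: int          # assigned at normalization, monotonically increasing
    exchange: str          # "kalshi" or "polymarket"
    market_id: str
    event_type: EventType
    side: str              # "bid" or "ask"
    price: float
    quantity: float
    exchange_seq: int      # exchange sequence number, used for gap detection
    recv_ts_ns: int        # nanosecond receive timestamp from the connector


def validate(event: OrderbookEvent, last_seq: int) -> None:
    """Basic sanity checks; a sequence gap means messages were missed upstream."""
    if not 0.0 <= event.price <= 1.0:
        raise ValueError(f"price out of range: {event.price}")
    if event.quantity < 0:
        raise ValueError(f"negative quantity: {event.quantity}")
    if last_seq and event.exchange_seq != last_seq + 1:
        raise ValueError(f"sequence gap: {last_seq} -> {event.exchange_seq}")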

3. Buffer

Events are written to Kafka, a persistent queue, so a storage crash causes no data loss. Kafka retains seven days of events.
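
A sketch of the buffer step using the kafka-python client; the broker address, topic name, and acks setting are assumptions rather than the project's actual configuration. The seven-day window above is a topic-level retention setting, not a producer option:

import json

from kafka import KafkaProducer  # third-party: pip install kafka-python

# acks="all" waits for the full in-sync replica set, so an acknowledged
# event survives a broker or downstream storage crash.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)


def buffer_event(event: dict) -> None:
    # Key by market so all events for one market land on one partition,
    # preserving per-market ordering.
    producer.send(
        "orderbook.events",
        key=event["market_id"].encode("utf-8"),
        value=event,
    )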

4. Compress

  • Full snapshot every x minutes
  • Deltas (changes only) between snapshots
  • Delta encoding + Zstandard + Parquet columnar
  • Result: 95% size reduction (20x compression)
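
A rough sketch of the snapshot+delta scheme, assuming an in-memory book keyed by (side, price) and an illustrative snapshot interval (the actual cadence is configurable):

import time


class SnapshotDeltaWriter:
    """Emit a full book snapshot on a fixed interval and deltas in between."""

    def __init__(self, snapshot_interval_ns: int = 300 * 10**9):  # illustrative: 5 minutes
        self.snapshot_interval_ns = snapshot_interval_ns
        self.last_snapshot_ts_ns = 0
        self.book: dict[tuple[str, float], float] = {}  # (side, price) -> quantity

    def on_event(self, event: dict) -> dict:
        # Apply the change to the in-memory book.
        key = (event["side"], event["price"])
        if event["quantity"] == 0:
            self.book.pop(key, None)   # deleted or emptied level
        else:
            self.book[key] = event["quantity"]

        now_ns = time.time_ns()
        if now_ns - self.last_snapshot_ts_ns >= self.snapshot_interval_ns:
            self.last_snapshot_ts_ns = now_ns
            # Full snapshot: every resting level, used as a replay anchor.
            return {"kind": "snapshot", "ts_ns": now_ns, "levels": dict(self.book)}
        # Delta: only the single changed level.
        return {"kind": "delta", "ts_ns": now_ns, "side": event["side"],
                "price": event["price"], "quantity": event["quantity"]}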

5. Store

Partitioned Parquet files: /data/{exchange}/date={YYYY-MM-DD}/market={id}/
SQLite index tracks snapshot locations and sequence ranges.
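
A sketch of writing one batch of normalized events (as plain dicts) into that partition layout with pyarrow and recording it in the SQLite index; the column names and index schema are illustrative:

import sqlite3
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def store_batch(exchange: str, date: str, market_id: str,
                events: list[dict], index_db: str = "index.sqlite") -> Path:
    """Write a batch to the partition layout and index its sequence range."""
    out_dir = Path(f"data/{exchange}/date={date}/market={market_id}")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{events[0]['exchange_seq']}.parquet"

    table = pa.Table.from_pylist(events)             # columnar layout
    pq.write_table(table, path, compression="zstd")  # Zstandard-compressed Parquet

    with sqlite3.connect(index_db) as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS batches "
            "(path TEXT, first_seq INTEGER, last_seq INTEGER, first_ts_ns INTEGER)"
        )
        db.execute(
            "INSERT INTO batches VALUES (?, ?, ?, ?)",
            (str(path), events[0]["exchange_seq"], events[-1]["exchange_seq"],
             events[0]["recv_ts_ns"]),
        )
    return path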

6. Reconstruct

To get orderbook at timestamp T:

  • Load nearest snapshot before T
  • Apply deltas chronologically until T
  • Return exact orderbook state
  • Performance: 50-100ms per query
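
A sketch of that replay, assuming snapshot and delta records shaped like the compression sketch above:

def reconstruct(snapshots: list[dict], deltas: list[dict], t_ns: int) -> dict:
    """Rebuild the orderbook at timestamp t_ns from a snapshot plus deltas."""
    # 1. Nearest snapshot at or before T.
    base = max(
        (s for s in snapshots if s["ts_ns"] <= t_ns),
        key=lambda s: s["ts_ns"],
    )
    book = dict(base["levels"])  # (side, price) -> quantity

    # 2. Apply deltas chronologically until T.
    for d in sorted(deltas, key=lambda d: d["ts_ns"]):
        if not base["ts_ns"] < d["ts_ns"] <= t_ns:
            continue
        key = (d["side"], d["price"])
        if d["quantity"] == 0:
            book.pop(key, None)
        else:
            book[key] = d["quantity"]

    # 3. Exact book state at T.
    return book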

Architecture

Modular monolith: a single process with clean component boundaries, so individual components can be extracted into separate services later if needed.

Components:

  • Connectors: Kalshi and Polymarket WebSocket clients
  • Normalization: Parser and Validator for unified schema
  • Storage: Orderbook State manager, Snapshots, and Deltas

Data Flow: Connectors → Normalization → Storage
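
A minimal sketch of how a single process might wire those boundaries together through in-memory queues (no network hops between stages); the component interfaces shown are hypothetical:

import asyncio


async def run_pipeline(connector, normalizer, storage) -> None:
    """Wire Connectors -> Normalization -> Storage inside one process."""
    raw_q: asyncio.Queue = asyncio.Queue()
    norm_q: asyncio.Queue = asyncio.Queue()

    async def normalize_stage():
        while True:
            raw = await raw_q.get()
            norm_q.put_nowait(normalizer.parse(raw))

    async def storage_stage():
        while True:
            event = await norm_q.get()
            storage.write(event)

    await asyncio.gather(
        connector.run(raw_q),   # Connectors feed raw_q
        normalize_stage(),      # Normalization feeds norm_q
        storage_stage(),        # Storage drains norm_q
    )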

Repository Structure

  • src/connectors/ - Kalshi/Polymarket WebSocket
  • src/normalization/ - Parse & validate
  • src/compression/ - Snapshot + delta logic
  • src/storage/ - Parquet writer + index
  • src/reconstruction/ - Query & replay
  • src/main.py - Entry point
  • tests/ - Unit + integration tests
  • scripts/ - Ops tools (gap recovery, etc)
  • config/ - YAML configuration
  • data/ - Local storage (gitignored)

Timeline (1 Month)

  • Week 1: Both connectors + Kafka + raw storage
  • Week 2: Compression (snapshot+delta) + Parquet
  • Week 3: Reconstruction + Query API
  • Week 4: Monitoring + validation + production ready

Key Metrics

  • events_per_second: Should be consistent (100-1000/sec)
  • ingestion_latency_p99: Should be < 100ms
  • sequence_gaps_detected: Should be 0
  • compression_ratio: Should be > 10x
  • storage_growth_gb_per_day: Should be < 10 GB
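
A small illustrative check of those targets; the metric names follow the list above and the thresholds are the stated targets, but the alerting mechanism itself is an assumption:

THRESHOLDS = {
    "events_per_second_min": 100,
    "ingestion_latency_p99_ms_max": 100,
    "sequence_gaps_detected_max": 0,
    "compression_ratio_min": 10.0,
    "storage_growth_gb_per_day_max": 10.0,
}


def check_metrics(m: dict) -> list[str]:
    """Return alert strings for any metric outside its target."""
    alerts = []
    if m["events_per_second"] < THRESHOLDS["events_per_second_min"]:
        alerts.append(f"low throughput: {m['events_per_second']}/s")
    if m["ingestion_latency_p99"] > THRESHOLDS["ingestion_latency_p99_ms_max"]:
        alerts.append(f"slow ingestion: p99={m['ingestion_latency_p99']}ms")
    if m["sequence_gaps_detected"] > THRESHOLDS["sequence_gaps_detected_max"]:
        alerts.append(f"sequence gaps: {m['sequence_gaps_detected']}")
    if m["compression_ratio"] < THRESHOLDS["compression_ratio_min"]:
        alerts.append(f"weak compression: {m['compression_ratio']}x")
    if m["storage_growth_gb_per_day"] > THRESHOLDS["storage_growth_gb_per_day_max"]:
        alerts.append(f"storage growth: {m['storage_growth_gb_per_day']} GB/day")
    return alerts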

Query API

# Get orderbook at specific timestamp
GET /api/v1/orderbook/{exchange}/{market}?timestamp={unix_ns}

# Stream for backtesting
POST /api/v1/backtest/stream
{
  "market_id": "PRES2028",
  "start": 1728700000000000000,
  "end": 1728800000000000000
}
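
An illustrative client for both endpoints using requests; the host is an assumption, and the exchange, market ID, and timestamps are taken from the examples above:

import requests  # third-party: pip install requests

BASE = "http://localhost:8000"  # service address is an assumption

# Point-in-time lookup: orderbook state at a nanosecond timestamp.
book = requests.get(
    f"{BASE}/api/v1/orderbook/kalshi/PRES2028",
    params={"timestamp": 1728700000000000000},
).json()

# Backtest stream: replay every event between start and end.
resp = requests.post(
    f"{BASE}/api/v1/backtest/stream",
    json={
        "market_id": "PRES2028",
        "start": 1728700000000000000,
        "end": 1728800000000000000,
    },
)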

Design Decisions
  • Monolith: Latency critical, small team, can split later
  • Kafka: Persistent queue, survives crashes, replay capability
  • Snapshot+Delta: Balance compression vs reconstruction speed
  • Parquet: Columnar = best compression + fast queries
  • No Redis: Not needed, backtesting queries are unique
  • No caching: Queries never repeat (different timestamps)

Developer Environment

We use Anaconda for the Python runtime environment. Download the distribution for your platform; Linux distributions are also available for developer runtimes and virtual machines. Do not use conda for production; use a global environment there. Distribution download page: https://www.anaconda.com/download/success

Set up the conda environment for this project by running the following command from the root of the project:

conda env create -f environment.yml
# pm is the name of the environment
conda activate pm

When adding a new dependency—whether via Conda or pip—be sure to update the appropriate section in environment.yml.
