A pipeline for extracting, canonicalizing, and deduplicating atomic facts from video transcripts, keyframe captions, and narratives.
This system processes text blobs from video content and produces canonical, cross-video facts with:
- Stable fact IDs (SHA1 of canonical string)
- Entity linking to Wikidata QIDs (or deterministic local IDs)
- Predicate mapping to FrameNet/VerbAtlas frames
- Normalized time and quantity qualifiers
- Verification scores against local evidence
- Near-duplicate detection and clustering across videos/channels
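As a rough illustration of the first property, a fact ID can be derived by hashing the canonical string. The helper below is a hedged sketch of the idea, not the project's actual implementation; the exact canonicalization rules live in the pipeline:

```python
import hashlib

def fact_id(canonical_string: str) -> str:
    """Stable fact ID: SHA1 of the canonical string (sketch)."""
    return hashlib.sha1(canonical_string.encode("utf-8")).hexdigest()

# Example canonical string in the format shown in the output schema below
canonical = "S:Qxxx|P:Assemble|O:Qyyy|Q:{}"
print(fact_id(canonical))  # deterministic across runs and machines
```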
```bash
# Build Docker images
make build

# Start database and microservices
make up

# Run pipeline with toy data
make run-demo
```

Inspect the results:

```bash
# Connect to database
make psql
```

```sql
-- In psql:
SELECT * FROM facts LIMIT 5;
SELECT * FROM clusters;
```

```bash
# View statistics
make stats
```

To run on your own data:

```bash
# Place your JSONL file in data/
make run BLOBS=data/my_blobs.jsonl
```

Input blobs should be in JSONL format (one JSON object per line):

```json
{
"blob_id": "uuid",
"video_id": "vid_123",
"channel_id": "ch_42",
"source": "transcript|keyframe|narrative",
"text": "Alice assembles the drone with a Torx T20 bit.",
"t_start": 310.12,
"t_end": 328.48,
"frames": ["f_12_003", "f_12_007"]
}Canonical facts are stored in PostgreSQL with structure:
Canonical facts are stored in PostgreSQL with the following structure:

```json
{
"fact_id": "sha1-of-canonical-string",
"video_id": "vid_123",
"channel_id": "ch_42",
"subject": {"qid": "Qxxx", "surface": "Alice", "el_conf": 0.93},
"predicate": {"frame": "Assemble", "sense": "assemble.01"},
"object": {"qid": "Qyyy", "surface": "drone", "el_conf": 0.81},
"qualifiers": {
"instrument": {"qid": "Qzzz", "surface": "Torx T20 bit"},
"temperature": {"value": 180, "unit": "celsius"}
},
"evidence": {
"blob_id": "uuid",
"sent_id": "vid_123:b42:s17",
"char_span": [215, 278]
},
"conf": {"el": 0.93, "srl": 0.88, "verify_support": 0.86},
"canonical_string": "S:Qxxx|P:Assemble|O:Qyyy|Q:{...}"
}
```

The pipeline processes blobs through the following stages:

```text
blob.jsonl
│
▼
┌─────────────┐
│ Ingest │ Read & validate blobs
└─────┬───────┘
│
▼
┌─────────────┐
│ Preprocess │ Normalize text, split sentences
└─────┬───────┘
│
▼
┌─────────────┐
│ SRL │ Extract predicate-argument structures
└─────┬───────┘
│
▼
┌─────────────┐
│ Canonicalize│ Link entities, map predicates, normalize literals
└─────┬───────┘
│
▼
┌─────────────┐
│ Verify │ Score support with local NLI
└─────┬───────┘
│
▼
┌─────────────┐
│ Dedup │ MinHash LSH + embedding tie-break
└─────┬───────┘
│
▼
┌─────────────┐
│ Store │ PostgreSQL + Qdrant
└─────────────┘
```
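In code, the stages chain naturally. The sketch below is a toy, runnable stand-in for the real modules under src/: every function is a trivial placeholder (SRL output is hardcoded, and the verify/store stages are omitted), so only the data flow is real:

```python
# Toy, runnable sketch of the stage chain; the real implementations live
# in src/ and are driven by src/cli.py.
def ingest(path):
    return [{"blob_id": "b1", "text": "Alice assembles the drone."}]

def preprocess(blobs):
    # normalize text, split sentences
    return [(b["blob_id"], s.strip())
            for b in blobs for s in b["text"].split(".") if s.strip()]

def extract_srl(sentences):
    # pretend SRL output: (blob_id, predicate, A0, A1)
    return [(bid, "assemble", "Alice", "drone") for bid, _ in sentences]

def canonicalize(structs):
    return [f"S:{a0}|P:{pred.title()}|O:{a1}" for _, pred, a0, a1 in structs]

def dedup(facts):
    return sorted(set(facts))

print(dedup(canonicalize(extract_srl(preprocess(ingest("data/blobs.jsonl"))))))
```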
The pipeline is built from the following components.

**SRL (Semantic Role Labeling)**

- Primary: AllenNLP transformer-srl
- Runs as microservice on port 8001
- Returns predicate and arguments (A0, A1, A2, etc.)
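Calling the service might look like this; the endpoint comes from `configs/app.yaml`, but the payload and response fields (`sentence`, `frames`, `predicate`, `arguments`) are assumptions for illustration, so check the service code for the actual contract:

```python
import requests

# Endpoint from configs/app.yaml; request/response shapes are assumed here.
resp = requests.post(
    "http://srl:8001/predict",
    json={"sentence": "Alice assembles the drone with a Torx T20 bit."},
    timeout=30,
)
resp.raise_for_status()
for frame in resp.json().get("frames", []):   # assumed response field
    print(frame.get("predicate"), frame.get("arguments"))
```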
**Entity Linking**

- Primary: REL (Radboud Entity Linker)
- Fallback: BLINK, Bootleg, or deterministic local IDs
- Links surface forms to Wikidata QIDs
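When no QID can be linked with sufficient confidence, a deterministic local ID keeps facts joinable across runs. One way to derive such an ID (a sketch, not the pipeline's exact scheme):

```python
import hashlib

def local_entity_id(surface: str) -> str:
    """Deterministic local ID for an unlinked surface form (sketch)."""
    norm = " ".join(surface.lower().split())  # normalize case and whitespace
    digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()[:12]
    return f"LOCAL:{digest}"

print(local_entity_id("Torx T20 bit"))  # same input -> same ID, every run
```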
**Temporal Normalization**

- Primary: HeidelTime
- Runs as microservice on port 8002
- Normalizes temporal expressions to ISO 8601
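A call to the service might look like the following; note that the host, path, and payload here are hypothetical, so consult the service wrapper for the real interface:

```python
import requests

# Hypothetical endpoint and fields for the HeidelTime wrapper on port 8002.
resp = requests.post(
    "http://heideltime:8002/normalize",
    json={"text": "We filmed it last Tuesday.", "reference_date": "2024-05-03"},
    timeout=30,
)
print(resp.json())  # expected to contain ISO 8601 values such as "2024-04-30"
```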
**Quantity Normalization**

- Library: quantulum3 + Pint
- Converts to canonical units (celsius, millimeter, gram, second)
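For example, quantulum3 can pull quantities out of raw text and Pint can convert them to the canonical units; this sketch uses both libraries' public APIs:

```python
from quantulum3 import parser
import pint

# Extract quantities from free text
for quant in parser.parse("Heat the part to 180 degrees Celsius for 3 minutes."):
    print(quant.surface, "->", quant.value, quant.unit.name)

# Convert a duration to the canonical unit (second)
ureg = pint.UnitRegistry()
duration = 3 * ureg.minute
print(duration.to(ureg.second))  # 180 second
```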
**Verification**

- Method: FEVER/FActScore style
- Uses sentence-transformers for retrieval
- roberta-large-mnli for NLI scoring
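Once evidence sentences are retrieved, scoring a fact reduces to an NLI entailment probability. A minimal sketch with the named model (the retrieval step via sentence-transformers is omitted here):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

evidence = "Alice assembles the drone with a Torx T20 bit."
claim = "Alice assembles a drone."

inputs = tok(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# roberta-large-mnli label order: contradiction, neutral, entailment
support = probs[2].item()
print(f"verify_support = {support:.2f}")  # compared against min_support
```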
**Deduplication**

- Method: MinHash LSH + embedding fallback
- datasketch for MinHash
- sentence-transformers for semantic similarity
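The LSH step can be sketched with datasketch directly; candidate pairs returned by the index would then be re-checked with embedding cosine similarity (omitted here):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # jaccard_equiv from config
facts = {
    "f1": "alice assembles the drone with a torx t20 bit",
    "f2": "with a torx t20 bit alice assembles the drone",  # same token set
    "f3": "bob measures the wingspan of the aircraft",
}
for fid, text in facts.items():
    lsh.insert(fid, minhash(text))

print(lsh.query(minhash(facts["f1"])))  # candidates for f1, including itself
```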
Edit `configs/app.yaml` to customize the pipeline:

```yaml
srl:
  endpoint: "http://srl:8001/predict"

entity_linking:
  min_conf_accept: 0.75   # Minimum confidence for QID linking

verification:
  enable: true
  min_support: 0.6        # Minimum NLI support score

dedup:
  minhash:
    n_perm: 128
    bands: 32
  jaccard_equiv: 0.85
  embed_cosine_min: 0.84
```
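The pipeline reads this file at startup; for scripts or debugging you can load it yourself (the key paths below mirror the sample above, though the pipeline's own loader may differ):

```python
import yaml

with open("configs/app.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)

print(cfg["srl"]["endpoint"])             # http://srl:8001/predict
print(cfg["dedup"]["minhash"]["n_perm"])  # 128
```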
Required models and data:

- **REL Entity Linking Data** (if using REL): download from https://github.com/informagi/REL and place in `/data/rel/`
- **Sentence Transformers** (auto-downloaded): `sentence-transformers/all-MiniLM-L6-v2`
- **NLI Model** (auto-downloaded): `roberta-large-mnli`
- **HeidelTime** (optional): download from https://github.com/HeidelTime/heideltime and place the jar in `resources/heideltime/`
Run the test suite:

```bash
# In Docker
make test

# Locally
make test-local
```

To run the pipeline outside Docker:

```bash
# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

# Run CLI
python src/cli.py --blobs data/blobs.jsonl --config configs/app.yaml
```

If REL data is missing, the system falls back to deterministic local IDs:

```text
WARNING: REL not installed, using stub entity linker
```

This is fine for testing. For production, download the REL indices.
If HeidelTime fails:

```text
WARNING: HeidelTime request failed: ...
```

Check that the jar is at `resources/heideltime/heideltime-standalone.jar`, or disable it:

```yaml
timex:
  mode: "none"   # Disable HeidelTime
```

If the NLI model runs out of memory:
- Reduce the batch size
- Use a smaller model:
  ```yaml
  verification:
    nli_model: "distilroberta-base"
  ```
- Disable verification:
  ```yaml
  verification:
    enable: false
  ```
Ensure PostgreSQL is running:

```bash
docker compose -f docker/docker-compose.yml up -d db
docker compose -f docker/docker-compose.yml logs db
```

If SRL is slow to respond:

- Increase the timeout in `src/srl/client.py`
- Check the SRL service logs: `make logs-srl`
Generate LLM-based test data for validation:

```bash
# Generate 20 challenging test blobs
make generate-test-data N=20 CHALLENGING=1

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python scripts/generate_test_data.py --count 20 --challenging
```

Test data is saved to `resources/generated_data/XXXX/` with:

- `blobs.jsonl` - Test blobs
- `metadata.json` - Generation metadata
Run comprehensive tests on saved datasets:

```bash
# Test all saved datasets
make test-harness SCAN=1

# Test specific dataset
make test-harness DATASET=0001

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python scripts/test_harness.py --scan --run-pipeline --analyze
```

Results are saved to:

- `test_results/test_results.json` - Summary
- `resources/generated_data/XXXX/analysis.json` - Per-dataset analysis
Run pytest unit tests:

```bash
make test

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python -m pytest tests/ -v
```

A typical end-to-end testing workflow:

- Generate test data: `make generate-test-data N=50 CHALLENGING=1`
- Run the pipeline: `make test-harness SCAN=1`
- Review results: `cat test_results/test_results.json | python -m json.tool`

See TEST_DATA_WORKFLOW.md and QUICK_TEST_GUIDE.md for detailed testing documentation.
Full CLI invocation:

```bash
python src/cli.py \
    --blobs /path/to/blobs.jsonl \
    --config configs/app.yaml \
    --init-db \
    --log-level DEBUG
```

See `src/storage/ddl.sql` for the complete schema. Key tables:

- `entities` - Linked entities with QIDs/local IDs
- `videos` - Video metadata
- `facts` - Canonical facts with embeddings
- `fact_minhash` - LSH band hashes
- `clusters` - Deduplicated fact clusters
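For ad-hoc inspection from Python, something like the following works against a local compose setup; the connection parameters and any columns beyond `fact_id`/`video_id` are assumptions, so check `ddl.sql` for the real definitions:

```python
import psycopg2

# Connection parameters are assumptions for a local docker compose setup.
conn = psycopg2.connect(host="localhost", dbname="facts",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT fact_id, video_id FROM facts LIMIT 5;")
    for fact_id, video_id in cur.fetchall():
        print(fact_id, video_id)
conn.close()
```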
MIT License
- Fork the repository
- Create a feature branch
- Run tests: `make test`
- Submit a pull request