A pipeline for extracting, canonicalizing, and deduplicating atomic facts from video transcripts, keyframe captions, and narratives.
This system processes text blobs from video content and produces canonical, cross-video facts with:
- Stable fact IDs (SHA1 of canonical string)
- Entity linking to Wikidata QIDs (or deterministic local IDs)
- Predicate mapping to FrameNet/VerbAtlas frames
- Normalized time and quantity qualifiers
- Verification scores against local evidence
- Near-duplicate detection and clustering across videos/channels
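As a rough illustration of the first property, a fact ID can be derived by hashing the canonical string. The helper below is a hedged sketch of the idea, not the project's actual implementation; the exact canonicalization rules live in the pipeline:

```python
import hashlib

def fact_id(canonical_string: str) -> str:
    """Stable fact ID: SHA1 of the canonical string (sketch)."""
    return hashlib.sha1(canonical_string.encode("utf-8")).hexdigest()

# Example canonical string in the format shown in the output schema below
canonical = "S:Qxxx|P:Assemble|O:Qyyy|Q:{}"
print(fact_id(canonical))  # deterministic across runs and machines
```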
```bash
# Build Docker images
make build

# Start database and microservices
make up

# Run pipeline with toy data
make run-demo
```

Inspect the results:

```bash
# Connect to database
make psql
```

```sql
-- In psql:
SELECT * FROM facts LIMIT 5;
SELECT * FROM clusters;
```

```bash
# View statistics
make stats
```

To run on your own data:

```bash
# Place your JSONL file in data/
make run BLOBS=data/my_blobs.jsonl
```

Input blobs should be in JSONL format (one JSON object per line):

```json
{
"blob_id": "uuid",
"video_id": "vid_123",
"channel_id": "ch_42",
"source": "transcript|keyframe|narrative",
"text": "Alice assembles the drone with a Torx T20 bit.",
"t_start": 310.12,
"t_end": 328.48,
"frames": ["f_12_003", "f_12_007"]
}Canonical facts are stored in PostgreSQL with structure:
Canonical facts are stored in PostgreSQL with the following structure:

```json
{
"fact_id": "sha1-of-canonical-string",
"video_id": "vid_123",
"channel_id": "ch_42",
"subject": {"qid": "Qxxx", "surface": "Alice", "el_conf": 0.93},
"predicate": {"frame": "Assemble", "sense": "assemble.01"},
"object": {"qid": "Qyyy", "surface": "drone", "el_conf": 0.81},
"qualifiers": {
"instrument": {"qid": "Qzzz", "surface": "Torx T20 bit"},
"temperature": {"value": 180, "unit": "celsius"}
},
"evidence": {
"blob_id": "uuid",
"sent_id": "vid_123:b42:s17",
"char_span": [215, 278]
},
"conf": {"el": 0.93, "srl": 0.88, "verify_support": 0.86},
"canonical_string": "S:Qxxx|P:Assemble|O:Qyyy|Q:{...}"
}
```

The pipeline processes blobs through the following stages:

```text
blob.jsonl
│
▼
┌─────────────┐
│ Ingest │ Read & validate blobs
└─────┬───────┘
│
▼
┌─────────────┐
│ Preprocess │ Normalize text, split sentences
└─────┬───────┘
│
▼
┌─────────────┐
│ SRL │ Extract predicate-argument structures
└─────┬───────┘
│
▼
┌─────────────┐
│ Canonicalize│ Link entities, map predicates, normalize literals
└─────┬───────┘
│
▼
┌─────────────┐
│ Verify │ Score support with local NLI
└─────┬───────┘
│
▼
┌─────────────┐
│ Dedup │ MinHash LSH + embedding tie-break
└─────┬───────┘
│
▼
┌─────────────┐
│ Store │ PostgreSQL + Qdrant
└─────────────┘
```
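In code, the stages chain naturally. The sketch below is a toy, runnable stand-in for the real modules under src/: every function is a trivial placeholder (SRL output is hardcoded, and the verify/store stages are omitted), so only the data flow is real:

```python
# Toy, runnable sketch of the stage chain; the real implementations live
# in src/ and are driven by src/cli.py.
def ingest(path):
    return [{"blob_id": "b1", "text": "Alice assembles the drone."}]

def preprocess(blobs):
    # normalize text, split sentences
    return [(b["blob_id"], s.strip())
            for b in blobs for s in b["text"].split(".") if s.strip()]

def extract_srl(sentences):
    # pretend SRL output: (blob_id, predicate, A0, A1)
    return [(bid, "assemble", "Alice", "drone") for bid, _ in sentences]

def canonicalize(structs):
    return [f"S:{a0}|P:{pred.title()}|O:{a1}" for _, pred, a0, a1 in structs]

def dedup(facts):
    return sorted(set(facts))

print(dedup(canonicalize(extract_srl(preprocess(ingest("data/blobs.jsonl"))))))
```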
The pipeline is built from the following components.

**SRL (Semantic Role Labeling)**

- Primary: AllenNLP transformer-srl
- Runs as microservice on port 8001
- Returns predicate and arguments (A0, A1, A2, etc.)
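Calling the service might look like this; the endpoint comes from `configs/app.yaml`, but the payload and response fields (`sentence`, `frames`, `predicate`, `arguments`) are assumptions for illustration, so check the service code for the actual contract:

```python
import requests

# Endpoint from configs/app.yaml; request/response shapes are assumed here.
resp = requests.post(
    "http://srl:8001/predict",
    json={"sentence": "Alice assembles the drone with a Torx T20 bit."},
    timeout=30,
)
resp.raise_for_status()
for frame in resp.json().get("frames", []):   # assumed response field
    print(frame.get("predicate"), frame.get("arguments"))
```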
**Entity Linking**

- Primary: REL (Radboud Entity Linker)
- Fallback: BLINK, Bootleg, or deterministic local IDs
- Links surface forms to Wikidata QIDs
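When no QID can be linked with sufficient confidence, a deterministic local ID keeps facts joinable across runs. One way to derive such an ID (a sketch, not the pipeline's exact scheme):

```python
import hashlib

def local_entity_id(surface: str) -> str:
    """Deterministic local ID for an unlinked surface form (sketch)."""
    norm = " ".join(surface.lower().split())  # normalize case and whitespace
    digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()[:12]
    return f"LOCAL:{digest}"

print(local_entity_id("Torx T20 bit"))  # same input -> same ID, every run
```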
**Temporal Normalization**

- Primary: HeidelTime
- Runs as microservice on port 8002
- Normalizes temporal expressions to ISO 8601
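A call to the service might look like the following; note that the host, path, and payload here are hypothetical, so consult the service wrapper for the real interface:

```python
import requests

# Hypothetical endpoint and fields for the HeidelTime wrapper on port 8002.
resp = requests.post(
    "http://heideltime:8002/normalize",
    json={"text": "We filmed it last Tuesday.", "reference_date": "2024-05-03"},
    timeout=30,
)
print(resp.json())  # expected to contain ISO 8601 values such as "2024-04-30"
```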
**Quantity Normalization**

- Library: quantulum3 + Pint
- Converts to canonical units (celsius, millimeter, gram, second)
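For example, quantulum3 can pull quantities out of raw text and Pint can convert them to the canonical units; this sketch uses both libraries' public APIs:

```python
from quantulum3 import parser
import pint

# Extract quantities from free text
for quant in parser.parse("Heat the part to 180 degrees Celsius for 3 minutes."):
    print(quant.surface, "->", quant.value, quant.unit.name)

# Convert a duration to the canonical unit (second)
ureg = pint.UnitRegistry()
duration = 3 * ureg.minute
print(duration.to(ureg.second))  # 180 second
```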
**Verification**

- Method: FEVER/FActScore style
- Uses sentence-transformers for retrieval
- roberta-large-mnli for NLI scoring
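Once evidence sentences are retrieved, scoring a fact reduces to an NLI entailment probability. A minimal sketch with the named model (the retrieval step via sentence-transformers is omitted here):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

evidence = "Alice assembles the drone with a Torx T20 bit."
claim = "Alice assembles a drone."

inputs = tok(evidence, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# roberta-large-mnli label order: contradiction, neutral, entailment
support = probs[2].item()
print(f"verify_support = {support:.2f}")  # compared against min_support
```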
**Deduplication**

- Method: MinHash LSH + embedding fallback
- datasketch for MinHash
- sentence-transformers for semantic similarity
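The LSH step can be sketched with datasketch directly; candidate pairs returned by the index would then be re-checked with embedding cosine similarity (omitted here):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # jaccard_equiv from config
facts = {
    "f1": "alice assembles the drone with a torx t20 bit",
    "f2": "with a torx t20 bit alice assembles the drone",  # same token set
    "f3": "bob measures the wingspan of the aircraft",
}
for fid, text in facts.items():
    lsh.insert(fid, minhash(text))

print(lsh.query(minhash(facts["f1"])))  # candidates for f1, including itself
```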
Edit `configs/app.yaml` to customize the pipeline:

```yaml
srl:
  endpoint: "http://srl:8001/predict"

entity_linking:
  min_conf_accept: 0.75   # Minimum confidence for QID linking

verification:
  enable: true
  min_support: 0.6        # Minimum NLI support score

dedup:
  minhash:
    n_perm: 128
    bands: 32
  jaccard_equiv: 0.85
  embed_cosine_min: 0.84
```
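The pipeline reads this file at startup; for scripts or debugging you can load it yourself (the key paths below mirror the sample above, though the pipeline's own loader may differ):

```python
import yaml

with open("configs/app.yaml", encoding="utf-8") as fh:
    cfg = yaml.safe_load(fh)

print(cfg["srl"]["endpoint"])             # http://srl:8001/predict
print(cfg["dedup"]["minhash"]["n_perm"])  # 128
```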
Required models and data:

- **REL Entity Linking Data** (if using REL): download from https://github.com/informagi/REL and place in `/data/rel/`
- **Sentence Transformers** (auto-downloaded): `sentence-transformers/all-MiniLM-L6-v2`
- **NLI Model** (auto-downloaded): `roberta-large-mnli`
- **HeidelTime** (optional): download from https://github.com/HeidelTime/heideltime and place the jar in `resources/heideltime/`
Run the test suite:

```bash
# In Docker
make test

# Locally
make test-local
```

To run the pipeline outside Docker:

```bash
# Install dependencies
pip install -r requirements.txt

# Set Python path
export PYTHONPATH=.

# Run CLI
python src/cli.py --blobs data/blobs.jsonl --config configs/app.yaml
```

If REL data is missing, the system falls back to deterministic local IDs:

```text
WARNING: REL not installed, using stub entity linker
```

This is fine for testing. For production, download the REL indices.
If HeidelTime fails:

```text
WARNING: HeidelTime request failed: ...
```

Check that the jar is at `resources/heideltime/heideltime-standalone.jar`, or disable it:

```yaml
timex:
  mode: "none"   # Disable HeidelTime
```

If the NLI model runs out of memory:
- Reduce the batch size
- Use a smaller model:
  ```yaml
  verification:
    nli_model: "distilroberta-base"
  ```
- Disable verification:
  ```yaml
  verification:
    enable: false
  ```
Ensure PostgreSQL is running:

```bash
docker compose -f docker/docker-compose.yml up -d db
docker compose -f docker/docker-compose.yml logs db
```

If SRL is slow to respond:

- Increase the timeout in `src/srl/client.py`
- Check the SRL service logs: `make logs-srl`
Generate LLM-based test data for validation:

```bash
# Generate 20 challenging test blobs
make generate-test-data N=20 CHALLENGING=1

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python scripts/generate_test_data.py --count 20 --challenging
```

Test data is saved to `resources/generated_data/XXXX/` with:

- `blobs.jsonl` - Test blobs
- `metadata.json` - Generation metadata
Run comprehensive tests on saved datasets:

```bash
# Test all saved datasets
make test-harness SCAN=1

# Test specific dataset
make test-harness DATASET=0001

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python scripts/test_harness.py --scan --run-pipeline --analyze
```

Results are saved to:

- `test_results/test_results.json` - Summary
- `resources/generated_data/XXXX/analysis.json` - Per-dataset analysis
Run pytest unit tests:

```bash
make test

# Or directly
docker compose -f docker/docker-compose.yml run --rm pipeline \
    python -m pytest tests/ -v
```

A typical end-to-end testing workflow:

- Generate test data: `make generate-test-data N=50 CHALLENGING=1`
- Run the pipeline: `make test-harness SCAN=1`
- Review results: `cat test_results/test_results.json | python -m json.tool`

See TEST_DATA_WORKFLOW.md and QUICK_TEST_GUIDE.md for detailed testing documentation.
Full CLI invocation:

```bash
python src/cli.py \
    --blobs /path/to/blobs.jsonl \
    --config configs/app.yaml \
    --init-db \
    --log-level DEBUG
```

See `src/storage/ddl.sql` for the complete schema. Key tables:

- `entities` - Linked entities with QIDs/local IDs
- `videos` - Video metadata
- `facts` - Canonical facts with embeddings
- `fact_minhash` - LSH band hashes
- `clusters` - Deduplicated fact clusters
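For ad-hoc inspection from Python, something like the following works against a local compose setup; the connection parameters and any columns beyond `fact_id`/`video_id` are assumptions, so check `ddl.sql` for the real definitions:

```python
import psycopg2

# Connection parameters are assumptions for a local docker compose setup.
conn = psycopg2.connect(host="localhost", dbname="facts",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT fact_id, video_id FROM facts LIMIT 5;")
    for fact_id, video_id in cur.fetchall():
        print(fact_id, video_id)
conn.close()
```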
MIT License
- Fork the repository
- Create a feature branch
- Run tests: `make test`
- Submit a pull request