A dockerized Python script for harvesting Linked Data Event Streams (LDES) endpoints and caching RDF data as N-Triples files.
- Full LDES harvesting: Automatically follows pagination and harvests all members from an LDES endpoint
- N-Triples output: Converts JSON-LD members to N-Triples format for easy processing (e.g. import into a triplestore)
- Resume capability: Automatically resumes from where it left off if interrupted (harvesting repositories can take days)
- Statistics & logging: Detailed logging of fetched URLs and harvesting statistics
- Docker support: Runs in a containerized environment for easy deployment (no Python venv necessary)
- Docker installed on your system
- An LDES endpoint URL (e.g. https://data.rijksmuseum.nl/ldes/dataset/260242/collection.json, the Delfts aardewerk event stream from the Rijksmuseum)
docker build -t ldes-harvester .

docker run -v "$(pwd)/cache:/app/cache" ldes-harvester \
  https://example.com/ldes

# Disable resume (start fresh)
docker run -v "$(pwd)/cache:/app/cache" ldes-harvester \
--no-resume https://example.com/ldes
# Custom cache directory (inside container)
docker run -v "$(pwd)/my-cache:/data" ldes-harvester \
--cache-dir /data https://example.com/ldes
# View help
docker run ldes-harvester --help

The harvester creates the following structure in the cache directory:
cache/
├── state.json # Resume state (processed pages and members)
├── harvester.log # Detailed log file
├── <hash1>.nt # N-Triples file for member 1
├── <hash2>.nt # N-Triples file for member 2
└── ...
Each member is saved as a separate N-Triples file with a filename derived from the SHA-256 hash of the member's ID. This ensures:
- Unique filenames for each member
- Deduplication (same member won't be saved twice)
- Consistent naming across runs
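A minimal sketch of how such a filename can be derived (the helper name is illustrative, not taken from harvester.py):

```python
import hashlib
from pathlib import Path

def member_filename(member_id: str, cache_dir: str = "cache") -> Path:
    """Derive a deterministic .nt filename from a member's ID (illustrative helper)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.nt"

# The same member ID always maps to the same file, so a member is never saved twice.
print(member_filename("https://example.com/object/123"))
```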
The harvester automatically saves its state to cache/state.json after every 10 pages. This includes:
- List of processed page URLs
- List of processed member IDs
- Current statistics
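As an illustration, the state file could look roughly like this; the exact key names used by harvester.py are an assumption here:

```python
import json
from pathlib import Path

# Hypothetical shape of cache/state.json, based on the fields listed above;
# the actual key names used by harvester.py may differ.
state = {
    "processed_pages": ["https://example.com/ldes?page=1"],
    "processed_members": ["https://example.com/object/123"],
    "stats": {"members_harvested": 1, "pages_processed": 1, "errors": 0},
}

Path("cache").mkdir(exist_ok=True)
Path("cache/state.json").write_text(json.dumps(state, indent=2))
```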
If the harvester is interrupted:
- Simply run the same command again
- It will automatically resume from the last saved state
- Already processed members and pages will be skipped
To disable resume and start fresh, use the --no-resume flag.
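A rough sketch of how resuming can work, assuming the hypothetical state layout shown above:

```python
import json
from pathlib import Path

STATE_FILE = Path("cache/state.json")

def load_state(resume: bool = True) -> dict:
    """Return the previous run's state, or a fresh one when resume is disabled."""
    if resume and STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"processed_pages": [], "processed_members": [], "stats": {}}

state = load_state(resume=True)                 # --no-resume would pass resume=False
already_done = set(state["processed_members"])  # members with these IDs are skipped
```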
At the end of each run, the harvester displays:
- Members harvested: Total number of LDES members saved
- Pages processed: Total number of LDES pages fetched
- Errors encountered: Number of errors during harvesting
- Duration: Total time taken
- Cache directory: Location of saved files
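For illustration only, the run summary could be collected in a small data structure like this (names are hypothetical, not harvester.py's internals):

```python
from dataclasses import dataclass

@dataclass
class HarvestStats:
    """Illustrative container for the run summary; not harvester.py's actual code."""
    members_harvested: int = 0
    pages_processed: int = 0
    errors: int = 0

    def summary(self, duration_s: float, cache_dir: str) -> str:
        return (
            f"Members harvested: {self.members_harvested}\n"
            f"Pages processed: {self.pages_processed}\n"
            f"Errors encountered: {self.errors}\n"
            f"Duration: {duration_s:.1f}s\n"
            f"Cache directory: {cache_dir}"
        )
```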
The harvester provides two levels of logging:
- Console Output (INFO level):
  - URLs being fetched
  - Number of members found on each page
  - Progress updates
  - Final statistics
- Log File (cache/harvester.log):
  - Detailed DEBUG-level information
  - Complete error traces
  - Timestamps for all operations
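A sketch of a logging setup that matches this behaviour (console handler at INFO, file handler at DEBUG); the exact configuration in harvester.py may differ:

```python
import logging
from pathlib import Path

def configure_logging(log_path: str = "cache/harvester.log") -> logging.Logger:
    """Console handler at INFO, file handler at DEBUG (sketch of the behaviour above)."""
    Path(log_path).parent.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("ldes-harvester")
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    logfile = logging.FileHandler(log_path)
    logfile.setLevel(logging.DEBUG)
    logfile.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    logger.addHandler(console)
    logger.addHandler(logfile)
    return logger
```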
LDES (Linked Data Event Streams) is a specification for publishing collections of versioned objects:
- Entry Point: The collection URL returns metadata and links to time-based pages
- Pagination: Each page contains members and links to next pages via relation objects
- Members: Individual data objects in JSON-LD format with timestamps
- Traversal: The harvester follows all pagination links recursively
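A heavily simplified traversal sketch; the JSON keys ("member", "relation", "node") are placeholders, since the real names depend on the endpoint's JSON-LD context:

```python
import requests

def traverse(entry_url: str):
    """Walk an LDES from its entry point, yielding members as JSON-LD objects."""
    queue, seen = [entry_url], set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        page = requests.get(url, timeout=60).json()
        yield from page.get("member", [])           # members on this page
        for rel in page.get("relation", []):        # links to further pages
            node = rel.get("node")
            if isinstance(node, str) and node not in seen:
                queue.append(node)

# for member in traverse("https://example.com/ldes"):
#     print(member.get("id"))
```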
Key Difference from IIIF Change Discovery: LDES provides events that embed full object representations (or object fragments), so the evolving state of an object is delivered directly within the stream without requiring separate dereferencing of object URIs.
When you harvest objects and convert them to N-Triples, you may notice that objects published using one ontology (like Linked Art) appear in the N-Triples output using a different ontology (like CIDOC-CRM). This is not an error but a fundamental feature of JSON-LD and semantic web technologies.
JSON-LD Context Expansion: JSON-LD documents include a @context that maps short property names to full URIs. When the rdflib library parses JSON-LD and converts it to N-Triples, it:
- Expands all compact property names to their full URI form
- Resolves all namespace prefixes to complete URIs
- Flattens the nested JSON structure into RDF triples
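A minimal, self-contained example of this expansion with rdflib (version 6+ bundles the JSON-LD parser); the inline context stands in for a remote one such as Linked Art's:

```python
from rdflib import Graph

# A tiny JSON-LD document with an inline @context, kept local so the example
# runs offline; a remote context such as Linked Art's works the same way.
doc = """
{
  "@context": {"name": "http://schema.org/name"},
  "@id": "https://example.com/object/123",
  "name": "Plate with floral decoration"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")   # expansion happens during parsing
print(g.serialize(format="nt"))
# <https://example.com/object/123> <http://schema.org/name> "Plate with floral decoration" .
```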
A Linked Art object might use compact notation:
{
"@context": "https://linked.art/ns/v1/linked-art.json",
"type": "HumanMadeObject",
"identified_by": [...]
}
But in N-Triples, this becomes:
<uri> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E22_Man-Made_Object> .
<uri> <http://www.cidoc-crm.org/cidoc-crm/P1_is_identified_by> ...
Linked Art is built on CIDOC-CRM: Linked Art is a community-developed profile that uses CIDOC-CRM as its underlying ontology. The Linked Art context maps friendly property names (like identified_by) to their corresponding CIDOC-CRM properties (like P1_is_identified_by).
This means:
- Linked Art = Human-friendly JSON-LD with simplified property names
- CIDOC-CRM = The formal ontology underneath with full semantic definitions
- N-Triples = The expanded, fully-qualified RDF representation
- Semantic Interoperability: Different JSON-LD profiles can map to the same underlying ontology
- Machine Readability: N-Triples contain unambiguous, fully-qualified URIs
- SPARQL Querying: The expanded form makes it easy to query across different source formats
- Standard Compliance: CIDOC-CRM URIs are standardized and widely recognized
- If you're importing into a triplestore, use the N-Triples directly - they contain the complete semantic information
- If you need human-readable property names, retain the original JSON-LD with its context
- The ontology URIs (CIDOC-CRM, Getty AAT, etc.) in the N-Triples are the "true" semantic representation of your data
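For example, the harvested files can be loaded into a single rdflib graph and queried via the expanded CIDOC-CRM URIs; this is a sketch assuming the cache/ layout shown above:

```python
from pathlib import Path
from rdflib import Graph

# Load every harvested N-Triples file into one graph and query it via the
# expanded CIDOC-CRM URIs, regardless of the JSON-LD profile the source used.
g = Graph()
for nt_file in Path("cache").glob("*.nt"):
    g.parse(str(nt_file), format="nt")

query = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?object ?identifier WHERE {
  ?object crm:P1_is_identified_by ?identifier .
}
"""
for obj, identifier in g.query(query):
    print(obj, identifier)
```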
- Python 3.11: Runtime environment
- requests: HTTP client for fetching LDES pages
- rdflib: RDF parsing and N-Triples serialization
The harvester follows this workflow:
- Fetch the LDES collection entry point
- Extract initial page URLs from relations
- For each page:
  - Fetch and parse JSON-LD content
  - Extract members
  - Convert each member to RDF graph
  - Serialize to N-Triples format
  - Save with hash-based filename
  - Follow pagination links recursively
- Save state periodically for resume capability
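A condensed, illustrative sketch of this workflow (function and JSON key names are assumptions, not harvester.py's actual code):

```python
import hashlib
import json
from pathlib import Path

import requests
from rdflib import Graph

CACHE = Path("cache")

def harvest(entry_url: str) -> None:
    """Condensed sketch of the workflow above; key names ("member", "relation",
    "node") are placeholders that depend on the endpoint's JSON-LD context."""
    CACHE.mkdir(exist_ok=True)
    queue, done_pages, done_members = [entry_url], set(), set()

    while queue:
        url = queue.pop(0)
        if url in done_pages:
            continue
        page = requests.get(url, timeout=60).json()   # fetch and parse a JSON-LD page
        done_pages.add(url)

        for member in page.get("member", []):          # extract members
            member_id = member.get("id") or member.get("@id", "")
            if not member_id or member_id in done_members:
                continue
            g = Graph()
            g.parse(data=json.dumps(member), format="json-ld")        # member -> RDF graph
            digest = hashlib.sha256(member_id.encode()).hexdigest()   # hash-based filename
            (CACHE / f"{digest}.nt").write_text(g.serialize(format="nt"))
            done_members.add(member_id)

        for rel in page.get("relation", []):           # follow pagination links
            node = rel.get("node")
            if isinstance(node, str) and node not in done_pages:
                queue.append(node)

        if len(done_pages) % 10 == 0:                  # periodic state save for resume
            (CACHE / "state.json").write_text(json.dumps({
                "processed_pages": sorted(done_pages),
                "processed_members": sorted(done_members),
            }))
```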
# Install dependencies
pip install -r requirements.txt
# Run harvester
python harvester.py --cache-dir ./cache \
https://data.rijksmuseum.nl/ldes/dataset/260250/collection.json

This project is provided as-is for harvesting publicly available LDES endpoints.