@stephengoldbaum

Summary

This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.

What's New

Core Features

  • RDF Ingestion Source (type: rdf) - Native DataHub plugin for RDF/OWL ontologies
  • Multiple Format Support - Turtle, RDF/XML, JSON-LD, N3, N-Triples
  • Flexible Source Loading - Files, directories (with recursive option), URLs, and comma-separated file lists
  • Glossary Term Extraction - Converts skos:Concept and owl:Class to DataHub GlossaryTerms
  • Glossary Node Hierarchy - Auto-creates glossary nodes from IRI path hierarchies
  • Term Relationships - Extracts skos:broader and skos:narrower relationships as isRelatedTerms
  • Stateful Ingestion - Full support for stale entity removal via stateful_ingestion config
  • Platform Instance Support - Configurable platform instances via platform_instance config
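The glossary node hierarchy bullet above can be illustrated with a minimal sketch. It assumes that the IRI path (plus any fragment) is split into segments, that the last segment is the term name, and that every leading prefix becomes a glossary node. The function names are hypothetical and the host is ignored here purely for illustration; the actual source may derive nodes differently.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def iri_segments(iri: str) -> List[str]:
    """Split an IRI into its non-empty path segments plus the fragment, if any."""
    parsed = urlparse(iri)
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.fragment:
        segments.append(parsed.fragment)
    return segments


def glossary_node_paths(iri: str) -> List[Tuple[str, ...]]:
    """Return ancestor node paths, outermost first.

    Assumption: the final segment names the term itself, so it is excluded.
    """
    parents = iri_segments(iri)[:-1]
    return [tuple(parents[: i + 1]) for i in range(len(parents))]


nodes = glossary_node_paths("http://example.org/finance/accounts/Customer")
# nodes == [("finance",), ("finance", "accounts")]
```

Returning tuples rather than joined strings keeps the sketch neutral about how node identifiers are finally encoded.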

Architecture

  • Modular Entity Processing - Clean separation with extractors, converters, and MCP builders
  • Dependency-Based Processing - Topological sort for correct entity processing order
  • Test Connection Support - Implements test_connection() for connection validation
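As a rough illustration of dependency-based processing, the sketch below orders entity types with the standard library's `graphlib.TopologicalSorter`. The dependency map and the entity type names are assumptions made for this example, not the PR's actual registry contents.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: entity type -> entity types it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "glossary_node": set(),
    "glossary_term": {"glossary_node"},
    "relationship": {"glossary_term"},
}


def processing_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return entity types so that each one comes after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = processing_order(DEPENDENCIES)
# order == ["glossary_node", "glossary_term", "relationship"]
```

With this ordering, glossary nodes would be emitted before the terms that reference them, and relationships only after both endpoints exist.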

Capabilities

The source supports the following DataHub capabilities:

| Capability | Status / Notes |
|---|---|
| Glossary Terms | Enabled by default |
| Glossary Nodes | Auto-created from IRI path hierarchies |
| Term Relationships | Supports skos:broader and skos:narrower |
| Detect Deleted Entities | Requires stateful_ingestion.enabled: true |
| Platform Instance | Supported via platform_instance config |
| Extract Descriptions | Enabled by default (from skos:definition or rdfs:comment) |
| Data Domain | Not applicable (domains used internally for hierarchy) |
| Dataset Profiling | Not applicable |
| Extract Lineage | Not in MVP |
| Extract Ownership | Not supported |
| Extract Tags | Not supported |
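For the "Detect Deleted Entities" capability, the underlying idea can be sketched as a set difference between the URNs recorded by the previous run and those emitted by the current run. This is only a conceptual illustration with a hypothetical helper name; in DataHub the comparison is handled by the stateful ingestion framework rather than by code like this.

```python
from typing import Iterable, Set


def find_stale_urns(previous_run: Iterable[str], current_run: Iterable[str]) -> Set[str]:
    """Return URNs seen in the previous run but not emitted in the current one."""
    return set(previous_run) - set(current_run)


previous = {
    "urn:li:glossaryTerm:finance.Customer",
    "urn:li:glossaryTerm:finance.Account",
}
current = {"urn:li:glossaryTerm:finance.Customer"}

stale = find_stale_urns(previous, current)
# stale == {"urn:li:glossaryTerm:finance.Account"}
```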

Testing

Test Coverage

  • 126 unit tests - Comprehensive coverage of core functionality, error handling, and edge cases
  • 16 integration tests - End-to-end testing with golden file validation
  • Test scenarios include:
    • Simple glossary ingestion
    • Glossary with relationships
    • Glossary with domains
    • Multiple RDF formats (Turtle, RDF/XML, JSON-LD)
    • Recursive directory ingestion
    • Export filtering (export_only, skip_export)
    • Stateful ingestion with stale entity removal
    • Error handling (missing files, malformed RDF, invalid formats)
    • Large file performance warnings
    • Path traversal protection

Test Files

  • tests/unit/rdf/ - Unit tests for individual components
  • tests/integration/rdf/ - Integration tests with golden file validation
  • All tests passing ✅

Documentation

User Documentation

  • docs/sources/rdf/rdf.md - Comprehensive user guide (489 lines)
    • Quickstart guide
    • Configuration reference
    • RDF format and source types
    • Dialects and selective export
    • Stateful ingestion guide
    • Example RDF files
    • IRI-to-URN mapping
    • Glossary node hierarchy
    • Supported vocabularies
    • Limitations and troubleshooting

Recipe Examples

  • docs/sources/rdf/rdf_recipe.yml - Example recipes for basic and stateful ingestion

Integration Test Documentation

  • tests/integration/rdf/README.md - Detailed guide for running integration tests

Configuration Example

source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary

Files Changed

Technical Notes

Security & Performance

  • URL Loading Security - Timeout limits, size limits, and redirect limits for safe URL loading
  • Path Traversal Protection - Configurable enforcement to prevent access outside intended directories
  • Memory Efficiency - Generator patterns for work unit generation and streaming for large URL downloads
  • Format Validation - Validates RDF formats before processing
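One common way to implement path traversal protection is to resolve the requested path and check that it still lies under an allowed base directory. The sketch below shows that idea with `pathlib`; the function name and the choice to raise `ValueError` are assumptions, and the PR's configurable enforcement may work differently.

```python
from pathlib import Path


def resolve_within(base_dir: str, candidate: str) -> Path:
    """Resolve candidate relative to base_dir and reject paths that escape it."""
    base = Path(base_dir).resolve()
    # Joining with an absolute candidate discards base, so absolute paths
    # outside base_dir are rejected by the same check.
    target = (base / candidate).resolve()
    if target != base and base not in target.parents:
        raise ValueError(f"Path escapes base directory: {candidate}")
    return target
```

For example, `resolve_within("/data/rdf", "glossary.ttl")` would return the resolved file path, while `resolve_within("/data/rdf", "../secrets.ttl")` would raise `ValueError`.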

Code Quality

  • Thread-Safe Registry - Entity registry uses double-checked locking pattern for thread safety
  • Component Validation - Validates registered components for entity type consistency
  • Type Safety - Complete type hints with proper forward references for MCP return types
  • Error Handling - Granular error reporting with structured logs and context
  • URN Generation - Standardized URN format using dot notation, proper encoding, and GUID fallback for non-ASCII characters
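The double-checked locking pattern mentioned for the registry can be sketched as follows. `EntityRegistry`, `get_registry`, and the registered builder are hypothetical stand-ins for illustration, not the classes added in this PR.

```python
import threading
from typing import Callable, Dict, List, Optional


class EntityRegistry:
    """Hypothetical registry mapping entity type names to builder callables."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[], object]] = {}

    def register(self, entity_type: str, builder: Callable[[], object]) -> None:
        self._builders[entity_type] = builder

    def entity_types(self) -> List[str]:
        return sorted(self._builders)


_registry: Optional[EntityRegistry] = None
_registry_lock = threading.Lock()


def get_registry() -> EntityRegistry:
    """Lazily create one shared registry, even under concurrent first calls."""
    global _registry
    if _registry is None:  # first check, without taking the lock
        with _registry_lock:
            if _registry is None:  # second check, while holding the lock
                registry = EntityRegistry()
                registry.register("glossary_term", dict)  # placeholder builder
                _registry = registry  # publish only once fully initialised
    return _registry
```

The unlocked first check keeps the common path cheap once the registry exists, and the second check prevents two threads that raced past the first check from both constructing a registry.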

New Files

  • src/datahub/ingestion/source/rdf/ingestion/rdf_source.py - Main source implementation
  • src/datahub/ingestion/source/rdf/core/rdf_loader.py - RDF loading utilities with security
  • src/datahub/ingestion/source/rdf/core/urn_generator.py - URN generation with encoding
  • src/datahub/ingestion/source/rdf/entities/base.py - Base interfaces for entity processing
  • src/datahub/ingestion/source/rdf/entities/registry.py - Thread-safe entity registry
  • docs/sources/rdf/rdf.md - User documentation
  • docs/sources/rdf/rdf_recipe.yml - Recipe examples
  • tests/integration/rdf/test_rdf_source.py - Integration tests
  • tests/unit/rdf/ - Unit tests (multiple files)

Modified Files

  • setup.py - Added RDF source to entry points (line 862)

Breaking Changes

None - This is a new feature addition with no breaking changes to existing functionality.

Support Status

The RDF source is marked as INCUBATING (SupportStatus.INCUBATING), indicating that it is ready for community adoption but may still see minor changes in future releases based on feedback.

Checklist

  • Plugin registered in setup.py
  • Source class properly decorated (@platform_name, @config_class, @support_status)
  • Capability decorators added
  • Stateful ingestion implemented
  • test_connection() implemented
  • Comprehensive error handling
  • Security measures (timeouts, size limits, path traversal protection)
  • Memory-efficient patterns (generators, streaming)
  • Thread-safe registry
  • Type hints complete
  • All tests passing
  • User documentation complete
  • Integration test documentation
  • Code follows DataHub standards
  • Linting passes

- Add RDF ingestion source for glossary terms, domains, and relationships
- Streamlined architecture: extractors return DataHub AST directly
- Removed unnecessary abstraction layers (RDF AST, converters where not needed)
- Support for SKOS, OWL, and other RDF vocabularies
- Comprehensive test coverage with 128 passing tests
- UI integration for RDF source configuration
- Remove build_relationship_mcps() method from GlossaryTermMCPBuilder
- Update tests to use RelationshipMCPBuilder directly
- Clean separation: glossary_term handles terms, relationship handles relationships
- Refactor _generate_workunits_from_ast to reduce complexity
…ance

- Enhance error handling in RDFSource to provide actionable messages for missing files, malformed RDF, and invalid formats.
- Implement unit tests to verify error handling behavior and ensure graceful degradation.
- Update glossary term URN generation to use dot notation for hierarchical paths.
- Improve logging for large file processing and ensure consistent URN formats across glossary nodes and terms.
- Refactor methods to yield MCPs for memory efficiency during processing.
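To make the dot-notation URN change concrete, here is a minimal sketch. It assumes the term identifier is built from the IRI's path segments joined with dots, that reserved characters are percent-encoded, and that a hash of the full IRI stands in for the GUID fallback when the identifier contains non-ASCII characters. The helper name, the decision to ignore the host, and the use of MD5 are all assumptions for illustration.

```python
import hashlib
from urllib.parse import quote, urlparse


def glossary_term_urn(iri: str) -> str:
    """Build a glossary term URN from an IRI using dot notation."""
    parsed = urlparse(iri)
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.fragment:
        segments.append(parsed.fragment)
    term_id = ".".join(segments)

    if not term_id or not term_id.isascii():
        # Assumed fallback: a stable hash of the whole IRI instead of a readable id.
        term_id = hashlib.md5(iri.encode("utf-8")).hexdigest()
    else:
        # Percent-encode anything other than letters, digits, and . - _ ~
        term_id = quote(term_id, safe=".-_")

    return f"urn:li:glossaryTerm:{term_id}"


urn = glossary_term_urn("http://example.org/finance/accounts/Customer")
# urn == "urn:li:glossaryTerm:finance.accounts.Customer"
```

Because the fallback hashes the full IRI, the same non-ASCII IRI maps to the same URN on every run, which stateful ingestion relies on.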
@github-actions bot added the ingestion, product, and community-contribution labels on Dec 10, 2025
@datahub-cyborg bot added the needs-review label on Dec 10, 2025
@codecov
codecov bot commented Dec 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


…uidance

- Added helper text for various RDF source fields including source, format, extensions, recursive processing, environment, and dialect.
- These enhancements aim to provide clearer instructions and examples for users configuring RDF ingestion settings.
- Introduced new RDF platform entry in capability_summary.json with detailed capabilities including deletion detection, tags, ownership, lineage, data profiling, domains, descriptions, and platform instance support.
- Each capability includes a description and support status to enhance clarity for users configuring RDF ingestion.
@stephengoldbaum changed the title from "Rdf simplification pr" to "RDF Ingestion Source" on Dec 11, 2025
@codecov
codecov bot commented Dec 11, 2025

Bundle Report

Changes will increase total bundle size by 2.94kB (0.01%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
|---|---|---|
| datahub-react-web-esm | 28.86MB | 2.94kB (0.01%) ⬆️ |

Affected Assets, Files, and Routes (bundle: datahub-react-web-esm):

Assets Changed:

| Asset Name | Size Change | Total Size | Change (%) |
|---|---|---|---|
| assets/index-*.js | 2.94kB | 19.23MB | 0.02% |

Files in assets/index-*.js:

  • ./src/app/ingestV2/source/builder/RecipeForm/rdf.ts → Total Size: 2.86kB

  • ./src/app/ingestV2/source/builder/sources.json → Total Size: 37.23kB

  • ./src/app/ingestV2/source/builder/constants.ts → Total Size: 6.06kB

  • ./src/app/ingestV2/source/builder/RecipeForm/constants.ts → Total Size: 10.11kB

- Changed the warning filter to ignore specific SQLAlchemy warnings.
- Added a new dependency for RDF support in the setup configuration.
…ters

- Introduced a new documentation file for the RDF ingestion source.
- Updated type hints across various classes to use `Optional` for context parameters, enhancing code clarity and type safety.
- Adjusted method signatures in `EntityExtractor`, `EntityConverter`, `EntityMCPBuilder`, and related classes to reflect these changes.
- Included "rdf" in both base development and full test development requirements in setup.py to ensure proper support for RDF ingestion.
- Added unit tests for duplicate term definition handling, ensuring correct extraction behavior for same URIs and properties.
- Implemented comprehensive validation tests for RDF source configuration, covering required fields, type checks, and value constraints.
- Introduced connection testing unit tests to verify functionality and error handling for various scenarios, including file existence and RDF format validation.
- Developed edge case tests to handle scenarios like empty files, circular relationships, and special characters in paths.
- Enhanced error handling tests to ensure actionable feedback for file not found, invalid format, and unsupported extensions.