RDF Ingestion Source #15520
- Add RDF ingestion source for glossary terms, domains, and relationships
- Streamlined architecture: extractors return DataHub AST directly
- Removed unnecessary abstraction layers (RDF AST, converters where not needed)
- Support for SKOS, OWL, and other RDF vocabularies
- Comprehensive test coverage with 128 passing tests
- UI integration for RDF source configuration
- Remove `build_relationship_mcps()` method from `GlossaryTermMCPBuilder`
- Update tests to use `RelationshipMCPBuilder` directly
- Clean separation: glossary_term handles terms, relationship handles relationships
- Refactor `_generate_workunits_from_ast` to reduce complexity
…ance
- Enhance error handling in `RDFSource` to provide actionable messages for missing files, malformed RDF, and invalid formats.
- Implement unit tests to verify error handling behavior and ensure graceful degradation.
- Update glossary term URN generation to use dot notation for hierarchical paths.
- Improve logging for large file processing and ensure consistent URN formats across glossary nodes and terms.
- Refactor methods to yield MCPs for memory efficiency during processing.
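To make the dot-notation URN change concrete, here is a minimal sketch of how a term IRI might be turned into a hierarchical glossary term URN. It is not the PR's `urn_generator.py`: the function name `term_urn_from_iri`, the choice to drop the host, and the percent-encoding of literal dots are all assumptions made for illustration.

```python
from urllib.parse import quote, urlparse


def term_urn_from_iri(iri: str) -> str:
    """Join the IRI's path segments (and fragment, if any) with dots.

    Hypothetical helper: the real generator may keep the host, use a
    different encoding, or handle fragments differently.
    """
    parsed = urlparse(iri)
    # Path segments in hierarchical order, skipping empty pieces from "//" or a trailing "/".
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.fragment:
        segments.append(parsed.fragment)
    # Percent-encode each segment; quote() leaves "." alone, so encode it
    # explicitly to keep a literal dot from reading as a path separator.
    encoded = [quote(s, safe="").replace(".", "%2E") for s in segments]
    return "urn:li:glossaryTerm:" + ".".join(encoded)


print(term_urn_from_iri("http://example.com/finance/accounts#Revenue"))
```

Under these assumptions a glossary node and the terms beneath it share a common dotted prefix, which is one way to keep URN formats consistent across nodes and terms.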
Codecov Report: ✅ All modified and coverable lines are covered by tests.
…uidance
- Added helper text for various RDF source fields including source, format, extensions, recursive processing, environment, and dialect.
- These enhancements aim to provide clearer instructions and examples for users configuring RDF ingestion settings.
- Introduced new RDF platform entry in `capability_summary.json` with detailed capabilities including deletion detection, tags, ownership, lineage, data profiling, domains, descriptions, and platform instance support.
- Each capability includes a description and support status to enhance clarity for users configuring RDF ingestion.
Bundle Report: Changes will increase total bundle size by 2.94kB (0.01%) ⬆️. This is within the configured threshold ✅
Affected bundle: datahub-react-web-esm
- Changed the warning filter to ignore specific SQLAlchemy warnings.
- Added a new dependency for RDF support in the setup configuration.
…ters
- Introduced a new documentation file for the RDF ingestion source.
- Updated type hints across various classes to use `Optional` for context parameters, enhancing code clarity and type safety.
- Adjusted method signatures in `EntityExtractor`, `EntityConverter`, `EntityMCPBuilder`, and related classes to reflect these changes.
- Included "rdf" in both base development and full test development requirements in setup.py to ensure proper support for RDF ingestion.
- Added unit tests for duplicate term definition handling, ensuring correct extraction behavior for same URIs and properties.
- Implemented comprehensive validation tests for RDF source configuration, covering required fields, type checks, and value constraints.
- Introduced connection testing unit tests to verify functionality and error handling for various scenarios, including file existence and RDF format validation.
- Developed edge case tests to handle scenarios like empty files, circular relationships, and special characters in paths.
- Enhanced error handling tests to ensure actionable feedback for file not found, invalid format, and unsupported extensions.
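The error cases these tests exercise (file not found, unsupported extension, empty file) can be illustrated with a small pre-flight check. This is a sketch only: `check_rdf_path`, `DEFAULT_EXTENSIONS`, and the message wording are hypothetical and are not taken from the PR's `RDFSource` or its `test_connection()`.

```python
from pathlib import Path
from typing import List, Sequence

# Assumed default; the real source's supported extensions may differ.
DEFAULT_EXTENSIONS = (".ttl", ".rdf", ".owl", ".jsonld", ".n3", ".nt")


def check_rdf_path(path_str: str, extensions: Sequence[str] = DEFAULT_EXTENSIONS) -> List[str]:
    """Return human-readable problems; an empty list means the path looks usable."""
    path = Path(path_str)
    if not path.exists():
        return [f"File not found: {path}. Check the 'source' setting in the recipe."]
    if path.is_dir():
        # A directory would be scanned for matching files by the caller.
        return []
    problems: List[str] = []
    if path.suffix.lower() not in extensions:
        expected = ", ".join(extensions)
        problems.append(
            f"Unsupported extension '{path.suffix}' for {path.name}. Expected one of: {expected}."
        )
    elif path.stat().st_size == 0:
        problems.append(f"{path.name} is empty; nothing to ingest.")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue at once and decide for itself whether to degrade gracefully or abort.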
…dbaum/datahub into rdf-simplification-pr
Summary
This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.
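As a rough illustration of the extraction described in this summary, the sketch below walks a list of triples and collects glossary terms. It uses plain tuples of prefixed names instead of a real RDF parser, and the `Term` dataclass, the `extract_terms` function, the use of `skos:prefLabel`/`rdfs:label` for names, and the preference for `skos:definition` over `rdfs:comment` are all assumptions, not the PR's actual extractor.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names
CONCEPT_TYPES = {"skos:Concept", "owl:Class"}


@dataclass
class Term:
    uri: str
    name: Optional[str] = None
    definition: Optional[str] = None
    related: List[str] = field(default_factory=list)


def extract_terms(triples: Iterable[Triple]) -> Dict[str, Term]:
    triples = list(triples)
    # Pass 1: every subject typed as a concept or class becomes a term.
    terms = {s: Term(uri=s) for s, p, o in triples if p == "rdf:type" and o in CONCEPT_TYPES}
    # Pass 2: attach names, definitions, and relationships to known terms.
    for s, p, o in triples:
        term = terms.get(s)
        if term is None:
            continue  # properties of non-term subjects are ignored
        if p in ("skos:prefLabel", "rdfs:label"):
            term.name = term.name or o
        elif p == "skos:definition":
            term.definition = o  # assumed to take priority over rdfs:comment
        elif p == "rdfs:comment":
            term.definition = term.definition or o
        elif p in ("skos:broader", "skos:narrower") and o in terms:
            term.related.append(o)  # both directions recorded as related terms
    return terms


sample = [
    ("ex:Revenue", "rdf:type", "skos:Concept"),
    ("ex:Revenue", "skos:prefLabel", "Revenue"),
    ("ex:Revenue", "skos:definition", "Income from sales"),
    ("ex:Income", "rdf:type", "owl:Class"),
    ("ex:Revenue", "skos:broader", "ex:Income"),
]
for term in extract_terms(sample).values():
    print(term)
```

The two-pass shape means a relationship can point at a term declared later in the file, at the cost of holding the triples in memory.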
What's New
Core Features
- (`type: rdf`) - Native DataHub plugin for RDF/OWL ontologies
- `skos:Concept` and `owl:Class` to DataHub GlossaryTerms
- `skos:broader` and `skos:narrower` relationships as `isRelatedTerms`
- `stateful_ingestion` config
- `platform_instance` config
Architecture
- `test_connection()` for connection validation
Capabilities
The source supports the following DataHub capabilities:
- `skos:broader` and `skos:narrower`
- `stateful_ingestion.enabled: true`
- `platform_instance` config
- (`skos:definition` or `rdfs:comment`)
Testing
Test Coverage
- (`export_only`, `skip_export`)
Test Files
- `tests/unit/rdf/` - Unit tests for individual components
- `tests/integration/rdf/` - Integration tests with golden file validation
Documentation
User Documentation
- `docs/sources/rdf/rdf.md` - Comprehensive user guide (489 lines)
Recipe Examples
- `docs/sources/rdf/rdf_recipe.yml` - Example recipes for basic and stateful ingestion
Integration Test Documentation
- `tests/integration/rdf/README.md` - Detailed guide for running integration tests
Configuration Example
source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary
Files Changed
Technical Notes
Security & Performance
Code Quality
New Files
- `src/datahub/ingestion/source/rdf/ingestion/rdf_source.py` - Main source implementation
- `src/datahub/ingestion/source/rdf/core/rdf_loader.py` - RDF loading utilities with security
- `src/datahub/ingestion/source/rdf/core/urn_generator.py` - URN generation with encoding
- `src/datahub/ingestion/source/rdf/entities/base.py` - Base interfaces for entity processing
- `src/datahub/ingestion/source/rdf/entities/registry.py` - Thread-safe entity registry
- `docs/sources/rdf/rdf.md` - User documentation
- `docs/sources/rdf/rdf_recipe.yml` - Recipe examples
- `tests/integration/rdf/test_rdf_source.py` - Integration tests
- `tests/unit/rdf/` - Unit tests (multiple files)
Modified Files
- `setup.py` - Added RDF source to entry points (line 862)
Breaking Changes
None - This is a new feature addition with no breaking changes to existing functionality.
Support Status
The RDF source is marked as INCUBATING (`SupportStatus.INCUBATING`), indicating it's ready for community adoption but may have minor version changes in future releases based on feedback.
Checklist
- `setup.py`
- (`@platform_name`, `@config_class`, `@support_status`)
- `test_connection()` implemented