Description
Requirements from @grantfitzsimmons:
Important
While not described in these requirements, the existing publishing mechanism using the app resources system and export feeds must be retained and backwards compatibility should be preserved for the near future.
Eventually, we need to discuss providing a utility for converting the legacy publishing pipeline to the new one or offering services to members for this conversion. For now, the legacy system and the modern system should remain distinct yet both functional.
These requirements describe the enhancement of the existing Specify 7 data publishing for ease of use and more efficient publishing. This includes Darwin Core publishing to aggregators and (potentially) publishing to web portals, which is currently not supported in Specify 7. Where possible, existing UI mechanisms should be used to ensure continuity for the user and enhance the user's intuitive experience.
These requirements were developed in conjunction with @acbentley and @tlammer.
Goals
- Enhance the current Specify 7 data publishing system.
- Improve user experience and efficiency in publishing data.
- Support DwC publishing to aggregators.
- Potentially enable future publishing to web portals.
- Leverage existing UI mechanisms for consistency and intuitiveness.
Aggregators we must support are below. Other aggregators that accept DwCA formatted data are also compatible:
Non-Functional Requirements
- NFR-01: Where possible, existing UI mechanisms should be used to ensure continuity for the user.
- NFR-02: The new system should enhance the user's intuitive experience.
- NFR-03: The user should never edit code to set up, map, or export their data.
- NFR-04: The user should never have to copy information from another site to map or export their data.
- NFR-05: More Darwin Core concepts and extensions may be added in the future, and users need to be able to map to those concepts.
- Users will need to add these to their existing exports, and updates should not be required on our side to use new terms.
- NFR-06: Specify’s export should require little effort on the user’s part.
Functional Requirements
Schema Configuration
- FR-01: In the Schema Config interface, if a field is selected that has one or more `spterm` records pointing to it via a `stringId`, it should show the `spterm.Term`, `spterm.Description`, `spterm.IRI`, and `spterm.VocabularyURI`. This gives the user an immediate clue as to what term a field is likely to be mapped to.
- FR-02: Add missing DwC fields to Specify #7602
Query Builder
- FR-04: Add support for a new "schema mapping" interface built atop the query builder, including a new column for mapping terms.
- FR-05: DwC default queries must be able to be copied so users can create their own modified version.
- FR-06: Only unhidden fields should be mapped in DwC default queries.
- FR-07: DwC default queries must not be added to every user's query list. They must be segmented from standard user queries (possibly via another menu item).
- FR-08: Implement auto-mapping for fields matching common DwC concepts.
- FR-09: Allow manual mapping of additional fields for concepts not automatically mapped and for fields used to limit results or ensure uniqueness.
- FR-10: Support mapping of fields and aggregated table formats.
- FR-11: Enable limiting query results through field criteria.
- FR-12: All exports should use the `YYYY-MM-DD` International (ISO) Standard format, regardless of the date format configured for the database.
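As a minimal sketch of FR-12, assuming dates reach the exporter as Python `date` or `datetime` objects, a formatter could look like the following. `format_export_date` is a hypothetical helper name, not an existing Specify function, and exporting missing dates as empty cells is an assumption.

```python
from datetime import date, datetime
from typing import Optional, Union


def format_export_date(value: Optional[Union[date, datetime]]) -> str:
    """Render a date as YYYY-MM-DD, ignoring any display format configured for the database."""
    if value is None:
        return ""  # assumption: missing dates are exported as empty cells
    if isinstance(value, datetime):
        value = value.date()  # drop the time component
    return value.isoformat()
```

Because the function never consults the database's configured display format, the output stays the same across installations.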
Darwin Core Mapping
This should be done in a query builder interface with an additional editable pick list where you can choose from a list of schema concepts.
- FR-14: Users must be able to modify the query they are using for mapping after having started the Darwin Core Mapping process.
- FR-15: Add the ability to select Darwin Core concepts in the UI to match specific query fields to concepts, both for the occurrence file and extension files.
- FR-16: Must be able to add static text that will map to DwC concepts without requiring a field mapping. This text value is stored on the query field definition.
- FR-17: If attachments (e.g. aggregated `CollectionObjectAttachments`) are included in an export, attachment URLs should be constructed automatically from the configured web asset server URL and collection, without additional configuration.
- FR-18: Automatically map fields to DwC concepts based on the `spterm` string IDs.
- FR-19: Support multiple table formats and aggregations #6435
- FR-20: Once a term has been mapped, it must not be mapped again. The UI must block the user from assigning the same term twice within a single mapping.
- FR-21: Must be able to add fields to exports that are not mapped to a DwC concept.
- FR-22: Uniqueness validation is context-dependent:
  - For Core mappings (e.g. CollectionObject), `occurrenceID` must be unique.
  - For Extension mappings (e.g. Determination), the unique key is the base table ID (e.g. `DeterminationID`), but the `occurrenceID` field must be present to link back to the Core. Multiple extension rows may share the same `occurrenceID`.
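The rule in FR-20 (a term may be assigned only once per mapping) could be checked along these lines. This is a sketch only: `find_duplicate_terms` is a hypothetical helper, and it assumes each query field contributes either a term name or `None` when it is exported without a term (FR-21).

```python
from collections import Counter
from typing import Iterable, List, Optional


def find_duplicate_terms(mapped_terms: Iterable[Optional[str]]) -> List[str]:
    """Return terms assigned to more than one query field, in first-seen order.

    None stands for a field exported without a DwC term, which FR-21 allows,
    so unmapped fields are never reported as duplicates.
    """
    counts = Counter(term for term in mapped_terms if term is not None)
    return [term for term, n in counts.items() if n > 1]
```

A UI could call this on every change and disable terms that are already taken, so the user never reaches an invalid state.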
I have already created a mapping of DwC concepts to Specify fields here: DWC Terms to Specify
We need to add all of the current accepted Darwin Core terms into the spterm table with the mapping described in this spreadsheet.
Validation
- FR-23: Provide the ability to validate results before exporting.
- FR-24: Include validation for duplicate records in the Core and Extension files.
Validation Steps:
- Verify that all required fields are present for publishing (for GBIF: `dwc:eventDate`, `dwc:basisOfRecord`, `dwc:scientificName`, and `dwc:occurrenceID`).
- Verify that each `occurrenceID` only appears once (for extensions, verify that the base table record IDs only appear once).
- Verify that the export mapping and EML are valid.
- Provide a link to the GBIF data validator so the user can verify it externally.
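The first two validation steps could be sketched as follows for a Core file. The function name `validate_core` and the shape of its inputs (bare term names as headers, rows as dictionaries) are assumptions for illustration; EML and mapping validation are not covered.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Required terms for GBIF as listed above (assumption: headers use bare term names).
GBIF_REQUIRED = ("eventDate", "basisOfRecord", "scientificName", "occurrenceID")


def validate_core(headers: Sequence[str], rows: Sequence[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = [f"missing required term: {term}" for term in GBIF_REQUIRED if term not in headers]
    if "occurrenceID" in headers:
        counts = Counter(row.get("occurrenceID", "") for row in rows)
        problems += [f"duplicate occurrenceID: {oid}" for oid, n in counts.items() if n > 1]
    return problems
```

For an Extension file, the same duplicate check would run against the base table ID column instead of `occurrenceID`.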
Data Output
- FR-25: File output must be a Darwin Core Archive (DwCA).
- FR-26: If all steps are followed correctly, the export produced must match current standards and be validated without errors by the GBIF data validator.
DwCA Ecological Metadata
- FR-27: There should be a straightforward mechanism for creating or adding Ecological Metadata Language (EML) associated with a published data set. We recommend using the EML generator built and maintained by GBIF Norway. When creating a new `exportdataset` record, users should be able to easily select and import an EML file to automatically create the app resource, minimizing any friction from the form.
  - As in FR-26, the EML created must match current standards and be validated without errors by the GBIF data validator.
RSS Publishing
- FR-28: Automatic RSS publishing needs to work without an external cron job (Make RSS Feed DwCA automatic export process internal #1166)
Permissions
- FR-29: Institution Administrators are the only users who can use the data publishing tools.
User Tool
- FR-30: Create a User Tool item that provides access to all the files for Data Publishing.
Darwin Core Updates & Versioning
- FR-31: The `spterm` table serves as the single source of truth for the Darwin Core version currently supported by the installation.
- FR-32: System-provided terms in `spterm` must be marked `IsSystem = True` and cannot be edited by users.
- FR-33: Specify software updates will handle standard changes (e.g., new terms, deprecated terms) by inserting or updating records in `spterm`.
- FR-34: Users must be able to manually add Custom Terms (`IsSystem = False`) to use new concepts before an official software update is released.
- FR-35: Existing mappings must remain stable during updates; since mappings link to the term's database ID, changes to a term's metadata (IRI or description) or the addition of new terms must not break existing export configurations.
```mermaid
stateDiagram-v2
    [*] --> SystemTerm : Specify Update Released
    SystemTerm : IsSystem = True
    SystemTerm : Provided by SCC
    [*] --> CustomTerm : User Adds Term
    CustomTerm : IsSystem = False
    CustomTerm : Managed by User
    state "Export Execution" as Export {
        [*] --> CheckMapping
        CheckMapping --> UseTermID : Mapping links to ID in spqueryfield
        UseTermID --> OutputHeader : Uses Term Name & IRI
    }
    SystemTerm --> Export : Used in Mappings
    CustomTerm --> Export : Used in Mappings
    Note right of SystemTerm
        Updates to description/IRI
        by Specify do not break
        mappings (ID stays same).
    End note
```
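The stability guarantee in FR-35 follows from storing only the term's ID on the mapping. The toy model below illustrates the idea; the `Term` class, the in-memory dictionaries, and the updated description text are hypothetical stand-ins rather than the actual schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Term:  # hypothetical stand-in for an spterm row
    id: int
    name: str
    description: str
    is_system: bool


terms: Dict[int, Term] = {
    1: Term(1, "catalogNumber", "An identifier for the record.", True),
}
# The mapping stores only the term ID, never a copy of the term's metadata.
field_to_term_id: Dict[str, int] = {"1.collectionobject.catalogNumber": 1}


def header_for(string_id: str) -> str:
    """Resolve the export header through the term ID at export time."""
    return terms[field_to_term_id[string_id]].name


# A software update rewrites the description (made-up text); the ID is unchanged,
# so the existing mapping still resolves to the same term.
terms[1].description = "Updated wording shipped in a later release."
```

Because the header is looked up at export time, metadata edits are picked up automatically without touching any saved mapping.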
Additional Deliverables
This work requires the implementation of technical components before beginning. These components will be packaged with the release as deliverables accessible directly to the user. Default mappings must be easily selected and used without requiring the user to build a query first.
These may be reviewed by the SCC member community and/or the board.
- Develop at least one default mapping from Collection Object to Darwin Core Occurrence, based on the aforementioned mapping.
- Develop default extension mappings for the following extensions:
- Identification History
- Audiovisual Core
- GGBN Material Sample
- EOL References
- Resource Relationship
Proposed Model
Below is a detailed outline of the model within Specify. This model distinguishes between standard queries and "Schema Mappings" to prevent user confusion and ensures terms are mapped at the field level.
Export Data Set Table exportdataset
An export data set groups together the critical components for publishing your data, used to create a data set on platforms like GBIF.
This is a replacement for the current ExportFeed app resource.
| Field | Type | Description | Example |
|---|---|---|---|
| ExportName | Text | The name of the export. | KUBI Ichthyology Voucher |
| FileName | Text | The name of the export file once packaged, always ending with `.zip`. | `kui-dwca.zip` |
| RSS | Checkbox | Indicates if this should be made available via the RSS feed when updated | Yes |
| Frequency | Integer | If published, this represents the number of days between automatically updating the RSS feed | |
| Metadata | Link to `spappresource` | A link to the app resource containing the Ecological Metadata Language (EML) created for the data set being published. | EML data sourced from GBIF or created using the GBIF EML generator |
| CoreMapping | Link to `schemamapping` | Links to the primary Core schema mapping for the export (e.g. Occurrence). | Voucher (schema mapping name) |
| Extensions | One-to-many to `extension` | A one-to-many relationship where many schema mappings can be linked to a single export mapping. | GBIF Identification, CO Audubon Core (schema mapping names) |
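To illustrate how the `RSS` and `Frequency` fields might drive automatic publishing (FR-28), here is a small sketch. The `last_published` timestamp is an assumption, since the table above does not define where it would be stored, and `is_update_due` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_update_due(rss: bool, frequency_days: Optional[int],
                  last_published: Optional[datetime], now: datetime) -> bool:
    """Decide whether an export data set should be regenerated for the RSS feed."""
    if not rss or not frequency_days:
        return False  # assumption: no frequency means manual updates only
    if last_published is None:
        return True  # never published before, so publish now
    return now - last_published >= timedelta(days=frequency_days)
```

An internal scheduler could evaluate this for every export data set on a fixed interval, which would remove the need for an external cron job.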
Extensions extensions
A join table that bridges exportdataset with schemamapping to capture the one-to-many nature of extensions.
| Field | Type | Description | Example |
|---|---|---|---|
| Mapping | Link to `schemamapping` | Links to the extension’s schema mapping. The system does not require the same number of rows as the Core, but the extension query must include the occurrenceID (inherited from the Core query) to facilitate the join. | CO Audubon Core |
| ExportDataSet | Link to `exportdataset` | Links to the export data set the extension is connected to. | |
Schema Mapping Table schemamapping
A schema mapping is a strict wrapper around the standard query system (spquery). It segregates "Mapping Queries" from standard "User Queries" in the UI.
This is the replacement for the `spexportschemamapping` system in Specify 6.
The distinction between the spquery and the schemamapping record should be invisible to the user for all intents and purposes. When the user creates a schemamapping, the UI should ask whether this is a "Core" or "Extension" mapping and let them provide a description. On the user side of things, the title of the query can be used to identify the mapping.
It is up to the development team to decide whether this table is needed or whether extensions to the spquery table are sufficient.
This table defines whether the underlying query is a Core (e.g., Occurrence) or an Extension (e.g., Audubon Core).
| Field | Type | Description | Example |
|---|---|---|---|
| Query | Link to `spquery` | One-to-One link to spquery. The underlying query engine handles the logic. (Required) | |
| MappingType | Enum | Defines the role: Core (Occurrence) or Extension. (Required) | Core |
| Description | Text | User-facing description of what this mapping achieves. | Maps Collection Object to DwC Occurrence |
Query Field Extensions spqueryfield modification
The existing spqueryfield table is extended to support mapping specific columns to terms and supporting static values. This allows the term to be associated with the specific output column, rather than the query as a whole.
| Field | Description | Example |
|---|---|---|
| Term | Nullable link to `spterm`. If set, this column is exported with the Term Name as the header. | `CatalogNumber` |
| IsStatic | Boolean. If true, the StringId is ignored. | `True` |
| StaticValue | The actual static text to export if IsStatic is true. | `"PreservedSpecimen"` |
Terms spterm
The Terms table/resource acts as the controlled vocabulary for Darwin Core and extension terms. This table represents the version currently supported by Specify.
- System Terms: Read-only terms provided by Specify updates.
- Custom Terms: Users can add new terms to support new extensions, but cannot edit system terms.
Important
Instead of StringID, if a mapping path is better (e.g. field names connected together) we should use that since it is easily understood by the user.
| Field | Description | Example |
|---|---|---|
| IRI | IRI (Internationalized Resource Identifier) is a unique, stable, and machine-readable identifier for a resource. This is used when constructing the meta.xml for publishing. | `http://rs.tdwg.org/dwc/terms/catalogNumber` |
| Term | A term is a standardized metadata element from a vocabulary used to consistently describe and share collections data such as specimens, observations, and related information. | catalogNumber |
| Description | The description provided by the schema (Read Only). | An identifier (preferably unique) for the record within the data set or collection. |
| StringID | The field stringId for the path added to the query to assist in automatically mapping it. | `1.collectionobject.catalogNumber` |
| VocabularyURI | Groups terms by schema. | http://rs.tdwg.org/dwc/terms/ |
| IsSystem | Indicates if this is a system-provided term. | True |
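Auto-mapping by `StringID` (FR-08, FR-18) amounts to a lookup from each query field's string ID to a term. The sketch below assumes the `spterm` rows have been loaded into a dictionary; `TERM_BY_STRING_ID` and `auto_map` are hypothetical names, and the second string ID in the usage note is illustrative only.

```python
from typing import Dict, Iterable, Optional

# Hypothetical subset of spterm rows, keyed by StringID, with the Term as the value.
TERM_BY_STRING_ID: Dict[str, str] = {
    "1.collectionobject.catalogNumber": "catalogNumber",
}


def auto_map(field_string_ids: Iterable[str]) -> Dict[str, Optional[str]]:
    """Propose a term for each query field.

    A value of None means no term matched and the field is left for manual mapping (FR-09).
    """
    return {sid: TERM_BY_STRING_ID.get(sid) for sid in field_string_ids}
```

For example, `auto_map(["1.collectionobject.catalogNumber", "1.collectionobject.text1"])` would propose `catalogNumber` for the first field and leave the second unmapped.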
High-Level Entity Relationship Diagram (ERD)
This diagram visualizes the relationships between the new tables and the existing system (spquery, spqueryfield, spappresource, collection).
```mermaid
erDiagram
    %% The main configuration object for an export
    exportdataset ||--|| schemamapping : "has Core Mapping"
    exportdataset ||--o{ extensions : "has Extensions"
    exportdataset ||--o{ spappresource : "has Metadata (EML)"
    exportdataset ||--o{ collection : "belongs to"
    %% The extension join table
    extensions }o--|| schemamapping : "uses Mapping"
    %% The Schema Mapping wrapper
    schemamapping ||--|| spquery : "wraps Query"
    schemamapping }o--|| specifyuser : "owned by"
    %% The Query and its fields
    spquery ||--o{ spqueryfield : "contains fields"
    %% The Term definition
    spqueryfield }o--|| spterm : "maps to Term"
    %% Term definitions
    spterm {
        string IRI
        string Term
        string VocabularyURI
        boolean IsSystem
    }
    %% Query Field Extensions
    spqueryfield {
        string FieldName
        boolean IsStatic
        string StaticValue
    }
    %% Schema Mapping Types
    schemamapping {
        enum MappingType "Core/Extension"
    }
```
Data Flow: Defining an Export
This sequence diagram demonstrates the workflow for an Institution Admin to define mappings and create an export data set.
```mermaid
sequenceDiagram
    participant User as Institution Admin
    participant UI as Specify 7 UI
    participant Backend as Export Engine
    participant DB as Database
    Note over User, DB: Step 1: Define Mappings (Core & Extensions)
    User->>UI: Create New Schema Mapping (Core)
    UI->>User: Display Query Builder with "DwC Term" column
    User->>UI: Select Fields & Map to Terms
    User->>UI: Save Mapping as "Core"
    UI->>DB: Insert into schemamapping & spquery
    User->>UI: Create New Schema Mapping (Extension)
    UI->>User: Display Query Builder
    User->>UI: Select Fields (must include occurrenceID)
    User->>UI: Map to Terms & Save as "Extension"
    UI->>DB: Insert into schemamapping & spquery
    Note over User, DB: Step 2: Create Export Data Set
    User->>UI: Create New Export Data Set
    User->>UI: Select Core Mapping
    User->>UI: Select Extension Mappings
    User->>UI: Link EML Metadata Resource
    UI->>DB: Save exportdataset
    Note over User, DB: Step 3: Execution (Publishing)
    User->>UI: Click "Export" / Auto-Scheduler triggers
    UI->>Backend: Request Export (ID)
    Backend->>DB: Fetch Core Query & Extension Queries via exportdataset
    Backend->>Backend: Execute Core Query -> Write Core CSV
    Backend->>Backend: Execute Extension Queries -> Write Ext CSVs
    Backend->>Backend: Generate meta.xml from mappings
    Backend->>Backend: Package ZIP (DwCA)
    Backend->>User: Return Download Link / Update RSS
```
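The packaging part of Step 3 could be sketched with the Python standard library as shown below. This is an illustration under stated assumptions, not the export engine itself: the function names and the file names `occurrence.csv` and `eml.xml` are hypothetical, only a Core file is written (extensions would add `<extension>` elements with a `coreid`), and the first column is assumed to hold the `occurrenceID`.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Sequence

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"
OCCURRENCE_ROW_TYPE = "http://rs.tdwg.org/dwc/terms/Occurrence"


def to_csv(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Serialize query results as CSV with a single header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows)
    return buf.getvalue()


def build_meta_xml(term_iris: Sequence[str], location: str = "occurrence.csv") -> bytes:
    """Describe a core file whose columns map, in order, to the given term IRIs."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": "eml.xml"})
    core = ET.SubElement(archive, "core", {
        "rowType": OCCURRENCE_ROW_TYPE, "encoding": "UTF-8",
        "fieldsTerminatedBy": ",", "linesTerminatedBy": "\\n",
        "fieldsEnclosedBy": '"', "ignoreHeaderLines": "1",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = location
    ET.SubElement(core, "id", {"index": "0"})  # assumption: column 0 is occurrenceID
    for index, iri in enumerate(term_iris):
        ET.SubElement(core, "field", {"index": str(index), "term": iri})
    return ET.tostring(archive, encoding="utf-8", xml_declaration=True)


def package_dwca(core_csv: str, meta_xml: bytes, eml_xml: str) -> bytes:
    """Zip the core file, meta.xml, and EML into an in-memory archive."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.csv", core_csv)
        archive.writestr("meta.xml", meta_xml)
        archive.writestr("eml.xml", eml_xml)
    return out.getvalue()
```

Generating `meta.xml` from the term IRIs stored in `spterm` keeps the archive descriptor in sync with the mapping, so users never edit XML by hand (NFR-03).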
Logic Flow: Field Mapping Execution
The backend needs to determine what value to write for a specific column in the export file.
```mermaid
flowchart TD
    Start[Start Export Row Processing] --> NextField{Next Column?}
    NextField -- Yes --> CheckStatic{IsStatic = True?}
    NextField -- No --> End[Finish Row]
    CheckStatic -- Yes --> WriteStatic[Write 'StaticValue' to CSV]
    WriteStatic --> NextField
    CheckStatic -- No --> CheckPath{Has StringID Path?}
    CheckPath -- Yes --> FetchDB[Fetch Value from DB using StringID]
    FetchDB --> WriteDB[Write DB Value to CSV]
    WriteDB --> NextField
    CheckPath -- No --> WriteNull[Write Empty/Null to CSV]
    WriteNull --> NextField
```
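The decision logic in the flowchart can be expressed as a small function. In this sketch, `QueryField` is a hypothetical view of an `spqueryfield` row, and the database row is assumed to be available as a mapping keyed by `StringID`; the real query engine would resolve the path differently.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class QueryField:  # hypothetical view of an spqueryfield row
    string_id: Optional[str] = None
    is_static: bool = False
    static_value: Optional[str] = None


def resolve_cell(field: QueryField, db_row: Mapping[str, object]) -> str:
    """Follow the flowchart: static value first, then the StringID path, else empty."""
    if field.is_static:
        return field.static_value or ""  # the StringID is ignored for static fields
    if field.string_id:
        value = db_row.get(field.string_id)
        return "" if value is None else str(value)
    return ""  # no static value and no path: write an empty cell
```

Checking `is_static` before the path mirrors the `IsStatic` field description, which states that the StringId is ignored when the flag is true.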
Original Issue
A user interface for mapping a Specify query to Darwin Core terms. This could be used multiple times throughout the interface, used to calculate MIDS levels for each record (#4604), share data easily to GBIF and other data aggregators, and much more.