Specify Data Publishing Toolkit #285

@benanhalt

Requirements from @grantfitzsimmons:

Important

While not described in these requirements, the existing publishing mechanism using the app resources system and export feeds must be retained and backwards compatibility should be preserved for the near future.

Eventually, we need to discuss providing a utility for converting the legacy publishing pipeline to the new one or offering services to members for this conversion. For now, the legacy system and the modern system should remain distinct yet both functional.

These requirements describe enhancements to the existing Specify 7 data publishing system that make publishing easier and more efficient. This includes Darwin Core publishing to aggregators and (potentially) publishing to web portals, the latter of which is not currently supported in Specify 7. Where possible, existing UI mechanisms should be used to ensure continuity and keep the experience intuitive for the user.

These requirements were developed in conjunction with @acbentley and @tlammer.

Goals

  • Enhance the current Specify 7 data publishing system.
  • Improve user experience and efficiency in publishing data.
  • Support DwC publishing to aggregators.
  • Potentially enable future publishing to web portals.
  • Leverage existing UI mechanisms for consistency and intuitiveness.

Aggregators we must support are below. Other aggregators that accept DwCA formatted data are also compatible:

Non-Functional Requirements

  • NFR-01: Where possible, existing UI mechanisms should be used to ensure continuity for the user.
  • NFR-02: The new system should enhance the user's intuitive experience.
  • NFR-03: The user should never edit code to set up, map, or export their data.
  • NFR-04: The user should never have to copy information from another site to map or export their data.
  • NFR-05: More Darwin Core concepts and extensions may be added in the future, and users need to be able to map to those concepts.
    • Users will need to add these to their existing exports, and updates should not be required on our side to use new terms.
  • NFR-06: Specify’s export should require little effort on the user’s part.

Functional Requirements

Schema Configuration

  • FR-01: In the Schema Config interface, if a field is selected that has one or more spterm records pointing to it via a stringId, it should show the spterm.Term, spterm.Description, spterm.IRI, and spterm.VocabularyURI. This gives the user an immediate clue as to what term a field is likely to be mapped to.
  • FR-02: Add missing DwC fields to Specify #7602

Query Builder

  • FR-04: Add support for a new "schema mapping" interface built atop the query builder, including a new column for mapping terms.
  • FR-05: DwC default queries must be able to be copied so users can create their own modified version.
  • FR-06: Only unhidden fields should be mapped in DwC default queries.
  • FR-07: DwC default queries must not be added to every user’s query list. They must be kept separate from standard user queries (possibly via another menu item).
  • FR-08: Implement auto-mapping for fields matching common DwC concepts.
  • FR-09: Allow manual mapping of additional fields for concepts not automatically mapped and for fields used to limit results or ensure uniqueness.
  • FR-10: Support mapping of fields and aggregated table formats.
  • FR-11: Enable limiting query results through field criteria.
  • FR-12: All exports should use the YYYY-MM-DD International (ISO) Standard format, regardless of the date format configured for the database.
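
As a rough illustration of FR-12, the sketch below normalizes a date value to YYYY-MM-DD at export time, independent of any display format. The function name `to_iso_date` is hypothetical, and partial dates (for example, year-only precision) are not handled here.

```python
from datetime import date, datetime


def to_iso_date(value: date) -> str:
    """Render a date as YYYY-MM-DD, ignoring any configured display format."""
    # datetime is a subclass of date, so drop the time component first.
    if isinstance(value, datetime):
        value = value.date()
    return value.isoformat()
```

Formatting at the export boundary, rather than relying on the database's configured format, keeps the output stable even if an institution changes its display preferences.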

Darwin Core Mapping

This should be done in a query builder interface with an additional editable pick list where you can choose from a list of schema concepts.

  • FR-14: Users must be able to modify the query they are using for mapping after having started the Darwin Core Mapping process.
  • FR-15: Add the ability to select Darwin Core concepts in the UI to match specific query fields to concepts, both for the occurrence file and extension files.
    • Users should not have to search another site to link a Specify field to a concept.
    • The user should be able to click on an icon (perhaps Image) which appears next to a term. This should show them a description of the term with a link to the quick reference guide if applicable.
  • FR-16: Must be able to add static text that will map to DwC concepts without requiring a field mapping. This text value is stored on the query field definition.
  • FR-17: If attachments (e.g. aggregated CollectionObjectAttachments) are included in an export, attachment URLs should be constructed automatically from the configured web asset server URL and collection, without additional configuration.
  • FR-18: Automatically map fields to DwC concepts based on the spterm string IDs.
  • FR-19: Support multiple table formats and aggregations #6435
  • FR-20: Once a term has been mapped, it must not be mapped again. The UI must block the user from assigning the same term twice within a single mapping.
  • FR-21: Must be able to add fields to exports that are not mapped to a DwC concept.
  • FR-22: Uniqueness validation is context-dependent:
    • For Core mappings (e.g. CollectionObject), occurrenceID must be unique.
    • For Extension mappings (e.g. Determination), the unique key is the base table ID (e.g. DeterminationID), but the occurrenceID field must be present to link back to the Core. Multiple extension rows may share the same occurrenceID.
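
The context-dependent rule in FR-22 could be checked along these lines. This is a minimal sketch, not the planned implementation: rows are assumed to be dictionaries keyed by column name, and `check_uniqueness` and the `baseTableId` key are hypothetical names.

```python
from collections import Counter


def check_uniqueness(rows, mapping_type, base_id_key="baseTableId"):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if mapping_type == "Core":
        key = "occurrenceID"
    elif mapping_type == "Extension":
        key = base_id_key
        # Extension rows may share an occurrenceID, but each must carry one.
        for i, row in enumerate(rows):
            if not row.get("occurrenceID"):
                problems.append(f"row {i}: missing occurrenceID")
    else:
        raise ValueError(f"unknown mapping type: {mapping_type!r}")

    # Count how often each key value occurs and report repeats.
    counts = Counter(row.get(key) for row in rows)
    for value, n in counts.items():
        if n > 1:
            problems.append(f"{key} {value!r} appears {n} times")
    return problems
```

Returning a list of problems instead of raising on the first one lets the UI show every issue in a single validation pass.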

I have already created a mapping of DwC concepts to Specify fields here: DWC Terms to Specify

We need to add all of the current accepted Darwin Core terms into the spterm table with the mapping described in this spreadsheet.

Validation

  • FR-23: Provide the ability to validate results before exporting.
  • FR-24: Include validation for duplicate records in the Core and Extension files.

Validation Steps:

  1. Verify that all required fields are present for publishing (for GBIF: dwc:eventDate, dwc:basisOfRecord, dwc:scientificName, and dwc:occurrenceID).
  2. Verify that each occurrenceID only appears once (for extensions, verify that the base table record IDs only appear once).
  3. Verify that the export mapping and EML are valid.
  4. Provide a link to the GBIF data validator so the user can verify it externally.
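
Step 1 above can be sketched as a simple set comparison against the mapped term names. The constant `GBIF_REQUIRED_TERMS` and the function `missing_required_terms` are hypothetical names; the term list itself is taken from the step as written.

```python
# Terms GBIF requires, as listed in validation step 1.
GBIF_REQUIRED_TERMS = ("eventDate", "basisOfRecord", "scientificName", "occurrenceID")


def missing_required_terms(mapped_terms, required=GBIF_REQUIRED_TERMS):
    """Return the required terms absent from the mapping, in listed order."""
    present = set(mapped_terms)
    return [term for term in required if term not in present]
```

Passing the required list as a parameter leaves room for other aggregators with different minimum requirements.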

Data Output

  • FR-25: File output must be a Darwin Core Archive (DwCA).
  • FR-26: If all steps are followed correctly, the export produced must match current standards and be validated without errors by the GBIF data validator.

DwCA Ecological Metadata

  • FR-27: There should be a straightforward mechanism for creating or adding Ecological Metadata Language (EML) associated with a published data set. We recommend using the EML generator built and maintained by GBIF Norway. When creating a new exportdataset record, users should easily select and import an EML file to automatically create the app resource, minimizing any friction from the form.
    • As in FR-26, the EML created must match current standards and be validated without errors by the GBIF data validator.

RSS Publishing

Permissions

  • FR-29: Institution Administrators are the only users who can use the data publishing tools.

User Tool

  • FR-30: Create a User Tool item that provides access to all Data Publishing files.

Darwin Core Updates & Versioning

  • FR-31: The spterm table serves as the single source of truth for the Darwin Core version currently supported by the installation.
  • FR-32: System-provided terms in spterm must be marked IsSystem = True and cannot be edited by users.
  • FR-33: Specify software updates will handle standard changes (e.g., new terms, deprecated terms) by inserting or updating records in spterm.
  • FR-34: Users must be able to manually add Custom Terms (IsSystem = False) to use new concepts before an official software update is released.
  • FR-35: Existing mappings must remain stable during updates; since mappings link to the term's database ID, changes to a term's metadata (IRI or description) or the addition of new terms must not break existing export configurations.
stateDiagram-v2
    [*] --> SystemTerm : Specify Update Released
    SystemTerm : IsSystem = True
    SystemTerm : Provided by SCC

    [*] --> CustomTerm : User Adds Term
    CustomTerm : IsSystem = False
    CustomTerm : Managed by User

    state "Export Execution" as Export {
        [*] --> CheckMapping
        CheckMapping --> UseTermID : Mapping links to ID in spqueryfield
        UseTermID --> OutputHeader : Uses Term Name & IRI
    }

    SystemTerm --> Export : Used in Mappings
    CustomTerm --> Export : Used in Mappings

    Note right of SystemTerm
        Updates to description/IRI
        by Specify do not break
        mappings (ID stays same).
    End note
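
To illustrate why ID-based mappings survive updates (FR-33 and FR-35), here is a small sketch using an in-memory SQLite table. The column names, the use of the term name as the lookup key, and the choice to leave custom terms untouched when names collide are all assumptions made for this example; the real spterm schema may differ.

```python
import sqlite3


def create_schema(conn):
    """Create a simplified stand-in for the spterm table."""
    conn.execute(
        "CREATE TABLE spterm ("
        "id INTEGER PRIMARY KEY, term TEXT UNIQUE, iri TEXT, "
        "description TEXT, is_system INTEGER)"
    )


def apply_term_update(conn, term, iri, description):
    """Insert a system term, or refresh its metadata while keeping its ID."""
    row = conn.execute(
        "SELECT id, is_system FROM spterm WHERE term = ?", (term,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO spterm (term, iri, description, is_system) "
            "VALUES (?, ?, ?, 1)",
            (term, iri, description),
        )
        return cur.lastrowid
    term_id, is_system = row
    if is_system:
        # Only metadata changes; the primary key that mappings point to stays.
        conn.execute(
            "UPDATE spterm SET iri = ?, description = ? WHERE id = ?",
            (iri, description, term_id),
        )
    # Custom terms (is_system = 0) are user-managed and left untouched here.
    return term_id
```

Because the update path never deletes and re-inserts a row, any spqueryfield record that stores the term's ID continues to resolve after the update.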

Additional Deliverables

This work requires the implementation of technical components before beginning. These components will be packaged with the release as deliverables accessible directly to the user. Default mappings must be easily selected and used without requiring the user to build a query first.

These may be reviewed by the SCC member community and/or the board.

Proposed Model

Below is a detailed outline of the model within Specify. This model distinguishes between standard queries and "Schema Mappings" to prevent user confusion and ensures terms are mapped at the field level.

Image

Export Data Set Table exportdataset

An export data set groups together the critical components for publishing your data, used to create a data set on platforms like GBIF.

This is a replacement for the current ExportFeed app resource.

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| ExportName | Text | The name of the export. | KUBI Ichthyology Voucher |
| FileName | Text | The name of the export file once packaged, always ending with .zip. | kui-dwca.zip |
| RSS | Checkbox | Indicates whether the export should be made available via the RSS feed when updated. | Yes |
| Frequency | Integer | If published, the number of days between automatic updates of the RSS feed. | |
| Metadata | Link to spappresource | A link to the app resource containing the Ecological Metadata Language (EML) created for the data set being published. | EML data sourced from GBIF or created using the GBIF EML generator |
| CoreMapping | Link to schemamapping | Links to the primary Core schema mapping for the export (e.g. Occurrence). | Voucher (schema mapping name) |
| Extensions | One-to-many to extensions | A one-to-many relationship where many schema mappings can be linked to a single export data set. | GBIF Identification, CO Audubon Core (schema mapping names) |

Extensions extensions

A join table that bridges exportdataset with schemamapping to capture the one-to-many nature of extensions.

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| Mapping | Link to schemamapping | Links to the extension’s schema mapping. The system does not require the same number of rows as the Core, but the extension query must include the occurrenceID (inherited from the Core query) to facilitate the join. | CO Audubon Core |
| ExportDataSet | Link to exportdataset | Links to the export data set the extension is connected to. | |

Schema Mapping Table schemamapping

A schema mapping is a strict wrapper around the standard query system (spquery). It segregates "Mapping Queries" from standard "User Queries" in the UI.

This is the replacement for the spexportschemamapping system in Specify 6.

The distinction between the spquery and the schemamapping record should, for all intents and purposes, be invisible to the user. When the user creates a schemamapping, the system should ask whether it is a "Core" or "Extension" mapping and let them provide a description. On the user side, the title of the query can be used to identify the mapping.

Whether this table is needed, or whether extensions to the spquery table are sufficient, is up to the development team.

This table defines whether the underlying query is a Core (e.g., Occurrence) or an Extension (e.g., Audubon Core).

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| Query | Link to spquery | One-to-One link to spquery. The underlying query engine handles the logic. (Required) | |
| MappingType | Enum | Defines the role: Core (Occurrence) or Extension. (Required) | Core |
| Description | Text | User-facing description of what this mapping achieves. | Maps Collection Object to DwC Occurrence |

Query Field Extensions spqueryfield modification

The existing spqueryfield table is extended to support mapping specific columns to terms and exporting static values. This allows the term to be associated with the specific output column, rather than the query as a whole.

| Field | Description | Example |
| --- | --- | --- |
| Term | Nullable link to spterm. If set, this column is exported with the Term Name as the header. | CatalogNumber |
| IsStatic | Boolean. If true, the StringId is ignored. | True |
| StaticValue | The actual static text to export if IsStatic is true. | "PreservedSpecimen" |

Terms spterm

The Terms table/resource acts as the controlled vocabulary for Darwin Core and extension terms. This table represents the version currently supported by Specify.

  • System Terms: Read-only terms provided by Specify updates.
  • Custom Terms: Users can add new terms to support new extensions, but cannot edit system terms.

Important

If a mapping path (e.g. field names joined together) works better than StringID, we should use that instead, since it is more easily understood by the user.

| Field | Description | Example |
| --- | --- | --- |
| IRI | IRI (Internationalized Resource Identifier) is a unique, stable, and machine-readable identifier for a resource. This is used when constructing the meta.xml for publishing. | http://rs.tdwg.org/dwc/terms/catalogNumber |
| Term | A term is a standardized metadata element from a vocabulary used to consistently describe and share collections data such as specimens, observations, and related information. | catalogNumber |
| Description | The description provided by the schema (Read Only). | An identifier (preferably unique) for the record within the data set or collection. |
| StringID | The field stringId for the path added to the query to assist in automatically mapping it. | 1.collectionobject.catalogNumber |
| VocabularyURI | Groups terms by schema. | http://rs.tdwg.org/dwc/terms/ |
| IsSystem | Indicates if this is a system-provided term. | True |

High-Level Entity Relationship Diagram (ERD)

This diagram visualizes the relationships between the new tables and the existing system (spquery, spqueryfield, spappresource, collection).

erDiagram
    %% The main configuration object for an export
    exportdataset ||--|| schemamapping : "has Core Mapping"
    exportdataset ||--o{ extensions : "has Extensions"
    exportdataset ||--o{ spappresource : "has Metadata (EML)"
    exportdataset ||--o{ collection : "belongs to"

    %% The extension join table
    extensions }o--|| schemamapping : "uses Mapping"

    %% The Schema Mapping wrapper
    schemamapping ||--|| spquery : "wraps Query"
    schemamapping }o--|| specifyuser : "owned by"

    %% The Query and its fields
    spquery ||--o{ spqueryfield : "contains fields"

    %% The Term definition
    spqueryfield }o--|| spterm : "maps to Term"

    %% Term definitions
    spterm {
        string IRI
        string Term
        string VocabularyURI
        boolean IsSystem
    }

    %% Query Field Extensions
    spqueryfield {
        string FieldName
        boolean IsStatic
        string StaticValue
    }

    %% Schema Mapping Types
    schemamapping {
        enum MappingType "Core/Extension"
    }
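
For readers who prefer code to diagrams, the relationships above can be approximated with plain dataclasses. This is only a sketch of the proposed shape, assuming in-memory objects instead of database tables; the class names and snake_case attribute names are hypothetical, and validation is limited to rules stated in the tables above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class MappingType(Enum):
    CORE = "Core"
    EXTENSION = "Extension"


@dataclass
class QueryField:
    string_id: str = ""
    term: Optional[str] = None          # nullable link to a term
    is_static: bool = False
    static_value: Optional[str] = None


@dataclass
class SchemaMapping:
    name: str
    mapping_type: MappingType
    fields: List[QueryField] = field(default_factory=list)
    description: str = ""


@dataclass
class ExportDataSet:
    export_name: str
    file_name: str
    core_mapping: SchemaMapping
    extensions: List[SchemaMapping] = field(default_factory=list)
    rss: bool = False
    frequency_days: Optional[int] = None

    def __post_init__(self):
        # FileName always ends with .zip, per the exportdataset table.
        if not self.file_name.endswith(".zip"):
            raise ValueError("FileName must end with .zip")
        if self.core_mapping.mapping_type is not MappingType.CORE:
            raise ValueError("CoreMapping must be a Core schema mapping")
        for ext in self.extensions:
            if ext.mapping_type is not MappingType.EXTENSION:
                raise ValueError(f"{ext.name!r} is not an Extension mapping")
```

Checking the mapping types at construction time mirrors the intent of the Core/Extension split: a data set cannot be saved with an extension query in the Core slot.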

Data Flow: Defining an Export

This sequence diagram demonstrates the workflow for an Institution Admin to define mappings and create an export data set.

sequenceDiagram
    participant User as Institution Admin
    participant UI as Specify 7 UI
    participant Backend as Export Engine
    participant DB as Database

    Note over User, DB: Step 1: Define Mappings (Core & Extensions)

    User->>UI: Create New Schema Mapping (Core)
    UI->>User: Display Query Builder with "DwC Term" column
    User->>UI: Select Fields & Map to Terms
    User->>UI: Save Mapping as "Core"
    UI->>DB: Insert into schemamapping & spquery

    User->>UI: Create New Schema Mapping (Extension)
    UI->>User: Display Query Builder
    User->>UI: Select Fields (must include occurrenceID)
    User->>UI: Map to Terms & Save as "Extension"
    UI->>DB: Insert into schemamapping & spquery

    Note over User, DB: Step 2: Create Export Data Set

    User->>UI: Create New Export Data Set
    User->>UI: Select Core Mapping
    User->>UI: Select Extension Mappings
    User->>UI: Link EML Metadata Resource
    UI->>DB: Save exportdataset

    Note over User, DB: Step 3: Execution (Publishing)

    User->>UI: Click "Export" / Auto-Scheduler triggers
    UI->>Backend: Request Export (ID)
    Backend->>DB: Fetch Core Query & Extension Queries via exportdataset
    Backend->>Backend: Execute Core Query -> Write Core CSV
    Backend->>Backend: Execute Extension Queries -> Write Ext CSVs
    Backend->>Backend: Generate meta.xml from mappings
    Backend->>Backend: Package ZIP (DwCA)
    Backend->>User: Return Download Link / Update RSS
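
The last two backend steps (generating meta.xml and packaging the ZIP) could be sketched as follows for a Core-only archive. This assumes the Core CSV and EML are already available as strings; the function names, the `occurrence.csv` and `eml.xml` file names, and the choice to skip columns without a term (FR-21) are assumptions for illustration. Extensions would add `extension` elements with a `coreid` index, which is omitted here for brevity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta_xml(core_file, row_type, column_iris, id_index=0):
    """Build a Core-only meta.xml; column_iris holds one IRI (or None) per column."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": "eml.xml"})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # written literally as \n in the XML
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",
        "rowType": row_type,
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = core_file
    ET.SubElement(core, "id", {"index": str(id_index)})
    for index, iri in enumerate(column_iris):
        # Columns without a term are not described in meta.xml.
        if iri is not None:
            ET.SubElement(core, "field", {"index": str(index), "term": iri})
    return ET.tostring(archive, encoding="unicode")


def package_dwca(core_csv, meta_xml, eml_xml, core_file="occurrence.csv"):
    """Return the bytes of a ZIP containing the Core file, meta.xml, and eml.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(core_file, core_csv)
        archive.writestr("meta.xml", meta_xml)
        archive.writestr("eml.xml", eml_xml)
    return buffer.getvalue()
```

Building the archive in memory keeps the sketch self-contained; a real export of a large collection would more likely stream to a file on disk.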

Logic Flow: Field Mapping Execution

The backend needs to determine what value to write for a specific column in the export file.

flowchart TD
    Start[Start Export Row Processing] --> NextField{Next Column?}
    NextField -- Yes --> CheckStatic{IsStatic = True?}
    NextField -- No --> End[Finish Row]

    CheckStatic -- Yes --> WriteStatic[Write 'StaticValue' to CSV]
    WriteStatic --> NextField

    CheckStatic -- No --> CheckPath{Has StringID Path?}
    CheckPath -- Yes --> FetchDB[Fetch Value from DB using StringID]
    FetchDB --> WriteDB[Write DB Value to CSV]
    WriteDB --> NextField

    CheckPath -- No --> WriteNull[Write Empty/Null to CSV]
    WriteNull --> NextField
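
The decision logic in the flowchart can be expressed compactly in code. In this sketch, each query field is a dictionary with the `IsStatic`, `StaticValue`, and `StringId` keys described earlier, and `db_row` is a dictionary keyed by string ID that stands in for the query engine's result; both shapes and the function names are assumptions.

```python
def resolve_column_value(query_field, db_row):
    """Return the text to write for one export column."""
    if query_field.get("IsStatic"):
        # Static columns ignore the StringId entirely.
        return query_field.get("StaticValue") or ""
    string_id = query_field.get("StringId")
    if string_id:
        value = db_row.get(string_id)
        return "" if value is None else str(value)
    # Neither static nor mapped to a path: write an empty cell.
    return ""


def build_export_row(query_fields, db_row):
    """Resolve every column of a row, preserving query field order."""
    return [resolve_column_value(qf, db_row) for qf in query_fields]
```

Checking IsStatic first matches the flowchart's ordering, so a field that has both a static value and a leftover string ID still exports the static text.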

Original Issue

@grantfitzsimmons:

A user interface for mapping a Specify query to Darwin Core terms. This could be reused throughout the interface, for example to calculate MIDS levels for each record (#4604), to share data easily with GBIF and other data aggregators, and much more.
