Specify Data Publishing Toolkit

## Requirements from @grantfitzsimmons:

> [!IMPORTANT]
> While not described in these requirements, the **existing** publishing mechanism using the app resources system and export feeds must be retained and backwards compatibility should be preserved for the near future.
>
> Eventually, we need to discuss providing a utility for converting the legacy publishing pipeline to the new one or offering services to members for this conversion. For now, the legacy system and the modern system should remain _distinct_ yet both functional.

These requirements describe enhancements to the existing Specify 7 data publishing system that make publishing easier and more efficient. This includes [Darwin Core](https://dwc.tdwg.org/terms/#dwc:associatedTaxa) publishing to aggregators and (potentially) publishing to web portals, the latter of which is not currently supported in Specify 7. Where possible, existing UI mechanisms should be used to ensure continuity for the user and enhance the user's intuitive experience.

These requirements were developed in conjunction with @acbentley and @tlammer.

## Goals

* Enhance the current Specify 7 data publishing system.
* Improve user experience and efficiency in publishing data.
* Support DwC publishing to aggregators.
* Potentially enable future publishing to web portals.
* Leverage existing UI mechanisms for consistency and intuitiveness.

The aggregators we **must** support are listed below. Other aggregators that accept DwCA-formatted data are also compatible:
* [GBIF](https://www.gbif.org/)
* [iDigBio](https://www.idigbio.org/)
* [Fishnet2](https://www.fishnet2.net/)
* [GGBN](https://www.ggbn.org/ggbn_portal/)

## Non-Functional Requirements

* **NFR-01:** Where possible, existing UI mechanisms should be used to ensure continuity for the user.
* **NFR-02:** The new system should enhance the user's intuitive experience.
* **NFR-03:** The user should **never edit code to set up, map, or export** their data.
* **NFR-04:** The user should never have to _copy_ information from another site to map or export their data.
* **NFR-05:** More Darwin Core concepts and extensions may be added in the future, and users need to be able to map to those concepts.
    * Users will need to add these to their **existing** exports, and updates should not be required on our side to use new terms.
* **NFR-06:** Specify’s export should require little effort on the user’s part.

## Functional Requirements

### Schema Configuration

* **FR-01:** In the Schema Config interface, if a field is selected that has one or more `spterm` records pointing to it via a `stringId`, it should show the `spterm.Term`, `spterm.Description`, `spterm.IRI`, and `spterm.VocabularyURI`. This gives the user an immediate clue as to what term a field is likely to be mapped to.
* **FR-02:** [https://github.com/specify/specify7/issues/7602](https://github.com/specify/specify7/issues/7602)

### Query Builder

* **FR-04:** Add support for a new "schema mapping" interface built atop the query builder, including a new column for mapping terms.
* **FR-05:** DwC default queries must be able to be copied so users can create their own modified version.
* **FR-06:** Only unhidden fields should be mapped in DwC default queries.
* **FR-07:** DwC default queries must not be added to every user's query list. They must be **segmented** from standard user queries (possibly via another menu item).
* **FR-08:** Implement auto-mapping for fields matching common DwC concepts.
* **FR-09:** Allow manual mapping of additional fields for concepts not automatically mapped and for fields used to limit results or ensure uniqueness.
* **FR-10:** Support mapping of fields and aggregated table formats.
* **FR-11:** Enable limiting query results through field criteria.
* **FR-12:** All exports should use the `YYYY-MM-DD` International (ISO) Standard format, regardless of the date format configured for the database.
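As a rough illustration of FR-12, the sketch below normalizes a value to `YYYY-MM-DD` before it is written to an export. The function name `to_iso_date` and the `configured_format` parameter are hypothetical; they stand in for however the export code learns the date format configured for the database.

```python
from datetime import date, datetime


def to_iso_date(value, configured_format="%m/%d/%Y"):
    """Return `value` as YYYY-MM-DD, regardless of the display format.

    `configured_format` is a hypothetical strptime pattern standing in for
    the date format configured for the database.
    """
    if value is None or value == "":
        return ""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    # Anything else is treated as a string in the configured display format.
    return datetime.strptime(value, configured_format).date().isoformat()
```

Values that are already `date` objects never pass through the display format, so the configured format only matters when a formatted string reaches the exporter.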

### Darwin Core Mapping

This should be done in a query builder interface with an additional editable pick list where you can choose from a list of schema concepts.

* **FR-14:** Users must be able to modify the query they are using for mapping after having started the Darwin Core Mapping process.
* **FR-15:** Add the ability to select Darwin Core concepts in the UI to match specific query **fields** to concepts, both for the occurrence file and extension files.
    * Users should not have to search another site to link a Specify field to a concept.
    * The user should be able to click on an icon (perhaps <img width="15" height="15" alt="Image" src="https://github.com/user-attachments/assets/ea6d1d6c-664e-45b1-b594-1d44d698fa26" />) which appears next to a term. This should show them a description of the term with a link to the quick reference guide if applicable.
* **FR-16:** Must be able to add static text that will map to DwC concepts without requiring a field mapping. This text value is stored on the query field definition.
* **FR-17:** _If_ attachments (e.g. aggregated `CollectionObjectAttachments`) are included in an export, attachment URLs should be constructed automatically from the configured web asset server URL and collection, without additional configuration.
* **FR-18:** Automatically map fields to DwC concepts based on the `spterm` string IDs.
* **FR-19:** [https://github.com/specify/specify7/issues/6435](https://github.com/specify/specify7/issues/6435)
* **FR-20:** Once a term has been mapped, it must not be mapped again. The UI must block the user from assigning the same term twice within a single mapping.
* **FR-21:** Must be able to add fields to exports that are not mapped to a DwC concept.
* **FR-22:** Uniqueness validation is context-dependent:
	* For **Core** mappings (e.g. CollectionObject), `occurrenceID` must be unique.
	* For **Extension** mappings (e.g. Determination), the unique key is the base table ID (e.g. `DeterminationID`), but the `occurrenceID` field must be present to link back to the Core. Multiple extension rows may share the same `occurrenceID`.
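FR-18 and FR-20 interact: auto-mapping should suggest a term from the `spterm` string IDs, but never the same term twice within one mapping. The following is a minimal sketch under stated assumptions. The function `auto_map` and the dictionary keys `term` and `string_id` are hypothetical stand-ins for `spterm` rows, not an existing API.

```python
def auto_map(query_field_string_ids, terms):
    """Suggest one term per query field by matching stringId values.

    `terms` is a list of dicts with the hypothetical keys "term" and
    "string_id". Returns a list of (string_id, term_or_None) pairs in
    field order. Each term is assigned at most once (FR-20).
    """
    by_string_id = {}
    for t in terms:
        by_string_id.setdefault(t["string_id"], []).append(t["term"])

    used = set()
    suggestions = []
    for string_id in query_field_string_ids:
        candidates = by_string_id.get(string_id, [])
        # Take the first candidate that has not been assigned yet.
        choice = next((c for c in candidates if c not in used), None)
        if choice is not None:
            used.add(choice)
        suggestions.append((string_id, choice))
    return suggestions
```

Fields that come back with `None` are the ones left for manual mapping (FR-09) or for export without a DwC concept (FR-21).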

I have already created a mapping of DwC concepts to Specify fields here: [**DWC Terms to Specify**](https://docs.google.com/spreadsheets/d/1KHLYvvndBYkbKU2YwdHSXa4t_rZlcQ5IgW8licn_F40/edit?gid=1471306195#gid=1471306195)

We need to add all currently accepted Darwin Core terms to the `spterm` table, using the mapping described in this spreadsheet.

### Validation

* **FR-23:** Provide the ability to validate results before exporting.
* **FR-24:** Include validation for duplicate records in the Core and Extension files.

**Validation Steps:**
1. Verify that all required fields are present for publishing (for GBIF: `dwc:eventDate`, `dwc:basisOfRecord`, `dwc:scientificName`, and `dwc:occurrenceID`).
2. Verify that each `occurrenceID` only appears once (for extensions, verify that the base table record IDs only appear once).
3. Verify that the export mapping and EML are valid.
4. Provide a link to the GBIF data validator so the user can verify it externally.
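Validation steps 1 and 2 can be sketched as a single check over the header and rows of one output file. This is an illustration only: `validate_rows`, its parameters, and the message wording are hypothetical, and the required-term list is the GBIF list from step 1.

```python
from collections import Counter

# Required terms for GBIF, as listed in validation step 1.
GBIF_REQUIRED = ["eventDate", "basisOfRecord", "scientificName", "occurrenceID"]


def validate_rows(header, rows, mapping_type="Core", key_column="occurrenceID"):
    """Return a list of problems; an empty list means these checks passed.

    header: list of column names. rows: list of dicts keyed by column name.
    For an Extension, pass the base table ID column (e.g. "DeterminationID")
    as `key_column`; `occurrenceID` must still be present but may repeat.
    """
    if mapping_type == "Core":
        required = GBIF_REQUIRED
    else:
        required = ["occurrenceID", key_column]

    problems = []
    # dict.fromkeys removes duplicates while keeping the order.
    for column in dict.fromkeys(required):
        if column not in header:
            problems.append(f"missing required column: {column}")

    if key_column in header:
        counts = Counter(row.get(key_column) for row in rows)
        for value, n in counts.items():
            if n > 1:
                problems.append(f"duplicate {key_column}: {value} ({n} rows)")
    return problems
```

Returning messages rather than raising on the first failure lets the UI show every problem in one pass before the user exports.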

### Data Output

* **FR-25:** File output must be a Darwin Core Archive (DwCA).
* **FR-26:** If all steps are followed correctly, the export produced **must match** current standards and be validated without errors by the [GBIF data validator](https://www.gbif.org/tools/data-validator).

### DwCA Ecological Metadata

* **FR-27:** There should be a straightforward mechanism for creating or adding Ecological Metadata Language (EML) associated with a published data set. We recommend using the [EML generator built and maintained by GBIF Norway](https://gbif-norway.github.io/eml-generator-js). When creating a new `exportdataset` record, users should be able to select and import an EML file easily; the import automatically creates the app resource, minimizing friction in the form.
    * As in FR-26, the EML created must match current standards and be validated without errors by the [GBIF data validator](https://www.gbif.org/tools/data-validator).

### RSS Publishing

* **FR-28:** Automatic RSS publishing needs to work without an external cron job ([https://github.com/specify/specify7/issues/1166](https://github.com/specify/specify7/issues/1166)).
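Whatever replaces the external cron job will need to decide when an export data set is due for republishing. The sketch below shows one way to make that decision from the `RSS` and `Frequency` fields. The function `is_update_due` and the `last_published` timestamp are assumptions for illustration; the proposed model does not currently define a last-published field.

```python
from datetime import datetime, timedelta


def is_update_due(rss_enabled, frequency_days, last_published, now):
    """Decide whether an export data set should be republished.

    `last_published` is a hypothetical timestamp of the last successful
    publish (None if the data set has never been published).
    """
    if not rss_enabled:
        return False
    if last_published is None:
        # Never published: publish once so the feed has an entry.
        return True
    if not frequency_days or frequency_days <= 0:
        # No schedule configured: only manual exports update the feed.
        return False
    return now - last_published >= timedelta(days=frequency_days)
```

An in-process scheduler or background worker could call a check like this periodically for each export data set; which mechanism to use is left open here.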

### Permissions

* **FR-29:** Institution Administrators are the only users who can use the data publishing tools. 

### User Tool

* **FR-30:** Create a User Tool item that provides access to all Data Publishing files.

### Darwin Core Updates & Versioning

* **FR-31:** The `spterm` table serves as the single source of truth for the Darwin Core version currently supported by the installation.
* **FR-32:** System-provided terms in `spterm` must be marked `IsSystem = True` and cannot be edited by users.
* **FR-33:** Specify software updates will handle standard changes (e.g., new terms, deprecated terms) by inserting or updating records in `spterm`.
* **FR-34:** Users must be able to manually add **Custom Terms** (`IsSystem = False`) to use new concepts before an official software update is released.
* **FR-35:** Existing mappings must remain stable during updates; since mappings link to the term's database ID, changes to a term's metadata (IRI or description) or the addition of new terms must not break existing export configurations.

```mermaid
stateDiagram-v2
    [*] --> SystemTerm : Specify Update Released
    SystemTerm : IsSystem = True
    SystemTerm : Provided by SCC

    [*] --> CustomTerm : User Adds Term
    CustomTerm : IsSystem = False
    CustomTerm : Managed by User

    state "Export Execution" as Export {
        [*] --> CheckMapping
        CheckMapping --> UseTermID : Mapping links to ID in spqueryfield
        UseTermID --> OutputHeader : Uses Term Name & IRI
    }

    SystemTerm --> Export : Used in Mappings
    CustomTerm --> Export : Used in Mappings

    Note right of SystemTerm
        Updates to description/IRI
        by Specify do not break
        mappings (ID stays same).
    End note

```
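To make FR-33 through FR-35 concrete, here is a minimal sketch of how a software update might insert or refresh system terms without disturbing existing mappings. It assumes that system terms are matched on the pair (`VocabularyURI`, `Term`); that matching rule, the function name `apply_system_term_update`, and the in-memory dictionary standing in for the `spterm` table are illustrative assumptions, not decisions made in these requirements.

```python
def apply_system_term_update(terms, incoming):
    """Insert or update system terms in place; return the IDs that were added.

    terms: dict mapping a term's database ID to a dict with the keys
    "vocabulary_uri", "term", "iri", "description", and "is_system".
    incoming: list of dicts with the first four keys, shipped with an update.
    """
    # Only system terms are candidates for updating; custom terms are skipped.
    system_ids = {
        (t["vocabulary_uri"], t["term"]): term_id
        for term_id, t in terms.items()
        if t["is_system"]
    }
    next_id = max(terms, default=0) + 1
    added = []
    for new in incoming:
        key = (new["vocabulary_uri"], new["term"])
        term_id = system_ids.get(key)
        if term_id is None:
            terms[next_id] = {**new, "is_system": True}
            system_ids[key] = next_id
            added.append(next_id)
            next_id += 1
        else:
            # Same row, same ID: only the metadata changes (FR-35).
            terms[term_id]["iri"] = new["iri"]
            terms[term_id]["description"] = new["description"]
    return added
```

Because existing rows keep their IDs, any `spqueryfield` that links to a term continues to resolve after the update. How to reconcile a custom term that duplicates a newly shipped system term is not covered by this sketch.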

### Additional Deliverables

This work requires the implementation of technical components before beginning. These components will be packaged with the release as deliverables accessible directly to the user. Default mappings must be easily selected and used without requiring the user to build a query first.

These may be reviewed by the SCC member community and/or the board.

* [ ] Develop at least one default mapping from Collection Object to [Darwin Core Occurrence](https://rs.gbif.org/core/dwc_occurrence_2025-07-10.xml), based on the aforementioned mapping.
* Develop default extension mappings for the following extensions:
	* [ ] [Identification History](https://rs.gbif.org/extension/dwc/identification_history_2025-07-10.xml)
	* [ ] [Audiovisual Core](https://rs.gbif.org/extension/ac/audiovisual_2024_11_07.xml)
	* [ ] [GGBN Material Sample](https://rs.gbif.org/extension/ggbn/materialsample.xml)
	* [ ] [EOL References](https://rs.gbif.org/extension/eol/reference_extension.xml)
	* [ ] [Resource Relationship](https://rs.gbif.org/extension/dwc/resource_relationship_2025-07-10.xml)

## Proposed Model

Below is a detailed outline of the model within Specify. This model distinguishes between standard queries and "Schema Mappings" to prevent user confusion and ensures terms are mapped at the field level.

<img width="2351" height="1250" alt="Image" src="https://github.com/user-attachments/assets/7352f5b3-d112-420e-bf22-bef8d82b0fbd" />

**Export Data Set Table** `exportdataset`

An **export data set** groups together the components needed to publish your data and is used to create a data set on platforms like GBIF.

_This is a replacement for the current `ExportFeed` app resource._

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| ExportName | Text | The name of the export. | KUBI Ichthyology Voucher |
| FileName | Text | The name of the export file once packaged, always ending with `.zip`. | kui-dwca.zip |
| RSS | Checkbox | Indicates whether the export should be made available via the RSS feed when updated. | Yes |
| Frequency | Integer | If published via RSS, the number of days between automatic updates of the RSS feed. |  |
| Metadata | Link to `spappresource` | A link to the app resource containing the Ecological Metadata Language (EML) created for the data set being published. | EML data sourced from GBIF or created using the [GBIF EML generator](https://gbif-norway.github.io/eml-generator-js/) |
| CoreMapping | Link to `schemamapping` | Links to the primary **Core** schema mapping for the export (e.g. Occurrence). | Voucher (schema mapping name) |
| Extensions | One-to-many to `extensions` | A one-to-many relationship where many schema mappings can be linked to a single export data set. | GBIF Identification, CO Audubon Core (schema mapping names) |

**Extensions** `extensions`

A join table that bridges `exportdataset` with `schemamapping` to capture the one-to-many nature of extensions.

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| Mapping | Link to `schemamapping` | Links to the extension’s schema mapping. The system does *not* require the same number of rows as the Core, but the extension query must include the `occurrenceID` (inherited from the Core query) to facilitate the join. | CO Audubon Core |
| ExportDataSet | Link to `exportdataset` | Links to the export data set the extension is connected to. |  |

**Schema Mapping Table** `schemamapping`

A **schema mapping** is a strict wrapper around the standard query system (`spquery`). It segregates "Mapping Queries" from standard "User Queries" in the UI. 

> This is the replacement for the `spexportschemamapping` system in Specify 6.

The distinction between the `spquery` and the `schemamapping` record should be invisible to the user for all intents and purposes. When the user creates a `schemamapping`, the UI should ask whether it is a "Core" or "Extension" mapping and let them provide a description. On the user side, the title of the query can be used to identify the mapping.

The development team may decide whether this table is needed or whether extending the `spquery` table is sufficient.

This table defines whether the underlying query is a **Core** (e.g., Occurrence) or an **Extension** (e.g., Audubon Core).

| Field | Type | Description | Example |
| --- | --- | --- | --- |
| Query | Link to `spquery` | **One-to-One** link to `spquery`. The underlying query engine handles the logic. (Required) |  |
| MappingType | Enum | Defines the role: `Core` (Occurrence) or `Extension`. (Required) | Core |
| Description | Text | User-facing description of what this mapping achieves. | Maps Collection Object to DwC Occurrence |

**Query Field Extensions** `spqueryfield` modification

The existing `spqueryfield` table is extended to support mapping specific columns to terms and exporting static values. This allows the `term` to be associated with the specific output column, rather than the query as a whole.

| Field | Description | Example |
| --- | --- | --- |
| Term | Nullable link to `spterm`. If set, this column is exported with the Term Name as the header. | catalogNumber |
| IsStatic | Boolean. If `true`, the `StringId` is ignored. | True |
| StaticValue | The actual static text to export if `IsStatic` is true. | "PreservedSpecimen" |

**Terms** `spterm`

The **Terms** table/resource acts as the controlled vocabulary for Darwin Core and extension terms. This table represents the version currently supported by Specify.

* **System Terms:** Read-only terms provided by Specify updates.
* **Custom Terms:** Users can add new terms to support new extensions, but cannot edit system terms.

> [!IMPORTANT]
> If a mapping path (e.g. field names joined together) works better than `StringID`, we should use it instead, since a path is more easily understood by the user.

| Field | Description | Example |
| --- | --- | --- |
| IRI | IRI (Internationalized Resource Identifier) is a unique, stable, and machine-readable identifier for a resource. This is used when constructing the `meta.xml` for publishing. | [http://rs.tdwg.org/dwc/terms/catalogNumber](http://rs.tdwg.org/dwc/terms/catalogNumber) |
| Term | A term is a standardized metadata element from a vocabulary used to consistently describe and share collections data such as specimens, observations, and related information. | catalogNumber |
| Description | The description provided by the schema (Read Only). | An identifier (preferably unique) for the record within the data set or collection. |
| StringID | The field `stringId` for the path added to the query to assist in automatically mapping it. | 1.collectionobject.catalogNumber |
| VocabularyURI | Groups terms by schema. | [http://rs.tdwg.org/dwc/terms/](http://rs.tdwg.org/dwc/terms/) |
| IsSystem | Indicates if this is a system-provided term. | True |

---

## High-Level Entity Relationship Diagram (ERD)

This diagram visualizes the relationships between the new tables and the existing system (`spquery`, `spqueryfield`, `spappresource`, `collection`).

```mermaid
erDiagram
    %% The main configuration object for an export
    exportdataset ||--|| schemamapping : "has Core Mapping"
    exportdataset ||--o{ extensions : "has Extensions"
    exportdataset }o--|| spappresource : "has Metadata (EML)"
    exportdataset }o--|| collection : "belongs to"

    %% The extension join table
    extensions }o--|| schemamapping : "uses Mapping"

    %% The Schema Mapping wrapper
    schemamapping ||--|| spquery : "wraps Query"
    schemamapping }o--|| specifyuser : "owned by"

    %% The Query and its fields
    spquery ||--o{ spqueryfield : "contains fields"

    %% The Term definition
    spqueryfield }o--o| spterm : "maps to Term"

    %% Term definitions
    spterm {
        string IRI
        string Term
        string VocabularyURI
        boolean IsSystem
    }

    %% Query Field Extensions
    spqueryfield {
        string FieldName
        boolean IsStatic
        string StaticValue
    }

    %% Schema Mapping Types
    schemamapping {
        enum MappingType "Core/Extension"
    }

```

## Data Flow: Defining an Export

This sequence diagram demonstrates the workflow for an Institution Admin to define mappings and create an export data set.

```mermaid
sequenceDiagram
    participant User as Institution Admin
    participant UI as Specify 7 UI
    participant Backend as Export Engine
    participant DB as Database

    Note over User, DB: Step 1: Define Mappings (Core & Extensions)

    User->>UI: Create New Schema Mapping (Core)
    UI->>User: Display Query Builder with "DwC Term" column
    User->>UI: Select Fields & Map to Terms
    User->>UI: Save Mapping as "Core"
    UI->>DB: Insert into schemamapping & spquery

    User->>UI: Create New Schema Mapping (Extension)
    UI->>User: Display Query Builder
    User->>UI: Select Fields (must include occurrenceID)
    User->>UI: Map to Terms & Save as "Extension"
    UI->>DB: Insert into schemamapping & spquery

    Note over User, DB: Step 2: Create Export Data Set

    User->>UI: Create New Export Data Set
    User->>UI: Select Core Mapping
    User->>UI: Select Extension Mappings
    User->>UI: Link EML Metadata Resource
    User->>UI: Save Export Data Set
    UI->>DB: Insert into exportdataset

    Note over User, DB: Step 3: Execution (Publishing)

    User->>UI: Click "Export" / Auto-Scheduler triggers
    UI->>Backend: Request Export (ID)
    Backend->>DB: Fetch Core Query & Extension Queries via exportdataset
    Backend->>Backend: Execute Core Query -> Write Core CSV
    Backend->>Backend: Execute Extension Queries -> Write Ext CSVs
    Backend->>Backend: Generate meta.xml from mappings
    Backend->>Backend: Package ZIP (DwCA)
    Backend->>User: Return Download Link / Update RSS

```
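The packaging steps at the end of Step 3 can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the export engine's design: the function names, the dictionary shape used for each table (`file`, `row_type`, `fields`, `rows`), and the choice to treat the first column as the `id` / `coreid` column are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta_xml(core, extensions, metadata_file="eml.xml"):
    """Build meta.xml; each table's "fields" is a list of (header, iri)."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": metadata_file})
    tables = [("core", "id", core)]
    tables += [("extension", "coreid", ext) for ext in extensions]
    for tag, id_tag, table in tables:
        node = ET.SubElement(archive, tag, {
            "encoding": "UTF-8",
            "fieldsTerminatedBy": ",",
            "linesTerminatedBy": "\\n",
            "fieldsEnclosedBy": '"',
            "ignoreHeaderLines": "1",
            "rowType": table["row_type"],
        })
        files = ET.SubElement(node, "files")
        ET.SubElement(files, "location").text = table["file"]
        # Assumption: the first column links rows back to the core record.
        ET.SubElement(node, id_tag, {"index": "0"})
        for index, (_header, iri) in enumerate(table["fields"]):
            if iri:  # columns without a term are left undeclared in this sketch
                ET.SubElement(node, "field", {"index": str(index), "term": iri})
    return ET.tostring(archive, encoding="unicode")


def write_csv(fields, rows):
    """Serialize one table with a header row and "\\n" line endings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header for header, _iri in fields])
    writer.writerows(rows)
    return buffer.getvalue()


def package_dwca(core, extensions, eml_xml):
    """Return the bytes of a zip with the data files, meta.xml, and eml.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        for table in [core, *extensions]:
            archive.writestr(table["file"], write_csv(table["fields"], table["rows"]))
        archive.writestr("meta.xml", build_meta_xml(core, extensions))
        archive.writestr("eml.xml", eml_xml)
    return out.getvalue()
```

The `term` attribute of each `<field>` comes from the `IRI` stored on the mapped `spterm` record, which is why the model keeps the IRI alongside the term name.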

## Logic Flow: Field Mapping Execution

The backend needs to determine what value to write for a specific column in the export file.

```mermaid
flowchart TD
    Start[Start Export Row Processing] --> NextField{Next Column?}
    NextField -- Yes --> CheckStatic{IsStatic = True?}
    NextField -- No --> End[Finish Row]

    CheckStatic -- Yes --> WriteStatic[Write 'StaticValue' to CSV]
    WriteStatic --> NextField

    CheckStatic -- No --> CheckPath{Has StringID Path?}
    CheckPath -- Yes --> FetchDB[Fetch Value from DB using StringID]
    FetchDB --> WriteDB[Write DB Value to CSV]
    WriteDB --> NextField

    CheckPath -- No --> WriteNull[Write Empty/Null to CSV]
    WriteNull --> NextField

```
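The flowchart above translates almost directly into code. In this sketch, each query field is a dictionary with the hypothetical keys `is_static`, `static_value`, and `string_id`, and `record` maps a field's `stringId` to the value already fetched for the current row; the real query engine would supply those values.

```python
def resolve_cell(field, record):
    """Return the text to write for one column of one export row."""
    if field.get("is_static"):
        # Static columns ignore the stringId entirely.
        return field.get("static_value") or ""
    string_id = field.get("string_id")
    if string_id:
        value = record.get(string_id)
        return "" if value is None else str(value)
    # Neither static nor mapped to a path: write an empty cell.
    return ""


def export_row(fields, record):
    """Resolve every column of a row, in field order."""
    return [resolve_cell(field, record) for field in fields]
```

For example, a static `basisOfRecord` column set to `PreservedSpecimen` yields that text for every row, while a column with `string_id` `1.collectionobject.catalogNumber` yields the record's catalog number.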

---
## Original Issue
> @grantfitzsimmons:
> 
> A user interface for mapping a Specify query to Darwin Core terms. This could be used multiple times throughout the interface, used to calculate MIDS levels for each record (#4604), share data easily to GBIF and other data aggregators, and much more.
