feature/project-archetype #7

@bwalsh

📔 Moved from ACED

Project Archetype Structure with META/ Folder

This project follows a standardized structure for managing large research data files and their associated FHIR metadata in a version-controlled, DRS- and FHIR-compatible format.

Overview

The META/ folder contains newline-delimited JSON (.ndjson) files representing FHIR resources describing the project, its data, and related entities. Large files are tracked using Git LFS, with a required correlation between each data file and a DocumentReference resource.


User Story

As a research data steward,
I want to manage all project metadata in standardized FHIR .ndjson files within the META/ folder,
So that I can ensure traceable, reproducible, DRS- and FHIR-compatible submissions that clearly link metadata to tracked data files.

Events

  • When: I have a static set of files to associate with a ResearchStudy, Patient, Specimen or ServiceRequest (Assay)
  • When: I have one or more "ad-hoc" (i.e., workflow) files to associate with a ResearchStudy, Patient, Specimen or ServiceRequest (Assay)
  • When: I have a data source (spreadsheet, files or bespoke system) that describes one or more "ad-hoc" (i.e., workflow) files to associate with a ResearchStudy, Patient, Specimen or ServiceRequest (Assay)

Acceptance Criteria

  • The META/ResearchStudy.ndjson file exists and contains at least one valid FHIR ResearchStudy resource.
  • The META/DocumentReference.ndjson file exists and contains exactly one DocumentReference resource per Git LFS-managed file in the project.
  • Each DocumentReference.content.attachment.url matches the relative file path of an actual Git LFS-managed file.
  • All Git LFS-managed files tracked in the repository are represented in the META/DocumentReference.ndjson file.
  • The .ndjson files are properly formatted: one valid JSON object per line.
  • The project includes a .gitattributes file that tracks large files via Git LFS.
  • Automated validation confirms that all required files and metadata correlations are present and consistent.

Directory Structure

<project-root>/
├── .gitattributes
├── .gitignore
├── META/
│   ├── ResearchStudy.ndjson
│   ├── DocumentReference.ndjson
│   ├── Patient.ndjson           (optional)
│   ├── Specimen.ndjson          (optional)
│   ├── ServiceRequest.ndjson    (optional)
│   ├── Observation.ndjson       (optional)
│   └── <Other FHIR>.ndjson      (optional)
└── data/
    ├── file1.bam
    ├── file2.fastq.gz
    └── <additional files>

Required Contents

META/ResearchStudy.ndjson

  • Contains at least one FHIR ResearchStudy resource describing the project.
  • Defines project identifiers, title, description, and key attributes.

META/DocumentReference.ndjson

  • Contains one FHIR DocumentReference resource per Git LFS-managed file.

  • Each DocumentReference.content.attachment.url field:

    • Must exactly match the relative path of the corresponding file in the repository.
    • Example:
{
  "resourceType": "DocumentReference",
  "id": "docref-file1",
  "status": "current",
  "content": [
    {
      "attachment": {
        "url": "data/file1.bam",
        "title": "BAM file for Sample X"
      }
    }
  ]
}
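
Because .ndjson stores one resource per line, the pretty-printed example above has to be collapsed to a single line when it is written to META/DocumentReference.ndjson. The following is a minimal Python sketch of that step. The helper names docref_for and to_ndjson are hypothetical, and deriving the id from the file path is an assumption made for illustration, not a project convention.

```python
import json
import re


def docref_for(path: str, title: str = "") -> dict:
    """Build a minimal DocumentReference for one repository-relative path."""
    # Assumption: the id is a slug of the path. FHIR ids may contain only
    # letters, digits, "-" and "." and are at most 64 characters long, so
    # very long or similar paths could collide after truncation.
    slug = re.sub(r"[^A-Za-z0-9.-]", "-", path)[:64]
    attachment = {"url": path}
    if title:
        attachment["title"] = title
    return {
        "resourceType": "DocumentReference",
        "id": slug,
        "status": "current",
        "content": [{"attachment": attachment}],
    }


def to_ndjson(resources: list) -> str:
    """Serialize resources as newline-delimited JSON: one compact object per line."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in resources)
```

Calling to_ndjson([docref_for("data/file1.bam", "BAM file for Sample X")]) yields a single line whose content[0].attachment.url is data/file1.bam.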

✅ Git LFS-Managed Files

  • All large files are tracked with Git LFS, typically under data/.
  • .gitattributes defines the file tracking rules.

Optional FHIR Metadata Files

  • Patient.ndjson: Participant records.
  • Specimen.ndjson: Biological specimens.
  • ServiceRequest.ndjson: Requested assays.
  • Observation.ndjson: Measurements or results.
  • Other valid FHIR resource types as required.

File-Metadata Correlation

  • Every Git LFS-managed file must have a corresponding DocumentReference resource.
  • Each DocumentReference.content.attachment.url field directly references the relative file path.
  • Every DocumentReference must correspond to a file that is actually present in the repository.

The meta validate command (final name still to be decided) checks both FHIR record validity and referential integrity across your project's META/ folder. Here's what it does:


✅ Syntax

# see legacy g3t
g3t meta validate [--project-root <path>]

🔍 Validation Steps

1. Schema Validation

  • Each .ndjson file in META/ (like ResearchStudy.ndjson, DocumentReference.ndjson, etc.) is read line by line.
  • Every line is parsed as JSON and checked against the corresponding FHIR schema for that resourceType.
  • Syntax errors, missing required fields, or invalid FHIR values trigger clear error messages with line numbers.
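
As a rough illustration of the line-by-line pass, the sketch below reads one .ndjson file and reports problems with their line numbers. It covers only structural checks (parseable JSON, a JSON object, a resourceType that matches the file name, an id) and is not full FHIR schema validation, which would need a FHIR-aware library. The function name check_ndjson, the message wording, and the rule that the file name equals the resourceType are assumptions drawn from the directory layout, not a description of the actual command.

```python
import json
from pathlib import Path


def check_ndjson(path: Path) -> list:
    """Return error strings for one META/*.ndjson file (structural checks only)."""
    # Assumption: DocumentReference.ndjson holds only DocumentReference resources, etc.
    expected_type = path.stem
    errors = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            where = f"ERROR {path.as_posix()} line {line_no}:"
            if not line.strip():
                errors.append(f"{where} blank line")
                continue
            try:
                resource = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"{where} invalid JSON ({exc.msg})")
                continue
            if not isinstance(resource, dict):
                errors.append(f"{where} expected a JSON object")
            elif resource.get("resourceType") != expected_type:
                errors.append(f"{where} resourceType is not {expected_type}")
            elif "id" not in resource:
                # Assumption: every record needs an id so other files can reference it.
                errors.append(f"{where} missing id")
    return errors
```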

2. Mandatory Files Presence

  • Confirms that:

    • ResearchStudy.ndjson exists and has at least one valid record.
    • DocumentReference.ndjson exists and contains at least one record.
  • If either is missing or empty, validation fails.
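
The presence check can be sketched in a few lines of Python. The function name check_required_files and the message format are hypothetical. The sketch only confirms that each required file exists and has at least one non-blank line, leaving record validity to the schema step.

```python
from pathlib import Path

REQUIRED = ("ResearchStudy.ndjson", "DocumentReference.ndjson")


def check_required_files(project_root: Path) -> list:
    """Report required META/ files that are missing or contain no records."""
    errors = []
    for name in REQUIRED:
        path = project_root / "META" / name
        if not path.is_file():
            errors.append(f"ERROR META/{name}: file is missing")
            continue
        with path.open(encoding="utf-8") as handle:
            # A record is any non-blank line; its validity is checked elsewhere.
            has_record = any(line.strip() for line in handle)
        if not has_record:
            errors.append(f"ERROR META/{name}: file contains no records")
    return errors
```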

3. One-to-One Mapping of Files to DocumentReference

  • Scans the working directory for Git LFS-managed files in expected locations (e.g., data/).

  • For each file, locates a corresponding DocumentReference resource whose content.attachment.url matches the file’s relative path.

  • Validates:

    • All LFS files have a matching DocumentReference.
    • All DocumentReferences point to existing files.
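
One way to picture the mapping check is as a comparison between two sets of paths. The sketch below is an illustration under stated assumptions: list_lfs_files shells out to git lfs ls-files --name-only and assumes the project root is the repository root, while check_mapping and its message wording are hypothetical. It also flags files that have more than one DocumentReference, since the acceptance criteria call for exactly one.

```python
import subprocess
from collections import Counter


def list_lfs_files(project_root: str = ".") -> list:
    """List paths of Git LFS-tracked files (requires git and git-lfs)."""
    result = subprocess.run(
        ["git", "lfs", "ls-files", "--name-only"],
        cwd=project_root, capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def check_mapping(lfs_paths: list, docrefs: list) -> list:
    """Compare LFS paths with DocumentReference content.attachment.url values."""
    urls = Counter(
        content.get("attachment", {}).get("url")
        for docref in docrefs
        for content in docref.get("content", [])
    )
    tracked = set(lfs_paths)
    errors = []
    for path in sorted(tracked):
        if urls[path] == 0:
            errors.append(f'ERROR "{path}" has no DocumentReference')
        elif urls[path] > 1:
            errors.append(f'ERROR "{path}" has {urls[path]} DocumentReferences')
    for url in sorted(u for u in urls if u is not None and u not in tracked):
        errors.append(f'ERROR url "{url}" does not resolve to an LFS-managed file')
    return errors
```

In use, the two pieces would be combined as check_mapping(list_lfs_files(root), docrefs), with docrefs parsed from META/DocumentReference.ndjson.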

4. Project-level Referential Checks

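  • Validates that every DocumentReference resource is linked to the project's ResearchStudy. In FHIR, relatesTo links one DocumentReference to another DocumentReference, so this link needs a different mechanism (for example, an extension); the exact mechanism is still to be defined.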

  • If FHIR resources like Patient, Specimen, ServiceRequest, Observation are present, ensures:

    • Their id fields are unique.
    • DocumentReference correctly refers to those resources (e.g., via subject or related fields).
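
The uniqueness check might look like the following sketch. The function name duplicate_ids is hypothetical. The key combines resourceType and id because a FHIR id only has to be unique within its resource type.

```python
from collections import Counter


def duplicate_ids(resources: list) -> list:
    """Return sorted "ResourceType/id" keys that occur more than once."""
    counts = Counter(
        f'{r.get("resourceType")}/{r.get("id")}' for r in resources
    )
    return sorted(key for key, n in counts.items() if n > 1)
```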

5. Cross-Entity Consistency

  • If multiple optional FHIR .ndjson files exist:

    • Confirms IDs referenced in one file exist in others.
    • Detects dangling references (e.g., a DocumentReference.subject that points to a Patient ID that's not in Patient.ndjson).
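
A dangling-reference check can be sketched by collecting every known ResourceType/id and then walking each resource for reference strings. This is an illustration, not the actual implementation: the function names are hypothetical, and the sketch assumes references are written as relative literals such as Patient/p1. Absolute URLs, contained (#id) references, and urn: references are skipped rather than checked.

```python
import re

# Assumption: only relative "ResourceType/id" references are checked.
RELATIVE_REF = re.compile(r"^[A-Za-z]+/[A-Za-z0-9.\-]{1,64}$")


def iter_references(node):
    """Yield every string stored under a "reference" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "reference" and isinstance(value, str):
                yield value
            else:
                yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def dangling_references(resources: list) -> list:
    """Report relative references whose target is not among the loaded resources."""
    known = {f'{r["resourceType"]}/{r["id"]}' for r in resources}
    errors = []
    for r in resources:
        source = f'{r["resourceType"]}/{r["id"]}'
        for ref in iter_references(r):
            if RELATIVE_REF.match(ref) and ref not in known:
                errors.append(f'ERROR {source}: reference "{ref}" is not defined')
    return errors
```

The resources list is expected to hold the parsed records from all META/*.ndjson files together, so that a reference in one file can resolve against ids defined in another.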

✅ Example Error Output

ERROR META/DocumentReference.ndjson line 4: url "data/some_missing.bam" does not resolve to an existing file
ERROR META/Observation.ndjson line 2: reference "Specimen/specimen-123" is not defined in META/Specimen.ndjson

🎯 Purpose & Benefits

  • Ensures all files and metadata are in sync before submission.
  • Prevents submission failures due to missing pointers or invalid FHIR payloads.
  • Enables CI integration, catching issues early in the development workflow.

💡 Recommendation

Incorporate g3t meta validate (or its successor, once the new name is decided) into pre-commit hooks or CI pipelines to enforce metadata integrity and maintain standards compliance.


Recommended Setup Workflow

git init
git lfs install
git lfs track "data/*"

mkdir META
# Add ResearchStudy.ndjson and DocumentReference.ndjson

git add .gitattributes META/ data/
git commit -m "Initial project structure with metadata and tracked files"

Validation Requirements

Automated tools or CI processes must:

  • Verify presence of META/ResearchStudy.ndjson with at least one record.
  • Verify presence of META/DocumentReference.ndjson with one record per LFS-managed file.
  • Confirm every DocumentReference.content.attachment.url matches an existing file path.
  • Check proper .ndjson formatting.

Example Minimal Project

my-project/
├── .gitattributes
├── META/
│   ├── ResearchStudy.ndjson      # 1 record
│   └── DocumentReference.ndjson  # 2 records, one per file below
└── data/
    ├── sample1.bam
    └── sample2.fastq.gz

Conclusion

This structure enables reproducible, FAIR-aligned management of research files and metadata, supporting FHIR-compatible submissions and standardized project organization.
