Usage example:

python nfdi-ena-cli.py --metadata example.tsv --fasta-dir fasta --ena-user 'your username' --ena-password 'your password' --study-name 'study example' --study-title 'title for the study' --study-description 'description for the study'

This system aims to automate the validation and submission of metadata and sequencing data to the European Nucleotide Archive (ENA), following the metadata standards defined by the MIxS specification.
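
The flags in the usage example could be declared with `argparse` roughly as follows. This is only a sketch: the flag names come from the command above, while the help texts, the `build_parser` name, and the assumption that every flag is required are illustrative and may not match the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the usage example (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="nfdi-ena-cli.py")
    # Assumption: all flags are mandatory; the real tool may define defaults.
    parser.add_argument("--metadata", required=True, help="TSV file with sample metadata")
    parser.add_argument("--fasta-dir", required=True, help="directory containing FASTA files")
    parser.add_argument("--ena-user", required=True, help="ENA account name")
    parser.add_argument("--ena-password", required=True, help="ENA account password")
    parser.add_argument("--study-name", required=True)
    parser.add_argument("--study-title", required=True)
    parser.add_argument("--study-description", required=True)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
example_args = build_parser().parse_args([
    "--metadata", "example.tsv",
    "--fasta-dir", "fasta",
    "--ena-user", "your username",
    "--ena-password", "your password",
    "--study-name", "study example",
    "--study-title", "title for the study",
    "--study-description", "description for the study",
])
print(example_args.metadata, example_args.fasta_dir)
```

`argparse` converts dashes in flag names to underscores, so `--fasta-dir` is read back as `example_args.fasta_dir`.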

Metadata validation:
- Date format verification (ISO 8601).
- Expected value checks and unit validation.
- Controlled vocabulary and ontology validation (e.g., ENVO, CHEBI, NCBI Taxonomy).
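
The three kinds of checks listed above can be sketched in a few lines of Python. The helper names and regular expressions below are assumptions made for illustration, not the tool's implementation. The ontology check only verifies the `label [PREFIX:ID]` syntax and does not look the term up in ENVO, CHEBI, or NCBI Taxonomy.

```python
import re
from datetime import datetime

# Hypothetical patterns: "<number> <unit>" and "<label> [<PREFIX>:<ID>]".
VALUE_UNIT = re.compile(r"^(?P<value>-?\d+(?:\.\d+)?) (?P<unit>.+)$")
TERM = re.compile(r"^.+ \[(?P<prefix>[A-Za-z]+):\w+\]$")


def is_iso8601(value: str) -> bool:
    """Accept full dates and date-times; partial dates such as '2013-03' are rejected."""
    if value.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" for UTC.
        value = value[:-1] + "+00:00"
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def has_expected_unit(text: str, allowed_units: set) -> bool:
    """Check that the text is a number followed by one of the allowed units."""
    match = VALUE_UNIT.match(text)
    return match is not None and match["unit"] in allowed_units


def uses_ontology(text: str, prefixes: set) -> bool:
    """Check that the text looks like 'label [PREFIX:ID]' with an allowed prefix."""
    match = TERM.match(text)
    return match is not None and match["prefix"] in prefixes


print(is_iso8601("2013-03-25T12:42:31+01:00"))
print(has_expected_unit("100 m", {"m"}))
print(uses_ontology("arable soil [ENVO:00005742]", {"ENVO"}))
```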

Automated submission:
- Upload of metadata and sequencing data files.
- Integration with the ENA submission API (planned for future phases).

The initial version of this system supports only terrestrial metadata.
The metadata fields and their requirements are based on the MIxS specification and are described below.

| Category | Metadata | Definition | Reference | Expected Value / Unit | Example |
|---|---|---|---|---|---|
| Project metadata | project_name | Name of the project within which the sequencing was organized | MIXS:0000092 | Free text string | Forest soil metagenome |
| Site metadata | collection_date | The time of sampling, either as an instance (single point in time) or interval. ISO8601 format compliant | MIXS:0000011 | YYYY-MM-DD | 2013-03-25T12:42:31+01:00 |
| | collected_by | Name of person or institute that collected the sample | ENA Reference | Free text string | UFZ - Centre for environmental research |
| | geo_loc_name | Geographic location (country/sea and region). Use INSDC/GAZ list | MIXS:0000010 | Free text or ontology | USA: Maryland, Bethesda / GAZ:00051071 |
| | lat | Latitude in decimal degrees (WGS84) | MIXS:0000009 | Decimal degrees, max 8 decimals | -41.373744 |
| | lon | Longitude in decimal degrees (WGS84) | MIXS:0000009 | Decimal degrees, max 8 decimals | 146.266145 |
| | elev | Elevation from Earth's surface in meters | MIXS:0000093 | Meter | 100 m |
| | alt | Altitude above Earth's surface | MIXS:0000094 | Meter | 100 m |
| | depth | Depth below surface (e.g., soil, sediment) | MIXS:0000018 | Meter | 100 m |
| | env_broad_scale | Major environmental system(s) (e.g., biome). Use EnvO terms | MIXS:0000012 | Ontology terms separated by " \| " | aquatic biome [ENVO:00002030] \| terrestrial biome [ENVO:00000446] |
| | env_local_scale | Environmental entities near sample. Use subclass of env_broad_scale | MIXS:0000013 | Ontology terms separated by " \| " | woodland biome [ENVO:01000175] \| tundra biome [ENVO:01000180] |
| | env_medium | Environmental materials in contact with the sample | MIXS:0000014 | Ontology terms separated by " \| " | arable soil [ENVO:00005742] \| bulk soil [ENVO:00005802] |
| | chem_administration | Chemicals applied to host or site. Use CHEBI IDs | MIXS:0000751 | CHEBI;timestamp; multiple values separated by " \| " | agar [CHEBI:2509];2018-05-11T20:00Z \| castor oil [CHEBI:140618];2023-12-07T17:00+02:00 |
| | temp | Environmental temperature | MIXS:0000113 | Degree Celsius | 25 degree Celsius |
| | salinity | Environmental salinity | MIXS:0000183 | Practical salinity unit or percentage | 25 practical salinity unit |
| | pH | Environmental pH | MIXS:0001001 | pH value | pH 7.2 |
| Sample metadata | samp_name | Local sample identifier (used in sequencing, unique per submitter) | MIXS:0001107 | Free text | Soil1Sample2Seq2 |
| | source_mat_id | Unique ID of the material sample used for extraction | MIXS:0000001 | Culture collection IDs or unique local ID | MPI012345 |
| | samp_size | Total amount of sample (volume, mass, area) | MIXS:0000001 | ml, g, m² | 2000 ml \| 1000 g soil |
| | temp | Sample temperature at time of sampling | MIXS:0000113 | Degree Celsius | 25 degree Celsius |
| | salinity | Total concentration of dissolved salts | MIXS:0000183 | Practical salinity unit or percentage | 25 practical salinity unit |
| | ph | pH of the sample or its aqueous phase | MIXS:0001001 | pH value | 7.2 |
| | samp_taxon_id | NCBI taxon ID of sample or control | MIXS:0001320 | NCBI Taxonomy ID | 749906 |
| | samp_collect_method | Method of sample collection | MIXS:0001225 | PMID, DOI, URL or free text | |
| | microbial_isolate | Was a microbial isolate cultured? | - | Y/N | |
| | microb_cult_med | Microbial culture medium used, if applicable | MIXS:0001216 | Ontology terms or free text | minimal defined medium [MCO:0000881] |
| Host metadata | host_taxid | NCBI taxon ID of the host | MIXS:0000250 | NCBI Taxonomy ID | Homo sapiens [NCBI:txid9606] |
| | host_common_name | Common name of host | MIXS:0000248 | Free text | human |
| | host_height | Height of host | MIXS:0000264 | cm, mm, m | 177 cm |
| | host_length | Length of host | MIXS:0000256 | cm, mm, m | 100 cm |
| | host_tot_mass | Total mass of the host | MIXS:0000263 | kg, g | 77 kg |
| | host_body_site | Body site from where sample was collected | MIXS:0000867 | FMA or UBERON ontology | gut [FMA:45615] |
| | host_body_product | Substance produced by the host body (e.g. mucus, blood) | MIXS:0000867 | FMA or UBERON ontology | mucus [FMA:66938] \| blood plasma [UBERON:0001969] |
| | host_age | Age of the host at collection | MIXS:0000255 | year, day, hour | 28 y |
| | host_sex | Sex of the host | MIXS:0000811 | male, female, unknown | female |
| | host_diet | Diet of the host | MIXS:0000869 | Free text or ontology | omnivore [ecocore:00000082] |
| | host_disease_stat | Diagnosed disease(s) of the host | MIXS:0000031 | Free text or Disease Ontology | avian influenza |
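
Two of the formats in the table benefit from a concrete illustration: multi-valued ontology cells separated by " | ", and coordinates limited to 8 decimals. The following is a minimal sketch under stated assumptions. `split_terms` and `valid_coordinate` are hypothetical names, and the range limits of 90 and 180 degrees are the usual bounds for latitude and longitude rather than values taken from the table.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical pattern for a single "label [PREFIX:ID]" term.
TERM = re.compile(r"^(?P<label>.+?) \[(?P<curie>[A-Za-z]+:\w+)\]$")


def split_terms(cell: str) -> list:
    """Split a multi-valued cell on "|" and return (label, identifier) pairs."""
    pairs = []
    for part in cell.split("|"):
        match = TERM.match(part.strip())
        if match is None:
            raise ValueError(f"not a 'label [PREFIX:ID]' term: {part!r}")
        pairs.append((match["label"], match["curie"]))
    return pairs


def valid_coordinate(text: str, limit: int) -> bool:
    """Check a decimal-degree value: within +/- limit and at most 8 decimals."""
    try:
        number = Decimal(text)
    except InvalidOperation:
        return False
    if not number.is_finite():
        return False
    # The exponent of a Decimal is the negated count of decimal places.
    decimals = -number.as_tuple().exponent
    return abs(number) <= limit and decimals <= 8


print(split_terms("aquatic biome [ENVO:00002030] | terrestrial biome [ENVO:00000446]"))
print(valid_coordinate("-41.373744", 90))   # latitude example from the table
print(valid_coordinate("146.266145", 180))  # longitude example from the table
```

`Decimal` is used instead of `float` so that the number of decimals is counted from the text as written rather than from a binary approximation.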

Planned extensions:
- Expansion to other MIxS packages (e.g., host-associated, built environment).
- Full ENA submission automation (metadata XML generation, file uploads).
- Graphical user interface (GUI) for simplified data upload.