EGAP (Entheome Genome Assembly Pipeline) is a versatile bioinformatics pipeline for hybrid genome assembly using Oxford Nanopore (ONT), Illumina, and PacBio data. It evaluates assemblies based on BUSCO Completeness (Single + Duplicated), Assembly Contig Count, and N50, with additional metrics like L50 and GC-content available via QUAST.
- Preprocess & QC Reads
  - Merges multiple FASTQ files (`ont_combine_fastq_gz`, `illumina_extract_and_check`).
  - Trims and removes adapters (Trimmomatic, BBDuk).
  - Deduplicates reads (Clumpify).
  - Filters and corrects ONT reads (Filtlong, Ratatosk).
  - Generates Read Metrics (FastQC, NanoPlot, BBMap insert-size stats).
- Assembly
  - MaSuRCA: Illumina-only or hybrid (ONT/PacBio).
  - Flye: ONT-only or PacBio-only.
  - SPAdes: Illumina-only or hybrid (ONT/PacBio).
  - hifiasm: PacBio-only.
  - Best Assembly Selection based on Read Metrics from all available assemblies.
  - Runs BUSCO/Compleasm on two lineages for completeness.
  - Runs QUAST for contiguity (N50, contig count, etc.).
- Assembly Polishing
  - Polishes with Racon (2x, if ONT/PacBio) and Pilon (if Illumina).
  - Removes haplotigs with purge_dups (if long reads).
- Assembly Curation
  - Scaffolds and patches with RagTag (if reference provided).
  - Closes gaps with TGS-GapCloser (ONT) or Abyss-Sealer (Illumina-only).
- Quality Assessments & Classification
  - Runs BUSCO/Compleasm on two lineages for completeness.
  - Runs QUAST for contiguity (N50, contig count, etc.).
  - Classifies assemblies as AMAZING, GREAT, OK, or POOR.

Optimized for fungal genomes, EGAP is adaptable to other organisms by adjusting lineages and references.
Supported Input Modes:
- Illumina-only (SRA, DIR, or RAW FASTQ)
- Illumina + Reference (GCA or FASTA)
- Illumina + ONT (SRA, DIR, or RAW FASTQ)
- Illumina + ONT + Reference
- PacBio-only (SRA, DIR, or RAW FASTQ)
- Assembly-only (for QC analysis)
Future developments: Support for ONT-only and ONT + Reference.
- Overview
- Installation
- Pipeline Flow
- Command-Line Usage
- CSV Generation
- Example Data & Instructions
- Quality Control Output Review
- Future Improvements
- References
- Contribution
- License
The following tools are installed:
- Trimmomatic
- BBMap
- FastQC
- NanoPlot
- Filtlong
- Ratatosk
- gfatools
- hifiasm
- MaSuRCA
- Flye
- SPAdes
- Racon
- Burrows-Wheeler Aligner
- SamTools
- BamTools
- Pilon
- purge_dups
- RagTag
- TGS-GapCloser
- ABYSS-Sealer
- QUAST
- BUSCO
- Compleasm
The provided shell script, EGAP_setup.sh, installs all dependencies (Python 3.8+, Conda, and the main bioinformatics tools):

    bash /path/to/EGAP/bin/EGAP_setup.sh

Open a terminal in the directory where the Dockerfile is located and run:

    docker build -t entheome_ecosystem .

Run the container (adjust the path accordingly):

    docker run -it -v /path/to/data/mnt:/path/to/data/mnt entheome_ecosystem bash

Inside the Docker container, load the pre-generated EGAP environment:

    source /EGAP_env/bin/activate

Open a terminal in the directory where entheome.sif.def is located and run:

    sudo singularity build entheome.sif entheome.sif.def

Edit the parameters in nextflow.config and run (ensure that entheome.sif is in the same directory as draft_assembly.nf and nextflow.config):

    nextflow draft_assembly.nf -with-singularity entheome.sif

OR

Open a shell in the Singularity image and load the pre-generated EGAP environment:

    singularity shell entheome.sif -B /path/to/data/mnt:/path/to/data/mnt && \
    source /opt/conda/etc/profile.d/conda.sh && \
    conda activate EGAP_env

Install EGAP in a dedicated environment through the Bioconda channel with the following command:

    conda create -y -n EGAP_env python=3.8 && conda activate EGAP_env && conda install -y -c bioconda egap

- `--input_csv`, `-csv` (str): Path to CSV with sample data (default: None).
- `--output_dir`, `-o` (str): Path to the desired output directory (default: None).
- `--cpu_threads`, `-t` (int): CPU threads (default: 1).
- `--ram_gb`, `-r` (int): RAM in GB (default: 8).

Example invocation:

    EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 1 -r 8

It is necessary to provide a CSV file containing the required information for each sample.
The CSV file should have the following header and columns:
| ONT_SRA | ONT_RAW_DIR | ONT_RAW_READS | ILLUMINA_SRA | ILLUMINA_RAW_DIR | ILLUMINA_RAW_F_READS | ILLUMINA_RAW_R_READS | PACBIO_SRA | PACBIO_RAW_DIR | PACBIO_RAW_READS | SPECIES_ID | SAMPLE_ID | ORGANISM_KINGDOM | ORGANISM_KARYOTE | BUSCO_1 | BUSCO_2 | EST_SIZE | REF_SEQ_GCA | REF_SEQ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | None | None | SRA00000001 | None | None | None | None | None | None | Ab_sample1 | Ab_sample1 | Funga | Eukaryote | basidiomycota | agaricales | 55m | GCA00000001.1 | None |
| None | None | /path/to/ONT/sample1.fq.gz | None | None | /path/to/Illumina/sample1_1.fq.gz | /path/to/Illumina/sample1_2.fq.gz | None | None | None | Ab_sample2 | Ab_sample2_ONT | Funga | Eukaryote | basidiomycota | agaricales | 60m | None | None |
| None | None | None | None | None | None | None | None | None | /path/to/pacbio.fastq.gz | Ab_sample3 | Ab_sample3_Illu | Funga | Eukaryote | basidiomycota | agaricales | 55m | None | None |
| None | None | None | SRA00000002 | None | None | None | None | None | /path/to/pacbio.fastq.gz | Ab_sample4 | Ab_sample4_sub-name | Funga | Eukaryote | basidiomycota | agaricales | 55m | GCA00000002.1 | None |
- ONT_SRA: Oxford Nanopore Sequence Read Archive (SRA) Accession number. Use `None` if specifying individual files.
- ONT_RAW_DIR: Path to the directory containing all Raw ONT Reads. Use `None` if specifying individual files.
- ONT_RAW_READS: Path to the combined Raw ONT FASTQ reads (e.g., `/path/to/ONT/sample1.fq.gz`).
- ILLUMINA_SRA: Illumina Sequence Read Archive (SRA) Accession number. Use `None` if specifying individual files.
- ILLUMINA_RAW_DIR: Path to the directory containing all Raw Illumina Reads. Use `None` if specifying individual files.
- ILLUMINA_RAW_F_READS: Path to the Raw Forward Illumina Reads (e.g., `/path/to/Illumina/sample1_1.fq.gz`).
- ILLUMINA_RAW_R_READS: Path to the Raw Reverse Illumina Reads (e.g., `/path/to/Illumina/sample1_2.fq.gz`).
- PACBIO_SRA: PacBio Sequence Read Archive (SRA) Accession number. Use `None` if specifying individual files.
- PACBIO_RAW_DIR: Path to the directory containing all Raw PacBio Reads. Use `None` if specifying individual files.
- PACBIO_RAW_READS: Path to the combined Raw PacBio FASTQ reads (e.g., `/path/to/PACBIO/sample1.fq.gz`).
- SPECIES_ID: Species ID formatted as `<full species name>` (e.g., `Escherichia_coli`).
- SAMPLE_ID: Sample ID formatted as `<full species name>-<other identifiers>` (e.g., `Escherichia_coli-Illu-SRR32496875`).
- ORGANISM_KINGDOM: Kingdom of the organism (default: `None`).
- ORGANISM_KARYOTE: Karyote type of the organism (default: `None`).
- BUSCO_1: Name of the first Compleasm/BUSCO database (default: `None`).
- BUSCO_2: Name of the second Compleasm/BUSCO database (default: `None`).
- EST_SIZE: Estimated genome size in Mbp (e.g., `55m`, or `None`).
- REF_SEQ_GCA: GenBank assembly (GCA) Accession number (or `None`).
- REF_SEQ: Path to the reference genome for assembly (or `None`).
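
The column rules above can be turned into a small CSV generator. The following is a minimal sketch using only the Python standard library; the helper names `make_row` and `rows_to_csv_text` are hypothetical and not part of EGAP. It assumes that every unused field should hold the literal string `None`, as in the example table.

```python
import csv
import io

# Column names copied from the header shown above.
COLUMNS = [
    "ONT_SRA", "ONT_RAW_DIR", "ONT_RAW_READS",
    "ILLUMINA_SRA", "ILLUMINA_RAW_DIR",
    "ILLUMINA_RAW_F_READS", "ILLUMINA_RAW_R_READS",
    "PACBIO_SRA", "PACBIO_RAW_DIR", "PACBIO_RAW_READS",
    "SPECIES_ID", "SAMPLE_ID", "ORGANISM_KINGDOM", "ORGANISM_KARYOTE",
    "BUSCO_1", "BUSCO_2", "EST_SIZE", "REF_SEQ_GCA", "REF_SEQ",
]


def make_row(**fields):
    """Build one sample row; any column not supplied is set to "None"."""
    unknown = set(fields) - set(COLUMNS)
    if unknown:
        raise ValueError(f"Unknown column(s): {sorted(unknown)}")
    return {name: str(fields.get(name, "None")) for name in COLUMNS}


def rows_to_csv_text(rows):
    """Return the header plus the given rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illumina-only sample with a reference, taken from the example table above.
example_row = make_row(
    ILLUMINA_SRA="SRA00000001",
    SPECIES_ID="Ab_sample1",
    SAMPLE_ID="Ab_sample1",
    ORGANISM_KINGDOM="Funga",
    ORGANISM_KARYOTE="Eukaryote",
    BUSCO_1="basidiomycota",
    BUSCO_2="agaricales",
    EST_SIZE="55m",
    REF_SEQ_GCA="GCA00000001.1",
)
```

To produce an input file, write the returned text to disk, for example `open("input.csv", "w").write(rows_to_csv_text([example_row]))`.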
- If you are providing ANY raw reads, ensure they exist in their appropriate folder (`/path/to/sample_dir/Illumina`, `/path/to/sample_dir/ONT`, etc.); you may need to create the sample_dir from the output_dir, species_id, and sample_id (`/output_dir/species_id/sample_id`).
- If you are providing `ILLUMINA_RAW_F_READS` and `ILLUMINA_RAW_R_READS`, make sure the directory path containing the files DOES NOT CONTAIN `_1` or `_2`, but the actual read file names DO CONTAIN `_1` or `_2`.
- If you provide a value for `ILLUMINA_RAW_DIR`, set `ILLUMINA_RAW_F_READS` and `ILLUMINA_RAW_R_READS` to `None`. EGAP will automatically detect and process all paired-end reads within that directory. The same applies for `ONT_RAW_DIR`.
- Ensure that all file paths are correct and accessible.
- The CSV file should not contain extra spaces or special characters in the headers.
- If you just want to perform QC analysis on an already built assembly, provide the path to the assembly or the GCA Accession number to download in the `REF_SEQ` or `REF_SEQ_GCA` field respectively, provide `ORGANISM_KARYOTE` and the two BUSCO databases (`BUSCO_1`, `BUSCO_2`) to use, and DO NOT PROVIDE an estimated size (`EST_SIZE`).
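
Several of the notes above can be checked mechanically before starting a run. The sketch below is a hypothetical pre-flight checker, not part of EGAP; `check_row` and `is_set` are illustrative names. It assumes each row is a dict keyed by the CSV header and that unused fields hold the string `None`. It does not check that files exist on disk.

```python
import os

READ_FIELDS = (
    "ONT_SRA", "ONT_RAW_DIR", "ONT_RAW_READS",
    "ILLUMINA_SRA", "ILLUMINA_RAW_DIR",
    "ILLUMINA_RAW_F_READS", "ILLUMINA_RAW_R_READS",
    "PACBIO_SRA", "PACBIO_RAW_DIR", "PACBIO_RAW_READS",
)


def is_set(row, key):
    """True if the field holds a real value rather than "None" or nothing."""
    value = row.get(key)
    return value is not None and value.strip() not in ("", "None")


def check_row(row):
    """Return a list of human-readable problems found in one sample row."""
    problems = []

    # A raw-read directory replaces the individual read files.
    if is_set(row, "ILLUMINA_RAW_DIR") and (
        is_set(row, "ILLUMINA_RAW_F_READS") or is_set(row, "ILLUMINA_RAW_R_READS")
    ):
        problems.append("ILLUMINA_RAW_DIR is set: F/R read fields must be None")
    if is_set(row, "ONT_RAW_DIR") and is_set(row, "ONT_RAW_READS"):
        problems.append("ONT_RAW_DIR is set: ONT_RAW_READS must be None")

    # _1/_2 belong in the read file names, not in their directory path.
    for key in ("ILLUMINA_RAW_F_READS", "ILLUMINA_RAW_R_READS"):
        if not is_set(row, key):
            continue
        directory, filename = os.path.split(row[key].strip())
        if "_1" in directory or "_2" in directory:
            problems.append(f"{key}: directory path contains _1 or _2")
        if "_1" not in filename and "_2" not in filename:
            problems.append(f"{key}: file name contains neither _1 nor _2")

    # Assembly-only QC: a reference but no reads, so EST_SIZE must stay None.
    has_reads = any(is_set(row, key) for key in READ_FIELDS)
    has_reference = is_set(row, "REF_SEQ") or is_set(row, "REF_SEQ_GCA")
    if has_reference and not has_reads and is_set(row, "EST_SIZE"):
        problems.append("assembly-only QC: EST_SIZE must be None")

    return problems
```

An empty list means the row passed these particular checks.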
EGAP_test.csv is included in this repository for running test examples. Running all four samples takes about 24 hours on a 16-thread, 64 GB system.
ONT_SRA,ONT_RAW_DIR,ONT_RAW_READS,ILLUMINA_SRA,ILLUMINA_RAW_DIR,ILLUMINA_RAW_F_READS,ILLUMINA_RAW_R_READS,PACBIO_SRA,PACBIO_RAW_DIR,PACBIO_RAW_READS,SAMPLE_ID,SPECIES_ID,ORGANISM_KINGDOM,ORGANISM_KARYOTE,BUSCO_1,BUSCO_2,EST_SIZE,REF_SEQ_GCA,REF_SEQ
None,None,None,None,None,None,None,None,None,None,Escherichia_coli-RefSeq,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,None,GCA_000005845.2,None
None,None,None,SRR32496875,None,None,None,None,None,None,Escherichia_coli-Illu-RefSeq,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,GCA_000005845.2,None
SRR32405433,None,None,SRR32496875,None,None,None,None,None,None,Escherichia_coli-ONT-Illu,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,None,None
None,None,None,None,None,None,None,SRR31460895,None,None,Escherichia_coli-PacBio,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,None,None
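
Note that this test CSV lists `SAMPLE_ID` before `SPECIES_ID`, while the header table earlier lists them the other way round. Reading rows by header name rather than by position makes that difference harmless. A minimal sketch, assuming the literal string `None` marks an unused field; `read_samples` is a hypothetical helper, not EGAP's own loader:

```python
import csv
import io

# Header and first sample (RefSeq-only QC) copied from EGAP_test.csv above.
CSV_TEXT = (
    "ONT_SRA,ONT_RAW_DIR,ONT_RAW_READS,ILLUMINA_SRA,ILLUMINA_RAW_DIR,"
    "ILLUMINA_RAW_F_READS,ILLUMINA_RAW_R_READS,PACBIO_SRA,PACBIO_RAW_DIR,"
    "PACBIO_RAW_READS,SAMPLE_ID,SPECIES_ID,ORGANISM_KINGDOM,ORGANISM_KARYOTE,"
    "BUSCO_1,BUSCO_2,EST_SIZE,REF_SEQ_GCA,REF_SEQ\n"
    "None,None,None,None,None,None,None,None,None,None,"
    "Escherichia_coli-RefSeq,Escherichia_coli,Bacteria,prokaryote,"
    "gammaproteobacteria,enterobacterales,None,GCA_000005845.2,None\n"
)


def read_samples(handle):
    """Yield one dict per sample, turning the literal "None" into None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value == "None" else value)
               for key, value in row.items()}


samples = list(read_samples(io.StringIO(CSV_TEXT)))
first = samples[0]
print(first["SAMPLE_ID"], first["SPECIES_ID"], first["REF_SEQ_GCA"])
```

For a file on disk, pass `open("/path/to/EGAP_test.csv", newline="")` instead of the in-memory text.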
If you are providing your own data locally, be sure to have a species folder and, if needed, a sub-folder matching your Species ID. Example: Illumina-only data for Psilocybe cubensis B+ with a reference sequence of Psilocybe cubensis:
- /path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/f_reads.fastq.gz
- /path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/r_reads.fastq.gz
- /path/to/EGAP/EGAP_Processing/Ps_cubensis/ref_seq.fasta
If no sub-folder for a sub-species is needed, place everything in the main species folder, e.g.:
- /path/to/EGAP/EGAP_Processing/Ps_semilanceata/Illumina/f_reads.fastq.gz
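
The folder layout in these examples can be computed rather than typed by hand. The sketch below only builds path strings and touches no files. `expected_read_dir` is a hypothetical helper, and the layout `<base_dir>/<species folder>[/<sub-folder>]/<platform>` is inferred from the examples above rather than taken from EGAP's code.

```python
from pathlib import PurePosixPath


def expected_read_dir(base_dir, species_folder, platform, sub_folder=None):
    """Return the folder where raw reads for one platform are expected.

    platform is the per-technology folder name, e.g. "Illumina" or "ONT".
    sub_folder is optional; omit it when no sub-species folder is needed.
    """
    folder = PurePosixPath(base_dir) / species_folder
    if sub_folder:
        folder = folder / sub_folder
    return folder / platform


base = "/path/to/EGAP/EGAP_Processing"
print(expected_read_dir(base, "Ps_cubensis", "Illumina", "Ps_cubensis_B+"))
print(expected_read_dir(base, "Ps_semilanceata", "Illumina"))
```

To create the folders on a real system, convert the result with `pathlib.Path(...)` and call `mkdir(parents=True, exist_ok=True)`.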
EGAP generates final assemblies along with:
- QUAST metrics (contig count, N50, L50, GC%, coverage)
- BUSCO/Compleasm plots showing Single, Duplicated, Fragmented & Missing scores.
- Final assembly classification: AMAZING, GREAT, OK, or POOR
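
For readers unfamiliar with the contiguity metrics, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. EGAP takes these values from QUAST; the sketch below is only an illustration of the definition, with `n50_l50` as a hypothetical helper name.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Total is 290 bp, so half is 145 bp; 80 + 70 = 150 reaches it.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```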
The current thresholds for each metric classification (subject to change) are:
- first_busco_c = {"AMAZING": >=98.5, "GREAT": >=90.0, "OK": >=75.0, "POOR": <75.0}
- second_busco_c = {"AMAZING": >=98.5, "GREAT": >=90.0, "OK": >=75.0, "POOR": <75.0}
- contigs_thresholds = {"AMAZING": <100, "GREAT": <1000, "OK": <10000, "POOR": >10000}
- n50_thresholds = {"AMAZING": >100000, "GREAT": >10000, "OK": >1000, "POOR": <1000}
- l50_thresholds = {"AMAZING": #, "GREAT": #, "OK": #, "POOR": #} (still determining best metrics)
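
The thresholds above translate directly into per-metric rating functions. This is a minimal sketch, not EGAP's implementation; the function names are hypothetical. Two boundary cases are not covered by the listed thresholds (exactly 10000 contigs, and an N50 of exactly 1000), and the sketch assumes both fall into POOR. How the per-metric ratings are combined into the final classification is not described here, so the sketch stops at the individual metrics.

```python
def classify_busco(completeness):
    """Rate BUSCO completeness (Single + Duplicated, in percent)."""
    if completeness >= 98.5:
        return "AMAZING"
    if completeness >= 90.0:
        return "GREAT"
    if completeness >= 75.0:
        return "OK"
    return "POOR"


def classify_contigs(contig_count):
    """Rate the contig count; fewer contigs is better."""
    if contig_count < 100:
        return "AMAZING"
    if contig_count < 1000:
        return "GREAT"
    if contig_count < 10000:
        return "OK"
    return "POOR"  # assumption: exactly 10000 is treated as POOR


def classify_n50(n50):
    """Rate the N50 in bp; larger is better."""
    if n50 > 100000:
        return "AMAZING"
    if n50 > 10000:
        return "GREAT"
    if n50 > 1000:
        return "OK"
    return "POOR"  # assumption: exactly 1000 is treated as POOR


ratings = {
    "first_busco_c": classify_busco(96.2),
    "contigs": classify_contigs(350),
    "n50": classify_n50(250000),
}
print(ratings)
```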
BUSCO outputs are evaluated based on:
- Greater than or equal to 98.5% Completion (sum of Single and Duplicated genes) for an AMAZING Assembly
- Greater than or equal to 90.0% Completion for a GREAT Assembly
- Greater than or equal to 75.0% Completion for an OK Assembly
- Less than 75.0% Completion for a POOR Assembly
Additionally, fewer contigs aligning to BUSCO genes is preferable. Contigs with only duplicated genes are excluded from the plot (noted in the x-axis label).
[Figures: example BUSCO plots for Ps. cubensis B+, Ps. semilanceata, and Pa. papilionaceus, each against the agaricales and basidiomycota lineages.]
- Automated Quality Assessment Reports: Generate comprehensive quality reports post-assembly for easier analysis.
- Improved Data Management: Automated removal of excess files upon pipeline completion.
- Enhanced Support for Diverse Genomes: Optimize pipeline parameters for non-fungal genomes to improve versatility.
- Improved Error Handling: Develop robust error detection and user-friendly feedback.
- Integration with Additional Sequencing Platforms:
  - Support for ONT input only and ONT input with Reference sequence.
This pipeline was adapted from the following two pipelines:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24
Muñoz-Barrera A, Rubio-Rodríguez LA, Jáspez D, Corrales A, Marcelino-Rodriguez I, Lorenzo-Salazar JM, González-Montelongo R, Flores C. Benchmarking of bioinformatics tools for the hybrid de novo assembly of human whole-genome sequencing data. bioRxiv 2024.05.28.595812; doi: 10.1101/2024.05.28.595812
The example data are published in:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24
McKernan K, Kane L, Helbert Y, Zhang L, Houde N, McLaughlin S. A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production. F1000Research 2021, 10:961; doi: 10.12688/f1000research.55301.2
Ruiz‐Dueñas FJ, Barrasa JM, Sánchez‐García M, Camarero S, Miyauchi S, Serrano A, Linde D, Babiker R, Drula E, Ayuso‐Fernández I, Pacheco R, Padilla G, Ferreira P, Barriuso J, Kellner H, Castanera R, Alfaro M, Ramírez L, Pisabarro AG, Riley R, Kuo A, Andreopoulos W, LaButti K, Pangilinan J, Tritt A, Lipzen A, He G, Yan M, Ng V, Grigoriev IV, Cullen D, Martin F, Rosso M, Henrissat B, Hibbett D, Martínez AT. Genomic Analysis Enlightens Agaricales Lifestyle Evolution and Increasing Peroxidase Diversity. Molecular Biology and Evolution. 38(4): 1428-1446 (2020). 10.1093/molbev/msaa301.
Floudas D, Bentzer J, Ahrén D, Johansson T, Persson P, Tunlid A. Uncovering the hidden diversity of litter-decomposition mechanisms in mushroom-forming fungi. ISME J 14, 2046–2059 (2020). 10.1038/s41396-020-0667-6.
If you would like to contribute to the EGAP Pipeline, please submit a pull request or open an issue on GitHub. For major changes, please discuss via an issue first.
This project is licensed under the BSD 3-Clause License.







