Scripts, information, and configuration files pertaining to the analysis of Long Reads from Whole Genome Sequencing (WGS), specifically ONT data.

faryabiLab/Long-Read-Analysis

This repository stores the conda environment manifests and scripts needed to analyze whole-genome long-read sequencing data, specifically data from Oxford Nanopore Technologies (ONT).

Environments

Use conda to set up compute environments for the scripts below.
Available environments:

| Environment | Description | Usage |
|---|---|---|
| envs/ont-env.yml | Long read analysis-specific software. | ONT-specific analyses. |
| envs/hic_process.yml | HiC/MicroC alignment and processing with Juicer. | HiC alignment and processing. |
| envs/eaglec.yml | Structural variant detection from HiC contact matrices with EagleC. | SV detection via HiC matrix. |
| envs/herro/Dockerfile | Herro read correction (Docker container). | GPU-enabled read error correction with Herro. |

Install and activate any of the conda environments with:

conda env create -f <env.yml>
conda activate <env>

Note: The name of each environment is the filename without .yml.
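
The naming rule above can be sketched as a small shell helper. This is only an illustration: `env_name_from_manifest` is a hypothetical function, not something shipped in this repository, and it assumes every manifest really does declare a name matching its filename.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): derive the conda
# environment name from a manifest path by dropping the directory and
# the .yml extension.
env_name_from_manifest() {
    basename "$1" .yml
}

# Example usage (requires conda, so shown as comments only):
#   conda env create -f envs/ont-env.yml
#   conda activate "$(env_name_from_manifest envs/ont-env.yml)"
```

If a manifest ever declares a different `name:` field, the name inside the file is what conda uses, so treat the helper as a convenience rather than a guarantee.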

Build the herro Docker container:

cd envs/herro
docker build -t herro:latest .

Scripts

This section highlights the analysis scripts in this repository, broken down by analysis type.

Logging

When run, each of the scripts below writes an executable script to the directory it was run from. The generated script contains metadata on the run and the command(s) needed to reproduce the result.

  • Executables are named {script}_{date}_{time}.sh
    • Date = YYYYMMDD; Time = HHMMSS (24-hour time)
  • Each generated script can be run as-is, with no parameters, to reproduce its result, i.e. ./{script}_{date}_{time}.sh
  • Generated scripts can be run from anywhere.
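
As a rough sketch of the naming convention above, the following helper builds a name of the form {script}_{date}_{time}.sh. The function name `repro_script_name` is hypothetical, and stripping the `.sh` suffix from the wrapper name is an assumption; the repository's scripts may build the name differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the name of the generated reproduction script.
# Assumption: {script} is the wrapper's filename without its .sh suffix.
repro_script_name() {
    local script_name
    script_name="$(basename "$1" .sh)"
    # %Y%m%d gives YYYYMMDD, %H%M%S gives HHMMSS in 24-hour time.
    printf '%s_%s.sh\n' "$script_name" "$(date +%Y%m%d_%H%M%S)"
}

# Example usage:
#   repro_script_name scripts/seqkit_stats.sh
```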

Headers

Each generated executable script has a standardized header section with run metadata (I suggest adding this header to future scripts as well).

# ========================================
# Run on: <server name>
# Run by: <username>
# Environment: <conda env name>
# Run in directory <dir executed from>
# Date:
# Time:
# ========================================

The logging scheme records the who, what, when, and where of a run.
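
For anyone adding this header to a new script, here is a minimal sketch of how it could be written out. `write_repro_script` is a hypothetical helper, not code from this repository. It assumes bash (for `printf %q`), takes the environment name from conda's `CONDA_DEFAULT_ENV` variable, writes the date and time as YYYYMMDD and HHMMSS, and adds a `cd` into the original directory so that relative paths still resolve when the script is run from elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a reproduction script with the standard header.
# Usage: write_repro_script <output path> <command> [args...]
write_repro_script() {
    local out="$1"
    shift
    {
        echo '#!/usr/bin/env bash'
        echo '# ========================================'
        echo "# Run on: $(uname -n)"
        echo "# Run by: $(id -un)"
        echo "# Environment: ${CONDA_DEFAULT_ENV:-none}"
        echo "# Run in directory $PWD"
        echo "# Date: $(date +%Y%m%d)"
        echo "# Time: $(date +%H%M%S)"
        echo '# ========================================'
        # Assumption: returning to the original directory keeps relative
        # paths in the recorded command valid.
        printf 'cd %q || exit 1\n' "$PWD"
        # Record the command with shell quoting so it replays verbatim.
        printf '%q ' "$@"
        echo
    } > "$out"
    chmod +x "$out"
}

# Example usage:
#   write_repro_script "seqkit_stats_$(date +%Y%m%d_%H%M%S).sh" seqkit stats -a reads.fastq
```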

Quality Control / Filtering

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/seqkit_stats.sh | Simple wrapper of the seqkit stats command to get assembly statistics like average length and N50. | seqkit | ont-env |
| scripts/filter_reads_length.sh | Wraps FiltLong to filter out reads in the bottom 10% by length. | FiltLong | ont-env |
| scripts/nanoplot_run.sh | Uses NanoPlot to generate summary statistics for a fastq. | NanoPlot | ont-env |
| scripts/porechop_trim.sh | Wraps porechop for adapter trimming. | porechop | ont-env |
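
To make the N50 statistic mentioned above concrete, here is a small sketch that computes it from a list of sequence lengths. This is not how seqkit implements it; `n50` is a hypothetical helper that follows the usual definition: sort lengths from longest to shortest and report the length at which the running total first reaches half of all bases.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one sequence length per line on stdin and
# print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # First length at which the running sum covers half the bases.
                if (running >= half) { print len[i]; exit }
            }
        }'
}

# Example usage, assuming an unwrapped 4-line-per-record FASTQ:
#   awk 'NR % 4 == 2 { print length($0) }' reads.fastq | n50
```

For lengths 2, 3, 4, 5 and 6 (20 bases in total), the running sum over the sorted list reaches 10 at the second-longest sequence, so the N50 is 5.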

Herro Read Error Correction

| Script | Description | Environment |
|---|---|---|
| scripts/herro_preprocess.sh | Runs the herro pipeline preprocessing step. | herro/Dockerfile |
| scripts/herro_batch_align.sh | Runs the herro batch alignment workflow. | herro/Dockerfile |
| scripts/herro_correct.sh | Runs the herro read error correction pipeline. | herro/Dockerfile |

Alignment

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/align_ont_dorado.sh | Aligns an ONT fastq with dorado align. | Dorado (minimap2) | ont-env |
| scripts/align_contigs_to_ref.sh | Wraps minimap2 to align assembled contigs to a reference sequence. | minimap2 | ont-env |

Alignment Extraction

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/extract_split_reads.sh | Extracts split reads and reads providing context in flanking regions by specifying a window size in bp. | samtools | ont-env |
| scripts/extract_bnd_reads_noWindow.sh | Extracts split reads by specifying chromosomal start and end coordinates, no window needed. | samtools | ont-env |
| scripts/extract_spanning_contigs.sh | After aligning assembled contigs to a reference genome with align_contigs_to_ref.sh, extracts contigs that align to multiple chromosomes. | samtools | ont-env |
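
The idea behind extracting contigs that align to multiple chromosomes can be sketched with awk over SAM records. This is an illustration only, not the logic of extract_spanning_contigs.sh: `multi_chrom_queries` is a hypothetical helper, and the file names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read tab-separated SAM records on stdin and print
# every query name (column 1) that has alignments on more than one
# reference sequence (column 3).
multi_chrom_queries() {
    awk -F '\t' '
        /^@/      { next }    # skip SAM header lines
        $3 == "*" { next }    # skip unmapped records
        # Count each (query, reference) pair once per query.
        !seen[$1 SUBSEP $3]++ { nref[$1]++ }
        END { for (q in nref) if (nref[q] > 1) print q }
    ' | sort
}

# Example usage (requires samtools; file names are placeholders):
#   samtools view contigs_vs_ref.bam | multi_chrom_queries > spanning_contigs.txt
```

A contig whose primary and supplementary alignments all land on the same chromosome is not reported, which keeps ordinary split alignments within one chromosome out of the list.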

Assembly

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/hifiasm_assembly.sh | Assembles a fastq using hifiasm. | HiFiasm | ont-env |
| scripts/flye_assembly.sh | Assembles a fastq using flye. | Flye | ont-env |
| scripts/ragtag_scaffold.sh | Scaffolds contigs together with RagTag. | RagTag | ont-env |

| Tool | Description |
|---|---|
| D-GENIES | UI used for generating dotplots from paired fasta. |

Structural Variant Discovery

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/eaglec_bnds.sh | Finds structural variants from a 3C matrix in .mcool format. | EagleC | eaglec |
| scripts/spectre_cnv.sh | Calculates and plots copy number information from read depth. | spectre | ont-env |

Assembly Annotation

| Script | Description | Tool(s) | Environment |
|---|---|---|---|
| scripts/liftoff_gene_annotations.sh | Uses a reference .gtf and .fasta to annotate gene sequences in an assembly .fasta. | Liftoff | ont-env |

Utilities / Miscellaneous

| Script | Description | Environment |
|---|---|---|
| scripts/parse_cigar.py | Parses and plots CIGAR strings within a sam file. | ont-env |
| scripts/make_bnd_seq.sh | Extracts two subsequences from a reference fasta file at given coordinates, returning them as separate fasta entries in a single file. | ont-env |
| scripts/make_bnd_seq_concat.sh | Concatenates two subsequences from a reference fasta file at given coordinates, returning them as a single fasta entry. | ont-env |
| scripts/gfa_to_fasta.sh | Converts a gfa file to a fasta file with gfatools. | ont-env |
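
To show what a GFA-to-FASTA conversion involves, here is a plain awk sketch. It is a swapped-in alternative for illustration, not the gfatools-based method that gfa_to_fasta.sh wraps, and `gfa_segments_to_fasta` is a hypothetical name. It assumes segment (S) lines carry the sequence in the third column; segments stored as `*` (no sequence) would be emitted as-is.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn every GFA segment line (record type S) into a
# FASTA entry, using the segment name as the header and column 3 as the
# sequence. Reads the named files, or stdin if none are given.
gfa_segments_to_fasta() {
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$@"
}

# Example usage (file names are placeholders):
#   gfa_segments_to_fasta assembly.gfa > assembly.fasta
```

Header, link, and path lines are ignored, so only the contig sequences end up in the output.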

Browsers

JBrowse2

  • Upload track files in data/JBrowse_Sessions to view alignments.

Workflows

TBA

Tool List

All tools below are available as conda environments in envs/.
Specifics of custom parameters can be found in TOOL-TIPS.md.

Resources

More tools can be found on Long-Read-Tools.
