riboseq-flow - A Nextflow DSL2 pipeline to perform ribo-seq data analysis and comprehensive quality control
If you use riboseq-flow, please cite:
Iosub IA, Wilkins OG and Ule J. Riboseq-flow: A streamlined, reliable pipeline for ribosome profiling data analysis and quality control. Wellcome Open Research 2024, 9:179. https://doi.org/10.12688/wellcomeopenres.21000.1
- Introduction
- Pipeline summary
- Quick start (test the pipeline)
- Quick start (run the pipeline on your data)
- Pipeline parameters
- Pipeline outputs
- Pre-download container images
- Authors and contact
- Issues and contributions
- Contributing guidelines
- Why use riboseq-flow
riboseq-flow is a Nextflow DSL2 pipeline for the analysis and quality control of ribo-seq data.
- UMI extraction (UMI-tools) (Optional)
- Adapter and quality trimming, read length filtering (Cutadapt) (Optional)
- Read QC (FastQC)
- Pre-mapping to remove reads mapping to contaminants (Bowtie2, SAMtools) (Optional)
- Mapping to the genome and transcriptome (STAR)
- UMI-based deduplication (UMI-tools, SAMtools, BEDTools) (Optional)
- Extensive QC specialised for ribo-seq (mapping_length_analysis, R) (Optional)
- Gene-level read quantification (FeatureCounts)
- P-site identification, CDS occupancy and P-site diagnostics (riboWaltz) (Optional)
- Principal Component Analysis (PCA) using gene-level read counts and P-site counts over CDS regions (DESeq2)
- Generation of coverage tracks (deepTools)
- Design of sgRNA templates to deplete unwanted abundant contaminants (Ribocutter) (Optional)
- MultiQC report of reads QC, mapping statistics, PCA and summarised ribo-seq QC metrics (MultiQC)
- Tracking reads and visualisation of read fate through the key pipeline steps (networkD3)
- Ensure Nextflow (version 21.10.3 or later) and Docker or Singularity (version 3.6.4 or later) are installed on your system. Installation instructions are available in the Nextflow documentation. We recommend using Nextflow with Java 17.0.9 or later.
Note: The pipeline has been tested with Nextflow versions 21.10.3, 22.10.3, 23.04.2, 23.10.0, 24.10.2 and 25.10.0.
- Pull the desired version of the pipeline from the GitHub repository:
nextflow pull iraiosub/riboseq-flow -r v1.2.0
- Run the pipeline on the provided test dataset:
Using Singularity:
nextflow run iraiosub/riboseq-flow -r v1.2.0 -profile test,singularity
or using Docker:
nextflow run iraiosub/riboseq-flow -r v1.2.0 -profile test,docker
- Check successful execution.
- Ensure Nextflow and Docker/Singularity are installed on your system.
- Pull the desired version of the pipeline from the GitHub repository:
nextflow pull iraiosub/riboseq-flow -r v1.2.0
- Create a samplesheet samplesheet.csv with information about the samples you would like to analyse before running the pipeline. It must be a comma-separated file with 2 columns and a header row, as shown in the example below.
Note: Only single-end read data can be used as input; if you used paired-end sequencing, make sure the correct read is used for the analysis.
sample,fastq
sample1,/path/to/file1.fastq.gz
sample2,/path/to/file2.fastq.gz
sample3,/path/to/file3.fastq.gz
- Run the pipeline. The typical command for running the pipeline is as follows (the minimum parameters have been specified):
nextflow run iraiosub/riboseq-flow -r v1.2.0 \
-profile singularity,crick \
-resume \
--input samplesheet.csv \
--fasta /path/to/fasta \
--gtf /path/to/gtf \
--contaminants_fasta /path/to/contaminants_fasta \
--strandedness forward
- -profile: specifies a configuration profile. Profiles can give configuration presets for different compute environments. Options are test, docker, singularity and crick, depending on the system being used and resources available. Others can be found at nf-core.
- -resume: specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For inputs to be considered the same, both the file names and the file contents must be identical.
- --input specifies the input sample sheet
- --outdir specifies the output results directory (default: ./results)
- --tracedir specifies the pipeline run trace directory (default: ./results/pipeline_info)
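Before launching a run, it can help to sanity-check the samplesheet described above. The following is a minimal sketch, not part of riboseq-flow: check_samplesheet is a hypothetical helper that only verifies the sample,fastq header and that every row has two non-empty columns. It does not check that the FASTQ paths exist.

```shell
# Hypothetical helper (not shipped with riboseq-flow): checks that a samplesheet
# has the "sample,fastq" header and exactly two non-empty columns per row.
check_samplesheet() {
  # $1 = path to the samplesheet CSV
  awk -F',' '
    { sub(/\r$/, "") }                 # tolerate Windows line endings
    NR == 1 {
      if ($0 != "sample,fastq") { print "line 1: header must be sample,fastq"; bad = 1 }
      next
    }
    $0 == "" { next }                  # ignore blank lines
    NF != 2 || $1 == "" || $2 == "" {
      print "line " NR ": expected 2 non-empty columns"; bad = 1
    }
    END { exit bad }                   # non-zero exit status if any problem was found
  ' "$1"
}
```

A possible use is to call check_samplesheet samplesheet.csv and only start nextflow run when it returns successfully.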
The pipeline is compatible with any well-annotated organism for which a FASTA genome file and GTF annotation (and ideally rRNA and other contaminant sequences) are available. Recommended sources for these files are GENCODE and Ensembl. Important: The GTF file must include UTR annotations, and the format should follow the standards set by Ensembl or GENCODE.
Simplified Option for Human and Mouse:
Use the --org flag to automatically download and set up reference files for human or mouse genomes, eliminating the need to manually provide them.
Available options for --org are GRCh38 (human) and GRCm39 (mouse).
When --org is specified, all annotation files are sourced from the paths in the genomes.config file.
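As a rough illustration of the simplified option, the sketch below wraps a run that relies on --org instead of --fasta, --gtf and --contaminants_fasta. The wrapper name run_with_org and the DRY_RUN variable are assumptions made for this example; the flags and the two accepted --org values are those documented above, and the singularity profile is just one of the listed profiles.

```shell
# Hypothetical wrapper: validates the --org value, then runs (or prints) the command.
run_with_org() {
  # $1 = organism key, GRCh38 (human) or GRCm39 (mouse)
  org=$1
  case "$org" in
    GRCh38|GRCm39) ;;
    *) echo "unsupported --org value: $org" >&2; return 1 ;;
  esac
  # When DRY_RUN is set and non-empty, the command is printed instead of executed.
  ${DRY_RUN:+echo} nextflow run iraiosub/riboseq-flow -r v1.2.0 \
    -profile singularity \
    --input samplesheet.csv \
    --org "$org" \
    --strandedness forward
}
```

Calling DRY_RUN=1 run_with_org GRCm39 prints the command for inspection; without DRY_RUN the pipeline is launched.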
Manual Annotation File Specification:
If --org is not specified, the user needs to provide full paths to all required annotation files:
- --fasta path to the FASTA genome file
- --gtf path to the GTF annotation file
- --contaminants_fasta path to the FASTA file of abundant RNA contaminants (like rRNA). (Required if pre-mapping is enabled)
- --star_index path to directory of pre-built STAR index (Optional). If not provided, the pipeline will generate the STAR index.
- --transcript_info path to TSV file with a single representative transcript for each gene, with information on CDS start, length, end and transcript length (Optional). If not provided, the pipeline will create it using the GTF file, selecting the longest CDS transcript per gene. The transcript IDs in this file must match those in the GTF file. An example file format can be found here.
- --transcript_fasta path to the transcripts FASTA (full sequence, including CDS and UTRs), which must match the transcript_info file. (Required if supplying the transcript_info file). If not provided, the pipeline will generate this file for the selected representative transcripts.
- --save_index save the Bowtie2 and STAR indexes generated by the pipeline. By default, not enabled.
The pipeline allows the user to set preferred parameter values, or the option to skip pipeline steps, as detailed below. Most parameters have default values, which will be used by the pipeline unless the user overrides them by adding the appropriate options to the run script. Where a default value is missing, the user must provide an appropriate value. All pre-defined parameter values are listed here.
The pipeline supports UMI-based deduplication. If UMIs were used, you must explicitly specify so by providing the appropriate arguments.
- --with_umi enables UMI-based read deduplication. Use if you used UMIs in your protocol. By default, not enabled.
- --skip_umi_extract skips UMI extraction from the read, in case UMIs have been moved to the headers in advance or if UMIs were not used
- --umi_extract_method specifies the method to extract the UMI barcode (options: string (default) or regex)
- --umi_pattern specifies the UMI barcode pattern, e.g. 'NNNNN' indicates that the first 5 nucleotides of the read are from the UMI.
- --umi_separator specifies the UMI barcode separator (default: _; use rbc: if Ultraplex was used for demultiplexing)
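To make the meaning of a pattern such as 'NNNNN' and of the separator concrete, here is a small stand-alone illustration. It is not how the pipeline performs extraction (that is done with UMI-tools). The function move_umi_to_header is hypothetical and assumes an uncompressed FASTQ file with the UMI at the very start of each read.

```shell
# Illustration only: moves the first N bases of each read into the read name,
# joined by a separator, and trims the sequence and quality strings to match.
move_umi_to_header() {
  # $1 = FASTQ file, $2 = UMI length (5 for 'NNNNN'), $3 = separator (default "_")
  awk -v n="$2" -v sep="${3:-_}" '
    NR % 4 == 1 { header = $0; next }          # hold the header until the UMI is known
    NR % 4 == 2 {
      umi = substr($0, 1, n)
      sp = index(header, " ")                  # keep any description after the read name
      if (sp > 0) print substr(header, 1, sp - 1) sep umi substr(header, sp)
      else        print header sep umi
      print substr($0, n + 1)                  # sequence without the UMI
      next
    }
    NR % 4 == 3 { print; next }                # "+" line is passed through
    NR % 4 == 0 { print substr($0, n + 1) }    # qualities trimmed like the sequence
  ' "$1"
}
```

With a separator of rbc: the same function mimics read names in which the UMI has already been placed in the header by a demultiplexing step, which is the situation where --skip_umi_extract and --umi_separator are relevant.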
Read trimming steps are executed in the following order: (i) adaptors and low quality bases are trimmed, (ii) bases that need to be removed from read extremities are removed (e.g. non-templated bases, if applicable) and (iii) reads shorter than a minimum length are filtered out.
- --skip_trimming skips the adapter and quality trimming and length filtering step
- --save_trimmed saves the final and intermediate FASTQ files produced during the trimming steps. By default, not enabled.
- --adapter_threeprime sequence of 3' adapter (equivalent to -a in cutadapt)
- --adapter_fiveprime sequence of 5' adapter (equivalent to -g in cutadapt)
- --times_trimmed number of times a read will be adapter trimmed (default: 1)
- --cut_end number of nucleotides to be trimmed from the 5' or 3' end of reads (equivalent to -u in cutadapt). Supply positive values for trimming from the 5' end, and negative values for trimming from the 3' end (default: 0, no nucleotides are trimmed). Important: This step is performed after adapter trimming, and after UMIs have been moved to the read header. Useful for libraries where non-templated nucleotides are part of the read and need trimming (e.g. template-switching, OTTR libraries).
- --minimum_quality cutoff value for trimming low-quality ends from reads (default: 10)
- --minimum_length minimum read length after trimming (default: 20)
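The sign convention of --cut_end and the final length filter can be sketched on a single read sequence. The two functions below are hypothetical illustrations of the behaviour described above, not the pipeline's implementation, which delegates trimming to Cutadapt.

```shell
# Illustration of the --cut_end sign convention (same as cutadapt -u):
# a positive N removes N bases from the 5' end, a negative N from the 3' end.
cut_end() {
  # $1 = read sequence, $2 = signed number of bases to remove
  seq=$1
  n=$2
  len=${#seq}
  if [ "$n" -ge 0 ]; then
    start=$((n + 1)); end=$len
  else
    start=1; end=$((len + n))
  fi
  if [ "$start" -gt "$end" ]; then
    echo ""                                   # every base was removed
  else
    printf '%s\n' "$seq" | cut -c"$start"-"$end"
  fi
}

# Length filter applied last; succeeds if the read is long enough to be kept.
passes_min_length() {
  # $1 = read sequence, $2 = minimum length (pipeline default: 20)
  [ "${#1}" -ge "${2:-20}" ]
}
```

For example, cut_end ACGTACGT 2 drops the two 5' bases, while cut_end ACGTACGT -3 drops the three 3' bases.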
Ribocutter is a tool for designing sgRNA templates to deplete unwanted abundant contaminants.
- --skip_ribocutter skips Ribocutter
- --guide_number number of guides to design (default: 50)
- --max_reads maximum number of reads analysed (default: 1000000)
- --min_read_length minimum read length threshold for reads to be analysed
- --extra_ribocutter_args string specifying additional arguments that will be appended to the ribocutter command
- --skip_premap skips pre-mapping to common contaminant sequences
- --bowtie2_args allows users to input custom arguments for Bowtie2
- --star_args enables the input of specific arguments for STAR
The preset alignment settings for both STAR and Bowtie2 are optimized for standard ribo-seq analysis. Altering these arguments is generally not advised unless specific data requirements or unique analytical needs arise.
- --strandedness specifies the library strandedness (options: forward, reverse or unstranded) (Required)
- --feature specifies the feature to be summarised at gene-level (options: exon or CDS) (default: exon)
- --skip_qc skips mapping length analysis and generation of ribo-seq QC plots
- --expected_length expected read lengths range. Used to report the proportion of reads of expected lengths in the aligned reads, for the generation of ribo-seq QC plots, and for specifying the range of read lengths used for P-site identification (default: 26:32).
Important: The --expected_length parameter does not filter reads based on this length range for any other analyses, including alignment, gene-level quantification or track data generation.
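The min:max format of --expected_length can be illustrated with a quick stand-alone check. The sketch below is an assumption-laden approximation: fraction_expected_length is a hypothetical function that works on an uncompressed FASTQ file, whereas the pipeline reports this proportion on aligned reads.

```shell
# Hypothetical helper: fraction of reads in a FASTQ file whose length falls
# within a range written as min:max (the --expected_length format).
fraction_expected_length() {
  # $1 = uncompressed FASTQ file, $2 = range (default: 26:32)
  range=${2:-26:32}
  awk -v lo="${range%%:*}" -v hi="${range##*:}" '
    NR % 4 == 2 {                              # sequence lines only
      total++
      len = length($0)
      if (len >= lo && len <= hi) in_range++
    }
    END {
      if (total > 0) printf "%.3f\n", in_range / total
      else print "NA"                          # empty input
    }
  ' "$1"
}
```

For gzip-compressed files, the reads would first need to be decompressed to a temporary file, since the function reads a plain file path.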
P-sites are identified with riboWaltz. It is strongly recommended to check the riboWaltz method to ensure the approach is suitable for your data. By default, the reads must be in length bins that satisfy periodicity to be used for P-site offset calculations. Additionally, the user can specify the following options:
- --skip_psite skips P-site identification and riboWaltz diagnostics plots
- --expected_length expected read lengths range. Used to report the proportion of reads of expected lengths in the aligned reads, for the generation of ribo-seq QC plots, and for specifying the range of RPF lengths used for P-site identification (default: 26:32).
- --periodicity_threshold specifies the periodicity threshold for read lengths to be considered for P-site identification (default: 50)
- --psite_method specifies the method used for P-site offset identification (options: ribowaltz (default) or global_max_5end).
  - For ribowaltz, P-site offsets are defined using riboWaltz.
  - For global_max_5end, P-site offsets are defined by the distances between the first nucleotide of the translation initiation site and the nucleotide corresponding to the global maximum of the read length-specific 5' end profiles (the first step of the riboWaltz method). Compared to riboWaltz-defined offsets, only the 5' extremities of the reads are considered for calculation and no further offset correction is performed after read-length global maximum identification.
- --exclude_start specifies the number of nucleotides 3' from the start codon to be excluded from CDS P-site quantification (default: 42, i.e. exclude the first 14 codons)
- --exclude_stop specifies the number of nucleotides 5' from the stop codon to be excluded from CDS P-site quantification (default: 27, i.e. exclude the last 9 codons)
- --genomic_codon_coverage converts transcript positions of inferred P-sites mapping on each triplet of annotated CDS and UTR to genomic coordinates and reports genomic coordinates and P-site counts in a BED file. By default, not enabled.
Important: P-sites and associated information are identified using transcriptomic coordinates. For this, a representative transcript is selected for each gene. The selection is based on the following hierarchy: CDS length > total length > number of exons > 5'UTR length > 3'UTR length. The selection is performed automatically by the pipeline using the information in the provided GTF file, and stored in *.longest_cds.transcript_info.tsv. The default representative transcript table can be overridden by the user by providing their own table with --transcript_info (see instructions above).
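The arithmetic behind the exclude_start and exclude_stop defaults (42 nt = 14 codons, 27 nt = 9 codons) can be checked with a short sketch. The function names are hypothetical, and the calculation only illustrates how many CDS nucleotides remain for quantification; it makes no claim about the exact coordinate convention used by the pipeline.

```shell
# Hypothetical arithmetic sketch: size of the CDS window that remains after
# excluding nucleotides downstream of the start and upstream of the stop codon.
cds_window_nt() {
  # $1 = CDS length (nt), $2 = exclude_start (default: 42), $3 = exclude_stop (default: 27)
  remaining=$(( $1 - ${2:-42} - ${3:-27} ))
  [ "$remaining" -gt 0 ] || remaining=0      # short CDSs leave no window at all
  echo "$remaining"
}

# Converts a nucleotide count into whole codons.
nt_to_codons() {
  echo $(( $1 / 3 ))
}
```

With the defaults, a 900 nt CDS keeps 831 nt (277 codons) for P-site quantification, while a CDS of 69 nt or shorter keeps none.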
- --bin_size specifies the bin size for calculating coverage (i.e. the number of nt per bin). Bins are short consecutive counting windows of a defined size. (default: 1)
- --track_format specifies the output file type (options: bigwig (default) or bedgraph)
The pipeline outputs results in a number of folders:
.
├── annotation
├── preprocessed
├── fastqc
├── premapped
├── mapped
├── deduplicated
├── riboseq_qc
├── featurecounts
├── psites
├── coverage_tracks
├── ribocutter
├── multiqc
└── pipeline_info
- annotation contains information on the representative transcript per gene used for ribo-seq QC and P-site analyses, as well as Bowtie2 and STAR indexes and annotation files used by the pipeline
  - *.longest_cds.transcript_info.tsv TSV file with a single representative transcript for each gene, with information on CDS start, length and end
  - bowtie2 and star folders with indexes used for aligning reads (if --save_index enabled)
- preprocessed contains reads pre-processed according to user settings for UMI extraction and trimming and filtering (if --save_trimmed enabled) and the corresponding logs. The *.filtered.fastq.gz files are used for downstream alignment steps.
- fastqc contains FastQC reports of pre-processed reads (*.filtered.fastq.gz)
- premapped contains files from aligning reads to contaminants:
  - *.bam BAM format alignments to contaminants
  - *.bam.seqs.gz mapping locations and sequences of contaminant-mapped reads
  - *.unmapped.fastq.gz sequencing reads that did not map to the contaminants in FASTQ format
  - *.premap.log is the Bowtie2 log file
- mapped contains files resulting from alignment to the genome and transcriptome:
  - *.Aligned.sortedByCoord.out.bam read alignments to the genome in BAM format
  - *.Aligned.toTranscriptome.out.bam read alignments to the transcriptome in BAM format
  - *.Log.final.out is the STAR log output file
- deduplicated contains files resulting after deduplication based on genomic or transcriptomic location and UMIs:
  - *.genome.dedup.sorted.bam UMI deduplicated alignments to the genome, in BAM format
  - *.transcriptome.dedup.sorted.bam UMI deduplicated alignments to the transcriptome, in BAM format
- riboseq_qc contains QC plots informing on read and mapping lengths, frame, periodicity, signal around start and stop codons, contaminants content, duplication, fraction of useful (mapped to protein-coding transcripts) reads, proportion of reads remaining after alignment steps etc., and folders containing other QC results.
  - *.qc_results.pdf per-sample QC reports with specialised ribo-seq quality metrics stratified by read length
  - mapping_length_analysis contains CSV files with number of raw and mapped reads by length:
    - *.after_premap.csv
    - *.before_dedup.csv
    - *.after_dedup.csv
  - multiqc_tables contains TSV files with sample summary metrics for MultiQC
  - read_fate contains sample-specific HTML files tracking read fate through the pipeline steps, a visualisation that helps with understanding the yield of useful reads and with troubleshooting.
  - pca contains PCA plots and rlog-normalised count tables. Only produced if 3 samples or more are analysed.
  - rust_analysis contains RUST metafootprint analysis, with plots showing the Kullback–Leibler divergence (K–L) profiles stratified by read length, using the inferred P-sites.
- featurecounts contains gene-level quantification of the UMI deduplicated alignments to the genome
- psites contains P-sites information, codon coverage and CDS coverage tables, and riboWaltz diagnostic plots:
  - psite_offset.tsv.gz P-site offsets for each read-length for all samples
  - *psite.tsv.gz sample-specific P-site information for each read
  - ribowaltz_qc folder containing P-site diagnostic plots generated by riboWaltz
  - *.coverage_psite.tsv.gz P-site counts over the CDS of representative transcripts, or over a CDS window excluding a specified region downstream of start codons and upstream of stop codons
  - codon_coverage_psite.tsv.gz P-site counts on each triplet of annotated CDS and UTR for each transcript
  - codon_coverage_rpf.tsv.gz RPF counts on each triplet of annotated CDS and UTR for each transcript
  - offset_plot contains plots that detail P-site assignment using the method
- coverage_tracks contains track files in BED and bigWig or bedGraph format
  - *.bigwig UMI deduplicated genome coverage tracks in bigWig format
  - *.bedgraph UMI deduplicated genome coverage tracks in bedGraph format
  - *.genome.dedup.bed.gz UMI deduplicated alignments to the genome in BED format
  - *.transcriptome.dedup.bed.gz UMI deduplicated alignments to the transcriptome in BED format
  - psite/*.codon_coverage_psite.bed.gz folder containing BED files with the genomic coordinates and counts of inferred P-sites mapping on each triplet of annotated CDS and UTR for each transcript. Obtained by converting transcriptomic positions from codon_coverage_psite.tsv.gz to genomic coordinates.
- ribocutter contains results of Ribocutter run on the pre-processed reads of minimum length defined by user, and minimum length of 23 nt, respectively
  - *.csv CSV files with the designed oligo sequences, their target sequence, the fraction of the library that each oligo targets and the total fraction of the library targeted by all oligos
  - ribocutter.pdf plot showing the estimated fraction of the library targeted by the guides designed using Ribocutter
- multiqc contains a MultiQC report summarising the FastQC reports, pre-mapping and mapping logs, and summarised ribo-seq QC metrics
- pipeline_info contains the execution reports, traces and timelines generated by Nextflow:
  - execution_report.html
  - execution_timeline.html
  - execution_trace.txt
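After a run, a quick way to see which of the top-level output folders listed above were produced is sketched below. The function missing_output_dirs is a hypothetical helper, not part of the pipeline. A missing folder is not necessarily an error, since several steps are optional (for example, deduplication is only relevant when UMIs are used).

```shell
# Hypothetical helper: prints the names of expected top-level output folders
# that are absent from a results directory.
missing_output_dirs() {
  # $1 = results directory (pipeline default: ./results)
  for d in annotation preprocessed fastqc premapped mapped deduplicated \
           riboseq_qc featurecounts psites coverage_tracks ribocutter \
           multiqc pipeline_info; do
    [ -d "${1:-./results}/$d" ] || echo "$d"
  done
}
```

Running missing_output_dirs ./results prints nothing when every folder is present.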
When a Nextflow pipeline requires multiple Docker images, it can sometimes fail to pull them, leading to pipeline execution failures. In such scenarios, pre-downloading the container images to a designated location on your system can prevent this.
Follow these steps to pre-download and cache the necessary images:
- Identify the desired location on your system where you want to store the container images.
- Set the NXF_SINGULARITY_CACHEDIR environment variable to point to this chosen location. For example, you can add the following line to your shell profile or run script:
export NXF_SINGULARITY_CACHEDIR=/path/to/image/cache
- Make sure you have Singularity installed. Please use the same version you intend to use for running the pipeline.
- Run the code below to pre-download and cache the required Docker images:
#!/bin/sh
# A script to pre-download Singularity images required by iraiosub/riboseq-flow pipeline
# change dir to your NXF_SINGULARITY_CACHEDIR path
cd /path/to/image/cache
singularity pull iraiosub-nf-riboseq-latest.img docker://iraiosub/nf-riboseq
singularity pull iraiosub-mapping-length-latest.img docker://iraiosub/nf-riboseq-qc
singularity pull iraiosub-nf-riboseq-dedup-latest.img docker://iraiosub/nf-riboseq-dedup
riboseq-flow is written and maintained by Ira Iosub in Prof. Jernej Ule's lab at The Francis Crick Institute. It is based on a Snakemake pipeline and was developed in collaboration with that pipeline's original author, Oscar Wilkins.
Contact email: ira.iosub@crick.ac.uk
Please ensure the paper and Zenodo DOIs above are cited in any reuse, fork, or hosted implementation of this workflow.
riboseq-flow is under active development by Ira Iosub. For queries related to the pipeline, raise an issue on GitHub. If you are interested in building more functionality or want to get involved please reach out.
We welcome contributions to riboseq-flow.
If you wish to make an addition or change to the pipeline, please follow these steps:
- Open an issue to detail the proposed fix or feature and select the appropriate label.
- Fork this repo.
- Create a new branch based on the dev branch, with a short, descriptive name, e.g. feat-colours for making changes to a colour palette.
- Modify the code exclusively on this new branch and mention the relevant issue in the commit messages.
- When your modifications are complete, submit a pull request to the dev branch describing the changes.
- Request a review from iraiosub on your pull request.
- The pull request will trigger a workflow execution on GitHub Actions for continuous integration (CI) of the pipeline. This is designed to automatically test riboseq-flow whenever a pull request is made to the main or dev branches of the repository. It ensures that the pipeline runs correctly in an Ubuntu environment, helping to catch any issues or errors early in the development process.
- User-friendly: requires only minimal command-line experience.
- Portable and reproducible: runs identically across systems using Docker or Singularity.
- Self-contained: no manual dependency installation—each process runs in a container with the exact software it needs.
- Reproducible results: produces the same outputs regardless of platform, a cornerstone of computational reproducibility.
- Scalable: efficiently processes many samples in parallel for comparative studies.
- Integrated QC: combines data processing with extensive ribo-seq–specific QC.
- Transparent: tracks read counts at every step and provides custom QC reports for each sample.
- Insightful: visualises read fate and key metrics, helping users make informed downstream decisions.
- Open and FAIR: fully version-controlled, openly available, and compliant with FAIR principles.
