biodiversitycellatlas/bca_preprocessing

BCA Pre-Processing Pipeline


Table of Contents

  1. Overview
  2. Installation
  3. Setup
  4. Usage
  5. Output
  6. Citations

Overview

This Nextflow pipeline pre-processes single-cell and single-nucleus RNA-seq data. It accepts FASTQ files from several sequencing platforms, currently:

  • Parse Biosciences
  • BD Rhapsody
  • 10x Genomics
  • OAK-seq
  • Ultima Genomics
  • sci-RNA-seq3
  • and others, when providing a seqspec file

The pipeline processes the FASTQ files according to the chosen sequencing technique. Whenever possible, we compared our results to the commercial pre-processing pipeline for that technique; for example, we compared our Parse Biosciences results to the official split-pipe pipeline from Parse Biosciences. While we cannot provide this commercial software directly, you can install it yourself (e.g. by following these instructions) and provide the path to the installation in the configuration file. It will then be executed alongside the BCA pre-processing pipeline.

The pipeline will produce the following output files:

  • Quality control reports for raw FASTQ files
  • Raw & Filtered count matrices (exonic, exonic & intronic) from STARsolo and/or Alevin-fry
  • (Optional) mtDNA and rRNA statistics
  • (Optional) Sequencing saturation analysis
  • (Optional) Extended gene annotation file
  • (Optional) Filtered count matrices (h5 files) from the CellBender step
  • (Optional) Kraken2 taxonomic classification
  • MultiQC report

[Figure: pipeline diagram]


Installation

  1. Download the latest release

Go to the Releases page and download the .zip or .tar.gz file for the version you want.

  2. Unpack the files

unzip bca_preprocessing-<version>.zip
cd bca_preprocessing-<version>

or for tar.gz:

tar -xzf bca_preprocessing-<version>.tar.gz
cd bca_preprocessing-<version>

  3. Download submodules

The pipeline uses two submodules: 10x_saturate, to plot the saturation curve, and GeneExt, to produce an extended gene annotation file. They are not included automatically, so they must be fetched explicitly by running:

bash fetch_submodules.sh

  4. Conda & Nextflow

To run the pipeline, you must have Conda and Nextflow installed. On an HPC cluster, a module system may be available instead, which simplifies switching between software versions. To see which modules are available and how to load them:

# List currently loaded modules
module list

# Check for available Nextflow modules
module avail Nextflow

# Load Nextflow module (make sure to change version accordingly!)
module load Nextflow/24.04.3

To test the installations, please try:

# Verify conda installation
conda -h

# Verify nextflow installation
nextflow -h
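
The two checks above can be combined into a small helper that reports every missing tool in one pass. This is a minimal sketch, not part of the pipeline; the function name check_tools is hypothetical.

```shell
#!/bin/sh
# Report whether each named command is available on PATH.
# Returns the number of missing commands (0 means all were found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Usage: check the two tools the pipeline needs.
check_tools conda nextflow || echo "Install or load the missing tools before continuing."
```

On an HPC system, run this after loading the relevant modules so that the result reflects the environment the pipeline will actually see.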

  5. (Optional) Installing external pipelines as validation

For some sequencing technologies (see the table below), you can run an external pipeline simultaneously with the BCA pre-processing pipeline after following these installation instructions. Depending on the sequencing technique, you only have to provide either the path to the installation or a boolean flag that enables or disables execution in the conf/custom_parameters.config file (see the Setup section below); the external pipeline then starts automatically.

This option is limited to the following sequencing technologies:

| Sequencing technology | External pipeline | Requires manual installation |
|---|---|---|
| 10x Genomics | 10x Genomics Cell Ranger | |
| OAK-seq | 10x Genomics Cell Ranger | |
| Ultima Genomics | 10x Genomics Cell Ranger | |
| Parse Biosciences | split-pipe | ✔️ |
| BD Rhapsody | BD Rhapsody™ Sequence Analysis Pipeline | ✔️ |

Setup

1. Create a samplesheet

The samplesheet should be a comma-separated file that specifies the names and locations of the FASTQ files and other details needed for pipeline execution. The roles of the FASTQ files depend on the sequencing technique: in some protocols R1 contains the cDNA, while in others it contains the cell barcode and UMIs. Check the available documentation for your protocol or inspect the files manually.

When analysing sci-RNA-seq3 data, you must also provide the p5, p7 and rt indices for each sample. These indices are defined in the sci-RNA-seq3 barcode whitelist file, where you can match each sequence to its corresponding barcode, which is the value used in the samplesheet. Note also that for sci-RNA-seq3 data the p5 + p7 sequences should be present in the headers of the FASTQ files, see this bcl2fastq script.

In the case of Parse Biosciences data, the rt column should be filled with the group-well definitions, where:

Wells are specified in blocks, ranges, or individually like this:
>    'A1:C6' specifies a block as [top-left]:[bottom-right]; A1-A6, B1-B6, C1-C6.
>    'A1-B6' specifies a range as [start]-[end]; A1-A12, B1-B6.
>    'C4' specifies a single well.
Multiple selections are joined by commas (no space), e.g. 'A1-A6,B1:D3,C4'
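
To make the block and range semantics concrete, the following sketch expands a selection into individual wells. It is an illustration of the rules quoted above, not code taken from split-pipe or from this pipeline; the function names are hypothetical, and the 12-column plate width is an assumption inferred from the 'A1-A12' example.

```shell
#!/bin/sh
# Expand well selections into individual wells, one per line.
# Assumption: 12 columns per plate row (inferred from the 'A1-A12' example).
COLS=12

row_index()  { printf '%d' "'$1"; }                   # 'A' -> 65
row_letter() { printf "\\$(printf '%03o' "$1")"; }    # 65 -> 'A'

expand_wells() {
    spec=$1
    case $spec in
        *:*) sep=: ;;                     # block: [top-left]:[bottom-right]
        *-*) sep=- ;;                     # range: [start]-[end]
        *)   echo "$spec"; return 0 ;;    # single well
    esac
    start=${spec%%"$sep"*}
    end=${spec#*"$sep"}
    c1=${start#?}; r1=$(row_index "${start%"$c1"}")
    c2=${end#?};   r2=$(row_index "${end%"$c2"}")
    r=$r1
    while [ "$r" -le "$r2" ]; do
        if [ "$sep" = ":" ]; then
            lo=$c1; hi=$c2                # same column window on every row
        else
            lo=1; hi=$COLS                # row-major order across full rows
            if [ "$r" -eq "$r1" ]; then lo=$c1; fi
            if [ "$r" -eq "$r2" ]; then hi=$c2; fi
        fi
        c=$lo
        while [ "$c" -le "$hi" ]; do
            echo "$(row_letter "$r")$c"
            c=$((c + 1))
        done
        r=$((r + 1))
    done
}

# Split a comma-joined selection and expand each part in turn.
expand_selection() {
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs
    for part in "$@"; do
        expand_wells "$part"
    done
}

# Usage: prints A1, A2, A3 and C4, one per line.
expand_selection 'A1:A3,C4'
```

The sketch does not remove duplicates, so overlapping selections list the shared wells more than once.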

In the table below, the available variables are summarized:

| Variable | Required/Optional | Description |
|---|---|---|
| sample | Required | Must be unique unless you want the FASTQ files to be merged after demultiplexing. |
| fastq_cDNA | Required | Path to the FASTQ file containing the cDNA. |
| fastq_CB_UMI | Required | Path to the FASTQ file containing the cell barcode & UMI. |
| fastq_indices | Optional | Path to the FASTQ index file(s). To provide both I1 and I2, end the path with an asterisk, like /path/name_I* |
| expected_cells | Required | Number of expected cells. |
| p5 | Optional | Only required for sci-RNA-seq3. |
| p7 | Optional | Only required for sci-RNA-seq3. |
| rt | Optional | Only required for sci-RNA-seq3 & Parse Biosciences (group-well definition). |

The table below illustrates how the samplesheet is filled in for the different sequencing techniques; an example samplesheet is also published here. The names in parentheses in the p5 and rt columns are the official names, as commonly used in the official documentation of the protocol.

| sample | fastq_cDNA | fastq_BC_UMI | fastq_indices | expected_cells | p5 | p7 | rt |
|---|---|---|---|---|---|---|---|
| sciRNAseq3_example | R2 | R1 | | expected_cells | p5 | p7 | rt |
| bd_rhapsody_example | R2 | R1 | | expected_cells | | | |
| parse_biosciences_example | R1 | R2 | | expected_cells | | | rt (wells) |
| 10xv3_example | R2 | R1 | | expected_cells | | | |
| oak_seq_example | R2 | R1 | | expected_cells | | | |
| ultima_genomics_example | R2 | R1 | | expected_cells | | | |
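
Before starting a run, it can help to check that the samplesheet header and rows are consistent. The sketch below is a hypothetical helper, not part of the pipeline. It assumes the column name fastq_BC_UMI from the example table (the variable table spells it fastq_CB_UMI, so confirm the spelling against the published example samplesheet), and it splits on every comma, so it does not handle quoted fields such as a Parse Biosciences rt value that itself contains commas. The paths and the cell count in the usage example are placeholders.

```shell
#!/bin/sh
# Check that a samplesheet has the required columns and that every row
# has as many comma-separated fields as the header.
# Limitation: quoted fields containing commas are not supported.
validate_samplesheet() {
    awk -F, '
        NR == 1 {
            n = NF
            for (i = 1; i <= NF; i++) seen[$i] = 1
            count = split("sample fastq_cDNA fastq_BC_UMI expected_cells", req, " ")
            for (j = 1; j <= count; j++)
                if (!(req[j] in seen)) { print "missing column: " req[j]; bad = 1 }
            next
        }
        NF != n { print "line " NR ": expected " n " fields, found " NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Usage with a one-sample 10x Genomics sheet (placeholder paths and values).
sheet=$(mktemp)
cat > "$sheet" <<'EOF'
sample,fastq_cDNA,fastq_BC_UMI,fastq_indices,expected_cells,p5,p7,rt
10xv3_example,/path/to/sample_R2.fastq.gz,/path/to/sample_R1.fastq.gz,,5000,,,
EOF
validate_samplesheet "$sheet" && echo "samplesheet looks consistent"
rm -f "$sheet"
```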

2. Edit (or create) a custom configuration file

To set the custom parameters for each run, the easiest solution is to fill in the fields in conf/custom_parameters.config. This custom configuration file extends the general nextflow.config file in the root of this repository. It contains the minimum set of variables needed to run the pipeline, which are described in more detail in the table below.

To customize the run, you can add other (optional) variables to the conf/custom_parameters.config file. If you have a multi-species experiment, one configuration file per species must be created in order to analyze the data with the corresponding genome annotation files.

Within each custom configuration file the following variables can be defined:

| Variable | Required/Optional | Description |
|---|---|---|
| input | Required | Path to the samplesheet. |
| outdir | Required | Path to the results/output directory; must exist before running. |
| protocol | Required | Specifies the sequencing technology used. Must be one of: "oak_seq", "10xv1", "10xv2", "10xv3", "10xv4", "parse_biosciences_WT_mini", "parse_biosciences_WT", "bd_rhapsody", "sciRNAseq3", "ultima_genomics" or "seqspec". |
| bc_whitelist | Required | Path or link to the barcode whitelist file(s). If it is a link, the file is automatically downloaded and, if applicable, unzipped. |
| ref_fasta | Required | Path to the genome FASTA file used for mapping reads. |
| ref_gtf | Required | Path to the GTF/GFF file formatted for STARsolo. |
| run_method | Optional | Method of running the pre-processing pipeline, as shown in the pipeline diagram. Currently either "standard", "geneext_only" or "exteral_pipeline_only". Default is "standard". |
| perform_demultiplexing | Optional | Boolean flag to enable or disable demultiplexing of the FASTQ files, where applicable. Default is true. |
| seqspec_file | Optional | Path to the seqspec file. |
| mapping_software | Optional | Software used to map reads. Must be one of: "starsolo", "alevin" or "both". Default is "starsolo". |
| perform_geneext | Optional | Boolean flag to enable or disable the gene extension step in preprocessing. Default is false. |
| perform_featurecounts | Optional | Boolean flag to enable or disable calculation of mtDNA & rRNA percentages. Default is false. |
| perform_kraken | Optional | Boolean flag to enable or disable Kraken2 classification of unmapped reads. Default is false. |
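
As an illustration, the snippet below writes a minimal configuration file for a 10x Genomics v3 run and creates the output directory, which must exist before the run. It is a sketch under assumptions: the variables are assumed to live in a Nextflow params scope (check conf/custom_parameters.config in the repository for the actual layout), and every path, including the RUN_DIR variable, is a hypothetical placeholder.

```shell
#!/bin/sh
# Hypothetical run directory; set RUN_DIR to use your own location.
run_dir=${RUN_DIR:-$(mktemp -d)}
mkdir -p "$run_dir/results"      # outdir must exist before the run starts

cat > "$run_dir/custom_parameters.config" <<EOF
// Assumption: the pipeline reads these variables from the 'params' scope.
params {
    input = '$run_dir/samplesheet.csv'
    outdir = '$run_dir/results'
    protocol = '10xv3'
    bc_whitelist = '/path/to/barcode_whitelist.txt'
    ref_fasta = '/path/to/genome.fasta'
    ref_gtf = '/path/to/annotation.gtf'

    // Optional settings, written out with their documented defaults.
    mapping_software = 'starsolo'
    perform_geneext = false
    perform_featurecounts = false
    perform_kraken = false
}
EOF

echo "wrote $run_dir/custom_parameters.config"
```

For a multi-species experiment, the same snippet can be run once per species with a different RUN_DIR and the matching ref_fasta and ref_gtf paths.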

To modify the behaviour of certain processes or enable external pipelines, additional variables can be added to the configuration file. An overview of the extended custom parameters is listed here.


Usage

Pre-requisites:

Running the Pipeline

Warning

The pipeline must be run from the conda base environment; it cannot activate its own environments properly when another environment is already active. Both nextflow and conda must be available on the command line.
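
A quick way to confirm this before launching is to inspect the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment. The helper below is a minimal sketch; the function name in_conda_base is hypothetical.

```shell
#!/bin/sh
# Succeeds only when the active conda environment is 'base'.
# CONDA_DEFAULT_ENV is set by 'conda activate' and is unset when no
# environment is active.
in_conda_base() {
    [ "${CONDA_DEFAULT_ENV:-}" = "base" ]
}

if in_conda_base; then
    echo "conda base environment is active"
else
    echo "active environment: '${CONDA_DEFAULT_ENV:-none}'; switch with 'conda activate base'" >&2
fi
```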


# Run from the repository root, where main.nf is located
nextflow run main.nf \
    -profile <institution_config>,conda \
    -c </path/to/custom_parameters.config> \
    -w </path/to/workdir>

# OR - submitting pipeline through a bash script
sbatch submit_nextflow.sh main.nf

Output

The pipeline outputs are organized into sub-directories based on the analysis steps performed. Below is an overview of the possible directory structure:

output_directory/
├── pipeline_info/          # Run execution details (configs, logs, samplesheets)
├── fastqc/                 # Quality control reports for raw FASTQ files
├── mapping_STARsolo/       # STARsolo count matrices and stats
├── mapping_alevin/         # Alevin-fry count matrices and AlevinQC
├── summary_results/        # Aggregated QC report
│
├── cellbender/             # (Optional) CellBender filtered matrices
├── saturation/             # (Optional) Sequencing saturation analysis
├── gene_ext/               # (Optional) Extended gene annotations
├── rRNA_mtDNA/             # (Optional) mtDNA and rRNA statistics
├── kraken/                 # (Optional) Kraken2 taxonomic classification
│
├── CellRanger_pipeline/    # (Optional) External Cell Ranger outputs
├── ParseBio_pipeline/      # (Optional) External Split-pipe outputs
└── BDrhapsody_pipeline/    # (Optional) External BD Rhapsody outputs
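
Because several of these directories are optional, a short check of which ones a finished run actually produced can be useful. The sketch below only tests for the directory names listed above; the function name report_outputs is hypothetical, and output_directory stands in for your own outdir.

```shell
#!/bin/sh
# List which of the documented output sub-directories exist under a
# given results directory.
report_outputs() {
    outdir=$1
    for sub in pipeline_info fastqc mapping_STARsolo mapping_alevin summary_results \
               cellbender saturation gene_ext rRNA_mtDNA kraken \
               CellRanger_pipeline ParseBio_pipeline BDrhapsody_pipeline; do
        if [ -d "$outdir/$sub" ]; then
            echo "present: $sub"
        else
            echo "absent: $sub"
        fi
    done
}

# Usage: pass the outdir used for the run (defaults to ./output_directory).
report_outputs "${1:-output_directory}"
```

An absent optional directory usually just means the corresponding step was not enabled in the configuration file.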

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.