BCA Pre-Processing Pipeline

Overview

This Nextflow pipeline pre-processes single-cell and single-nucleus RNA-seq data. It accepts FASTQ files from several sequencing platforms, currently:

Parse Biosciences
BD Rhapsody
10x Genomics
OAK-seq
Ultima Genomics
sci-RNA-seq3
Other platforms, when a seqspec file is provided

The pipeline processes the FASTQ files according to the chosen sequencing technique. Where possible, we compared our results with the commercial pre-processing pipeline for that technique; for example, we compared our Parse Biosciences results with the official split-pipe pipeline from Parse Biosciences. We cannot distribute this commercial software, but you can install it yourself (e.g. by following these instructions) and set the path to the installation in the configuration file. The external pipeline is then executed alongside the BCA pre-processing pipeline.

The pipeline will produce the following output files:

Quality control reports for raw FASTQ files
Raw and filtered count matrices (exonic, exonic & intronic) from STARsolo and/or Alevin-fry
(Optional) mtDNA and rRNA statistics
(Optional) Sequencing saturation analysis
(Optional) Extended gene annotation file
(Optional) Filtered count matrices (h5 files) from the CellBender step
(Optional) Kraken2 taxonomic classification
MultiQC report

Installation

Download the latest release

Go to the Releases page and download the .zip or .tar.gz file for the version you want.

Unpack the files

unzip bca_preprocessing-<version>.zip
cd bca_preprocessing-<version>

or for tar.gz:

tar -xzf bca_preprocessing-<version>.tar.gz
cd bca_preprocessing-<version>

Download submodules

The pipeline uses two submodules: 10x_saturate, to plot the saturation curve, and GeneExt, to produce an extended gene annotation file. They are not included automatically and must be installed explicitly by running:

bash fetch_submodules.sh

Conda & Nextflow

To run the pipeline, you must have Conda and Nextflow installed. On an HPC system, a module system may be available instead, which simplifies switching between software versions. To see which modules are loaded or available, and how to load one:

# List currently loaded modules
module list

# Check for available Nextflow modules
module avail Nextflow

# Load Nextflow module (make sure to change version accordingly!)
module load Nextflow/24.04.3

To test the installations, please try:

# Verify conda installation
conda -h

# Verify nextflow installation
nextflow -h
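
The two checks above can also be folded into a small helper that reports every missing tool at once. This is only a sketch: the function name require_tools is hypothetical and not part of the pipeline.

```shell
# Return non-zero if any of the named tools is missing from PATH.
# Each missing tool is reported on stderr.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use before launching a run:
#   require_tools conda nextflow || exit 1
```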

(Optional) Installing external pipelines as validation

For some sequencing technologies (see the table below), an external pipeline can run simultaneously with the BCA pre-processing pipeline once you have followed these installation instructions. Depending on the sequencing technique, you only need to set either the path to the installation or a boolean flag that enables or disables execution in the conf/custom_parameters.config file (see Setup below); the external pipeline then starts automatically.

This option is limited to the following sequencing technologies:

Sequencing technology	External pipeline	Requires manual installation
10x Genomics	10x Genomics Cell Ranger	❌
OAK-seq	10x Genomics Cell Ranger	❌
Ultima Genomics	10x Genomics Cell Ranger	❌
Parse Biosciences	split-pipe	✔️
BD Rhapsody	BD Rhapsody™ Sequence Analysis Pipeline	✔️

Setup

1. Create a samplesheet

The samplesheet is a comma-separated file that specifies the names and locations of the input files, plus any details needed for pipeline execution. Which read holds which content depends on the sequencing technique: R1 may contain the cDNA, or it may contain the cell barcode and UMI. Check the documentation for your protocol or inspect the files manually.

When analysing sci-RNA-seq3 data, you must also provide the p5, p7 and rt indices for each sample. These indices are defined in the sci-RNA-seq3 barcode whitelist file, where you can match each sequence to its corresponding barcode, which is the value used in the samplesheet. Note also that for sci-RNA-seq3 data the p5 + p7 sequences should be present in the headers of the FASTQ files (see this bcl2fastq script).

In the case of Parse Biosciences data, the rt column should be filled with the group-well definitions. Wells are specified in blocks, ranges, or individually:

'A1:C6' specifies a block as [top-left]:[bottom-right]; A1-A6, B1-B6, C1-C6.
'A1-B6' specifies a range as [start]-[end]; A1-A12, B1-B6.
'C4' specifies a single well.

Multiple selections are joined by commas (no space), e.g. 'A1-A6,B1:D3,C4'
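
To make the selection syntax concrete, the sketch below expands a group-well definition into individual wells. It is an illustration only, not code from the pipeline or from split-pipe: the function name expand_wells is hypothetical, and the plate layout (rows A to H, 12 columns per row) is an assumption inferred from the 'A1-B6' example above.

```shell
# Expand a well selection such as 'A1-A6,B1:D3,C4' into one well per line.
# Assumption: rows A-H with 12 columns per row.
expand_wells() {
    printf '%s\n' "$1" | awk -v cols=12 '
    function row(w)     { return index("ABCDEFGH", substr(w, 1, 1)) - 1 }
    function col(w)     { return substr(w, 2) + 0 }
    function name(r, c) { return substr("ABCDEFGH", r + 1, 1) c }
    {
        n = split($0, parts, ",")
        for (p = 1; p <= n; p++) {
            if (split(parts[p], e, ":") == 2) {
                # Block: every well in the rectangle top-left to bottom-right
                for (r = row(e[1]); r <= row(e[2]); r++)
                    for (c = col(e[1]); c <= col(e[2]); c++)
                        print name(r, c)
            } else if (split(parts[p], e, "-") == 2) {
                # Range: row by row from the start well to the end well
                first = row(e[1]) * cols + col(e[1]) - 1
                last  = row(e[2]) * cols + col(e[2]) - 1
                for (i = first; i <= last; i++)
                    print name(int(i / cols), i % cols + 1)
            } else {
                # Single well
                print parts[p]
            }
        }
    }'
}
```

For example, `expand_wells 'A1:C6' | wc -l` counts the wells covered by a block, which is a quick way to check a definition before putting it in the rt column.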

In the table below, the available variables are summarized:

Variable	Required/Optional	Description
sample	Required	Must be unique unless you want the FASTQ files to be merged after demultiplexing.
fastq_cDNA	Required	Path to the FASTQ file containing cDNA
fastq_CB_UMI	Required	Path to the FASTQ file containing the cell barcode & UMI
fastq_indices	Optional	Path to the FASTQ index file(s). To provide both I1 and I2, use an asterisk in the path, e.g. /path/name_I*
expected_cells	Required	Number of expected cells
p5	Optional	Only required for sci-RNA-seq3
p7	Optional	Only required for sci-RNA-seq3
rt	Optional	Only required for sci-RNA-seq3 & Parse Biosciences (group-well definition)

The table below illustrates how the samplesheet is filled in for the different sequencing techniques; an example samplesheet is also published here. Names in parentheses in the p5 and rt columns are the official names, as commonly used in the official documentation of the protocol.

sample	fastq_cDNA	fastq_BC_UMI	expected_cells	p5	p7	rt
sciRNAseq3_example	R2	R1	expected_cells	p5	p7	rt
bd_rhapsody_example	R2	R1	expected_cells
parse_biosciences_example	R1	R2	expected_cells			rt (wells)
10xv3_example	R2	R1	expected_cells
oak_seq_example	R2	R1	expected_cells
ultima_genomics_example	R2	R1	expected_cells
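
Because the samplesheet is a plain comma-separated file, a quick structural check can catch missing or extra commas before a run. The sketch below is not part of the pipeline; the function name check_samplesheet is hypothetical. It assumes that every row should have the same number of fields as the header, with unused columns (such as p5, p7 or rt) left empty, and that no field contains a quoted comma.

```shell
# Report rows whose field count differs from the header row.
# Exits non-zero if any such row is found. Blank lines are ignored.
check_samplesheet() {
    awk -F',' '
        $0 == "" { next }
        !seen_header { expected = NF; seen_header = 1; next }
        NF != expected {
            printf "line %d: %d fields, expected %d\n", NR, NF, expected
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example use:
#   check_samplesheet conf/example_samplesheet.csv
```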

2. Edit (or create) a custom configuration file

The easiest way to set the custom parameters for a run is to fill in the fields in conf/custom_parameters.config. This custom configuration file extends the general nextflow.config file in the root of this repository. It contains the minimum set of variables needed to run the pipeline, which are described in more detail in the table below.

To customize the run, you can add other (optional) variables to the conf/custom_parameters.config file. If you have a multi-species experiment, one configuration file per species must be created in order to analyze the data with the corresponding genome annotation files.

Within each custom configuration file the following variables can be defined:

Variable	Required/Optional	Description
`input`	Required	Path to the samplesheet.
`outdir`	Required	Path to the results/output directory; must exist before running.
`protocol`	Required	Specifies the sequencing technology used (must be one of the following: `"oak_seq"`, `"10xv1"`, `"10xv2"`, `"10xv3"`, `"10xv4"`, `"parse_biosciences_WT_mini"`, `"parse_biosciences_WT"`, `"bd_rhapsody"`, `"sciRNAseq3"`, `"ultima_genomics"` or `"seqspec"`).
`bc_whitelist`	Required	Path or link to the barcode whitelist file(s). If it's a link, it will be automatically downloaded and unzipped if applicable.
`ref_fasta`	Required	Path to the genome FASTA file used for mapping reads.
`ref_gtf`	Required	Path to the GTF/GFF file formatted for STARsolo.
`run_method`	Optional	Run mode of the pre-processing pipeline, as shown in the pipeline diagram; currently one of `"standard"`, `"geneext_only"` or `"exteral_pipeline_only"`. Default is `"standard"`.
`perform_demultiplexing`	Optional	Boolean flag to enable or disable demultiplexing of the FASTQ files, where applicable. Default is `true`.
`seqspec_file`	Optional	Path to the seqspec file.
`mapping_software`	Optional	Software used to map reads (must be one of the following: `"starsolo"`, `"alevin"` or `"both"`). Default set to `"starsolo"`.
`perform_geneext`	Optional	Boolean flag to enable or disable the gene extension step in preprocessing. Default is `false`.
`perform_featurecounts`	Optional	Boolean flag to enable or disable calculation of mtDNA & rRNA percentages. Default is `false`.
`perform_kraken`	Optional	Boolean flag to enable or disable Kraken2 classification of unmapped reads. Default is `false`.

To modify the behaviour of certain processes or enable external pipelines, additional variables can be added to the configuration file. An overview of the extended custom parameters is listed here.

Usage

Pre-requisites:

Created a samplesheet in CSV format (see conf/example_samplesheet.csv)
Edited conf/custom_parameters.config or created a new custom configuration file
Conda & Nextflow available in base environment

Running the Pipeline

Warning

Run the pipeline from the conda base environment: it cannot activate its own environments properly when another environment is already active. Both nextflow and conda must be available on the command line.
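
One way to guard against launching from the wrong environment is to test the CONDA_DEFAULT_ENV variable, which conda sets to the name of the active environment. This is a minimal sketch, not part of the pipeline; the function name in_conda_base is hypothetical.

```shell
# Succeed only when the active conda environment is "base".
in_conda_base() {
    [ "${CONDA_DEFAULT_ENV:-}" = "base" ]
}

# Example use in a submission script:
#   in_conda_base || { echo "activate the conda base environment first" >&2; exit 1; }
```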


nextflow run main.nf \
    -profile <institution_config>,conda \
    -c </path/to/custom_parameters.config> \
    -w </path/to/workdir>

# OR - submitting pipeline through a bash script
sbatch submit_nextflow.sh main.nf

Output

The pipeline outputs are organized into sub-directories based on the analysis steps performed. Below is an overview of the possible directory structure:

output_directory/
├── pipeline_info/          # Run execution details (configs, logs, samplesheets)
├── fastqc/                 # Quality control reports for raw FASTQ files
├── mapping_STARsolo/       # STARsolo count matrices and stats
├── mapping_alevin/         # Alevin-fry count matrices and AlevinQC
├── summary_results/        # Aggregated QC report
│
├── cellbender/             # (Optional) CellBender filtered matrices
├── saturation/             # (Optional) Sequencing saturation analysis
├── gene_ext/               # (Optional) Extended gene annotations
├── rRNA_mtDNA/             # (Optional) mtDNA and rRNA statistics
├── kraken/                 # (Optional) Kraken2 taxonomic classification
│
├── CellRanger_pipeline/    # (Optional) External Cell Ranger outputs
├── ParseBio_pipeline/      # (Optional) External Split-pipe outputs
└── BDrhapsody_pipeline/    # (Optional) External BD Rhapsody outputs

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
