This Nextflow pipeline pre-processes single-cell and single-nucleus RNA-seq data. It accepts FASTQ files from multiple sequencing platforms, currently:
- Parse Biosciences
- BD Rhapsody
- 10x Genomics
- OAK-seq
- Ultima Genomics
- sci-RNA-seq3
- Other platforms, when a seqspec file is provided
The FASTQ files are processed according to the chosen sequencing technique. Where possible, we compared our results against the commercial pre-processing pipeline for that technique, for example our Parse Biosciences results against the official split-pipe pipeline from Parse Biosciences. We cannot provide this commercial software directly, but you can install it yourself (e.g. by following these instructions) and provide the path to the installation in the configuration file. It will then be executed alongside the BCA pre-processing pipeline.
The pipeline will produce the following output files:
- Quality control reports for raw FASTQ files
- Raw & Filtered count matrices (exonic, exonic & intronic) from STARsolo and/or Alevin-fry
- (Optional) mtDNA and rRNA statistics
- (Optional) Sequencing saturation analysis
- (Optional) Extended gene annotation file
- (Optional) Filtered count matrices (h5 files) from CellBender step
- (Optional) Kraken2 taxonomic classification
- MultiQC report
- Download the latest release
Go to the Releases page and download the .zip or .tar.gz file for the version you want.
- Unpack the files
unzip bca_preprocessing-<version>.zip
cd bca_preprocessing-<version>
or for tar.gz:
tar -xzf bca_preprocessing-<version>.tar.gz
cd bca_preprocessing-<version>
- Download submodules
The pipeline uses two submodules: 10x_saturate to plot the saturation curve, and GeneExt to produce an extended gene annotation file. They are not included automatically and must be installed explicitly by running:
bash fetch_submodules.sh
- Conda & Nextflow
To run the pipeline, Conda and Nextflow must be installed. On an HPC cluster, a module system may be available instead, which simplifies managing different software. To see which modules are available and how to load them:
# List currently loaded modules
module list
# Check for available Nextflow modules
module avail Nextflow
# Load Nextflow module (make sure to change version accordingly!)
module load Nextflow/24.04.3
To test the installations, please try:
# Verify conda installation
conda -h
# Verify nextflow installation
nextflow -h
- (Optional) Installing external pipelines as validation
After following the installation instructions for the relevant sequencing technology (see the table below), you can run external pipelines simultaneously with the BCA pre-processing pipeline. Depending on the sequencing technique, you only need to provide either the path to the installation in the conf/custom_parameters.config file or a boolean flag to enable or disable execution (see the Setup explanation below); the external pipeline will then start automatically.
This option is limited to the following sequencing technologies:
| Sequencing technology | External pipeline | Requires manual installation |
|---|---|---|
| 10x Genomics | 10x Genomics Cell Ranger | ❌ |
| OAK-seq | 10x Genomics Cell Ranger | ❌ |
| Ultima Genomics | 10x Genomics Cell Ranger | ❌ |
| Parse Biosciences | split-pipe | ✔️ |
| BD Rhapsody | BD Rhapsody™ Sequence Analysis Pipeline | ✔️ |
The samplesheet is a comma-separated file that specifies the names and locations of the FASTQ files and the other details needed for pipeline execution. Which read carries the cDNA depends on the sequencing technique: in some protocols R1 contains the cDNA, while in others it contains the cell barcode and UMI. Check the documentation of your protocol or inspect the files manually.
When analysing sci-RNA-seq3 data, you must also provide the p5, p7 and rt indices for each sample. These indices are defined in the sci-RNA-seq3 barcode whitelist file, where each sequence can be matched to the corresponding barcode name used in the samplesheet. Note that for sci-RNA-seq3 data the p5 and p7 sequences must be present in the headers of the FASTQ files, see this bcl2fastq script.
In the case of Parse Biosciences data, the rt column should be filled with the group-well definitions, where:
Wells are specified in blocks, ranges, or individually like this:
> 'A1:C6' specifies a block as [top-left]:[bottom-right]; A1-A6, B1-B6, C1-C6.
> 'A1-B6' specifies a range as [start]-[end]; A1-A12, B1-B6.
> 'C4' specifies a single well.
Multiple selections are joined by commas (no space), e.g. 'A1-A6,B1:D3,C4'
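The selection syntax above can be expanded mechanically. The following is a minimal bash sketch, not part of the pipeline or of split-pipe: `expand_wells`, `row_index` and `well_index` are hypothetical helpers, and the plate layout (rows A to H, 12 columns per row, as implied by the `A1-B6` example) is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the pipeline: expand a well selection
# into one well per line.
COLS=12            # assumed number of columns per plate row
ROWS="ABCDEFGH"    # assumed row letters

# Zero-based index of a row letter (A -> 0, B -> 1, ...).
row_index() {
  local before=${ROWS%%"$1"*}
  echo "${#before}"
}

# Zero-based, row-major index of a well such as "B6".
well_index() {
  echo $(( $(row_index "${1:0:1}") * COLS + ${1:1} - 1 ))
}

expand_wells() {
  local part parts first last r r0 r1 c i i0 i1
  IFS=',' read -ra parts <<< "$1"
  for part in "${parts[@]}"; do
    if [[ $part == *:* ]]; then
      # Block: [top-left]:[bottom-right]
      first=${part%%:*}; last=${part##*:}
      r0=$(row_index "${first:0:1}"); r1=$(row_index "${last:0:1}")
      for (( r = r0; r <= r1; r++ )); do
        for (( c = ${first:1}; c <= ${last:1}; c++ )); do
          echo "${ROWS:r:1}${c}"
        done
      done
    elif [[ $part == *-* ]]; then
      # Range: [start]-[end], walking the plate row by row
      i0=$(well_index "${part%%-*}"); i1=$(well_index "${part##*-}")
      for (( i = i0; i <= i1; i++ )); do
        echo "${ROWS:$(( i / COLS )):1}$(( i % COLS + 1 ))"
      done
    else
      # Single well
      echo "$part"
    fi
  done
}

# Expand the combined example from the text onto a single line.
expand_wells 'A1-A6,B1:D3,C4' | paste -sd' ' -
```

Expanding a selection this way before filling in the rt column makes it easier to spot wells that were assigned to more than one sample.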
In the table below, the available variables are summarized:
| Variable | Required/Optional | Description |
|---|---|---|
| sample | Required | Must be unique unless you want the FASTQ files to be merged after demultiplexing. |
| fastq_cDNA | Required | Path to the FASTQ file containing cDNA |
| fastq_CB_UMI | Required | Path to the FASTQ file containing the cell barcode & UMI |
| fastq_indices | Optional | Path to the FASTQ index file(s). To provide both I1 and I2, use an asterisk in the path, e.g. /path/name_I* |
| expected_cells | Required | Number of expected cells |
| p5 | Optional | Only required for sci-RNA-seq3 |
| p7 | Optional | Only required for sci-RNA-seq3 |
| rt | Optional | Only required for sci-RNA-seq3 & Parse Biosciences (group-well definition) |
The table below illustrates how the samplesheet is filled in for the different sequencing techniques; an example samplesheet is also published here. The names in parentheses in the p5 and rt columns are the official names, as commonly used in the documentation of the respective protocol.
| sample | fastq_cDNA | fastq_BC_UMI | fastq_indices | expected_cells | p5 | p7 | rt |
|---|---|---|---|---|---|---|---|
| sciRNAseq3_example | R2 | R1 | | expected_cells | p5 | p7 | rt |
| bd_rhapsody_example | R2 | R1 | | expected_cells | | | |
| parse_biosciences_example | R1 | R2 | | expected_cells | | | rt (wells) |
| 10xv3_example | R2 | R1 | | expected_cells | | | |
| oak_seq_example | R2 | R1 | | expected_cells | | | |
| ultima_genomics_example | R2 | R1 | | expected_cells | | | |
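As a rough illustration of the layout, the sketch below writes a two-row samplesheet and checks that every row has as many fields as the header. All paths and cell counts are placeholders, `check_columns` is a hypothetical helper, and the header follows the example table above (`fastq_BC_UMI`); confirm the exact column names against conf/example_samplesheet.csv. How a multi-part well selection containing commas should be quoted is not covered here, so the example uses a single range.

```shell
#!/usr/bin/env bash
# Illustrative samplesheet; every path and value below is a placeholder.
samplesheet=$(mktemp)
cat > "$samplesheet" <<'EOF'
sample,fastq_cDNA,fastq_BC_UMI,fastq_indices,expected_cells,p5,p7,rt
10xv3_example,/path/to/10x_R2.fastq.gz,/path/to/10x_R1.fastq.gz,,5000,,,
parse_biosciences_example,/path/to/parse_R1.fastq.gz,/path/to/parse_R2.fastq.gz,,10000,,,A1-A6
EOF

# Hypothetical helper: report rows whose field count differs from the header.
check_columns() {
  awk -F',' '
    NR == 1 { n = NF }
    NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
    END     { exit bad }
  ' "$1"
}

check_columns "$samplesheet" && echo "samplesheet columns look consistent"
```

Unused optional columns are left empty rather than dropped, so each row keeps the same number of commas as the header.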
To set the custom parameters for a run, the easiest approach is to fill in the fields in conf/custom_parameters.config. This custom configuration file extends the general nextflow.config file in the root of this repository. It contains the minimum set of variables needed to run the pipeline, which are described in more detail in the table below.
To customize the run, you can add other (optional) variables to the conf/custom_parameters.config file. If you have a multi-species experiment, one configuration file per species must be created in order to analyze the data with the corresponding genome annotation files.
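For a multi-species experiment, the per-species runs can be scripted. The sketch below only prints the commands it would launch, so they can be reviewed first. The configuration file names, the `my_institution` profile and the work directory root are hypothetical, `print_species_runs` is not part of the pipeline, and `main.nf` as the entry script is an assumption.

```shell
#!/usr/bin/env bash
# Print one `nextflow run` command per species-specific configuration file.
# Profile name and work directory root are placeholders, not pipeline defaults.
print_species_runs() {
  local profile=$1 workroot=$2 cfg name
  shift 2
  for cfg in "$@"; do
    name=$(basename "$cfg" .config)
    echo "nextflow run main.nf -profile ${profile},conda -c ${cfg} -w ${workroot}/${name}"
  done
}

print_species_runs my_institution /path/to/work \
  conf/species_a.config conf/species_b.config
```

Giving each species its own work directory is a design choice of this sketch; it keeps the intermediate files of the separate runs apart.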
Within each custom configuration file the following variables can be defined:
| Variable | Required/Optional | Description |
|---|---|---|
| input | Required | Path to the samplesheet. |
| outdir | Required | Path to the results/output directory; must exist before running. |
| protocol | Required | Specifies the sequencing technology used (must be one of the following: "oak_seq", "10xv1", "10xv2", "10xv3", "10xv4", "parse_biosciences_WT_mini" or "parse_biosciences_WT", "bd_rhapsody", "sciRNAseq3", "ultima_genomics" or "seqspec"). |
| bc_whitelist | Required | Path or link to the barcode whitelist file(s). If it's a link, it will be automatically downloaded and unzipped if applicable. |
| ref_fasta | Required | Path to the genome FASTA file used for mapping reads. |
| ref_gtf | Required | Path to the GTF/GFF file formatted for STARsolo. |
| run_method | Optional | Method of running the pre-processing pipeline, demonstrated in the pipeline diagram, currently either "standard", "geneext_only" or "exteral_pipeline_only". Default is set to "standard". |
| perform_demultiplexing | Optional | Boolean flag to enable or disable demultiplexing of the FASTQ files, where applicable. Default is true. |
| seqspec_file | Optional | Path to the seqspec file. |
| mapping_software | Optional | Software used to map reads (must be one of the following: "starsolo", "alevin" or "both"). Default set to "starsolo". |
| perform_geneext | Optional | Boolean flag to enable or disable the gene extension step in preprocessing. Default is false. |
| perform_featurecounts | Optional | Boolean flag to enable or disable calculation of mtDNA & rRNA percentages. Default is false. |
| perform_kraken | Optional | Boolean flag to enable or disable Kraken2 classification of unmapped reads. Default is false. |
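To make the table concrete, the sketch below writes an illustrative configuration file and checks that the six required variables are set. Every path is a placeholder, `check_required_params` is a hypothetical helper, and placing the variables inside a `params { }` scope is an assumption; follow the layout of the shipped conf/custom_parameters.config if it differs.

```shell
#!/usr/bin/env bash
# Write an illustrative custom configuration; every path is a placeholder.
config=$(mktemp)
cat > "$config" <<'EOF'
// Assumption: the pipeline reads these variables from the `params` scope.
params {
    input            = '/path/to/samplesheet.csv'
    outdir           = '/path/to/results'
    protocol         = '10xv3'
    bc_whitelist     = '/path/to/barcode_whitelist.txt'
    ref_fasta        = '/path/to/genome.fasta'
    ref_gtf          = '/path/to/annotation.gtf'
    mapping_software = 'starsolo'   // optional, documented default
    perform_kraken   = false        // optional, documented default
}
EOF

# Hypothetical helper: fail if a required variable has no assignment.
check_required_params() {
  local key missing=0
  for key in input outdir protocol bc_whitelist ref_fasta ref_gtf; do
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$1"; then
      echo "missing required variable: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

check_required_params "$config" && echo "all required variables are set"
```

The check only looks for an assignment per variable; it does not validate the values, for example whether `protocol` is one of the accepted names.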
To modify the behaviour of certain processes or enable external pipelines, additional variables can be added to the configuration file. An overview of the extended custom parameters is listed here.
- Created a samplesheet in CSV format (see conf/example_samplesheet.csv)
- Edited conf/custom_parameters.config or created a new custom configuration file
- Conda & Nextflow available in base environment
Warning
The pipeline must be run from the conda base environment; if another environment is already activated, it cannot activate its own environments properly. Both nextflow and conda must be available on the command line.
nextflow run main.nf \
    -profile <institution_config>,conda \
    -c </path/to/custom_parameters.config> \
    -w </path/to/workdir>
# OR - submitting pipeline through a bash script
sbatch submit_nextflow.sh main.nf
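Several of the requirements stated above (both tools on the command line, the conda base environment active, an existing output directory) can be checked before launching. The following is a minimal sketch rather than a script shipped with the pipeline: `check_tool`, `in_conda_base` and `preflight` are hypothetical helpers, `OUTDIR` is a placeholder for the `outdir` value in your configuration, and the environment check relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```shell
#!/usr/bin/env bash
# Pre-flight sketch; OUTDIR is a placeholder for `outdir` in your config.
OUTDIR=${OUTDIR:-/path/to/results}

# Return 0 if a command is on PATH, otherwise print a message and return 1.
check_tool() {
  command -v "$1" >/dev/null 2>&1 || { echo "not found on PATH: $1" >&2; return 1; }
}

# Return 0 only when the active conda environment is `base`.
in_conda_base() {
  [ "${CONDA_DEFAULT_ENV:-}" = "base" ]
}

# Run all checks and return non-zero if any of them failed.
preflight() {
  local status=0
  check_tool conda    || status=1
  check_tool nextflow || status=1
  in_conda_base || { echo "conda base environment is not active" >&2; status=1; }
  [ -d "$1" ] || { echo "output directory does not exist: $1" >&2; status=1; }
  return "$status"
}

if preflight "$OUTDIR"; then
  echo "ready to run"
else
  echo "fix the issues listed above first"
fi
```

If the output directory is the only thing missing, `mkdir -p "$OUTDIR"` creates it.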
The pipeline outputs are organized into sub-directories based on the analysis steps performed. Below is an overview of the possible directory structure:
output_directory/
├── pipeline_info/ # Run execution details (configs, logs, samplesheets)
├── fastqc/ # Quality control reports for raw FASTQ files
├── mapping_STARsolo/ # STARsolo count matrices and stats
├── mapping_alevin/ # Alevin-fry count matrices and AlevinQC
├── summary_results/ # Aggregated QC report
│
├── cellbender/ # (Optional) CellBender filtered matrices
├── saturation/ # (Optional) Sequencing saturation analysis
├── gene_ext/ # (Optional) Extended gene annotations
├── rRNA_mtDNA/ # (Optional) mtDNA and rRNA statistics
├── kraken/ # (Optional) Kraken2 taxonomic classification
│
├── CellRanger_pipeline/ # (Optional) External Cell Ranger outputs
├── ParseBio_pipeline/ # (Optional) External Split-pipe outputs
└── BDrhapsody_pipeline/ # (Optional) External BD Rhapsody outputs
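Because most sub-directories are optional, it can help to list which ones a given run actually produced. The sketch below does this for the directory names shown above; `summarise_outputs` is a hypothetical helper, and the temporary directory merely stands in for a real results folder.

```shell
#!/usr/bin/env bash
# Report which of the documented output sub-directories exist.
summarise_outputs() {
  local outdir=$1 d
  for d in pipeline_info fastqc mapping_STARsolo mapping_alevin summary_results \
           cellbender saturation gene_ext rRNA_mtDNA kraken \
           CellRanger_pipeline ParseBio_pipeline BDrhapsody_pipeline; do
    if [ -d "$outdir/$d" ]; then
      echo "present  $d"
    else
      echo "absent   $d"
    fi
  done
}

# Example with a throwaway directory standing in for a real results folder.
demo=$(mktemp -d)
mkdir -p "$demo/fastqc" "$demo/mapping_STARsolo"
summarise_outputs "$demo"
```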
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
