scTagger matches the barcodes of short- and long-reads from single-cell RNA-seq experiments, making it possible to relate gene expression (from the short-reads) to RNA splicing (from the long-reads) at the cell level.
scTagger is available as a Conda package:
conda create -n sctagger-env -c bioconda sctagger
conda activate sctagger-env
scTagger.py -h
We provide a simple Snakefile alongside a config.yaml file that runs the three stages of scTagger as well as Cell Ranger (this assumes Cell Ranger is in your PATH).
scTagger is a single Python script containing different functions to match long-read and short-read barcodes.
The whole pipeline has three steps, each of which can be run separately:
The first step of the scTagger pipeline is to extract, from each long-read, the segment where a barcode is most likely to be found. To run this step, use the following command:
./scTagger.py extract_lr_bc -r "path/to/long/read/fastq" -o "path/to/output/file" -p "path/to/output/plots"
Arguments
- -r: Space-separated paths to reads in FASTQ
- -g: Space-separated list of the ranges where the SR adapter should be found on the LRs (Optional, Default: Detect from data)
- -z: Indicate input is gzipped (Optional, Default: Assume input is gzipped if it ends with ".gz")
- -t: Number of threads (Optional, Default: 1)
- -sa: Short-read adapter (Optional, Default: CTACACGACGCTCTTCCGATCT)
- --num-bp-after: Number of bases after the end of the SR adapter alignment to generate (Optional, Default: 20)
- -o: Path to output file
- -p: Path to plot file (Optional, Default: No plotting)
Inputs
- A list of FASTQ files of long-reads
Outputs
- A TSV file:
  - First column is the read-id
  - Second column is the best edit distance with the short-read adapter
  - Third column is the starting point on the long-read of the match with the adapter
  - Fourth column is the long-read segment that was extracted
- A plot of optimal alignment locations of the short-read adapter to the long-reads.
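The four-column output described above can be loaded with a few lines of Python. This is only a sketch: it assumes the file has no header line, that the second and third columns are always integers, and that columns are tab-separated. The names LrSegment and read_lr_segments, and the example values, are hypothetical and not part of scTagger.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class LrSegment(NamedTuple):
    """One row of the (assumed headerless) four-column TSV."""
    read_id: str
    edit_distance: int  # best edit distance with the short-read adapter
    start: int          # starting point of the adapter match on the long-read
    segment: str        # extracted long-read segment


def read_lr_segments(handle: TextIO) -> Iterator[LrSegment]:
    """Yield one LrSegment per well-formed row of the TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        read_id, dist, start, segment = row[:4]
        yield LrSegment(read_id, int(dist), int(start), segment)


# In-memory example with made-up values; a real file would be opened with open().
example = io.StringIO(
    "read_1\t0\t31\tACGTACGTACGTACGTACGT\n"
    "read_2\t2\t28\tTTGCATTGCATTGCATTGCA\n"
)
records = list(read_lr_segments(example))
```

If the output was written to a path ending in ".gz", it may need to be opened with gzip.open(path, "rt") instead of open(path).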
The second step is to extract the top short-reads barcodes that cover most of the reads.
./scTagger.py extract_sr_bc -i "path/to/bam/file" -o "path/to/output/file" -p "path/to/output/plot"
Arguments
- -i: Input file
- -o: Path to output file
- -p: Path to plot file (Optional, Default: No plotting)
- --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
- --step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
- --max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)
Input
- A BAM file of short-read data
Output
- A TSV file:
  - First column is barcodes
  - Second column is the number of appearances of the barcode
- A cumulative plot of SR coverage with batches of 1,000 barcodes
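The interplay of --thresh, --step-size and --max-barcode-cnt can be sketched as follows. This is an illustration, not scTagger's actual code. It assumes that barcodes are ranked by count and processed in batches of --step-size, and that selection stops once a batch accounts for less than --thresh of all reads or once --max-barcode-cnt barcodes have been kept. Treating --thresh as a fraction rather than a percentage is also an assumption. The function name select_top_barcodes and the example counts are hypothetical.

```python
from typing import Dict, List, Tuple


def select_top_barcodes(
    counts: Dict[str, int],
    thresh: float = 0.005,
    step_size: int = 1000,
    max_barcode_cnt: int = 25000,
) -> List[Tuple[str, int]]:
    """Keep the most frequent barcodes, one batch at a time.

    Assumed rule: a batch is kept only if its reads make up at least
    `thresh` (as a fraction) of all reads; the first batch that falls
    below the threshold ends the selection.
    """
    total = sum(counts.values())
    kept: List[Tuple[str, int]] = []
    if total == 0:
        return kept
    # Rank barcodes from most to least frequent.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for i in range(0, min(len(ranked), max_barcode_cnt), step_size):
        # Trim the batch so the cap on kept barcodes is never exceeded.
        batch = ranked[i:i + step_size][: max_barcode_cnt - len(kept)]
        if sum(c for _, c in batch) / total < thresh:
            break
        kept.extend(batch)
    return kept


# Toy example with made-up counts and a tiny step size.
toy_counts = {"A": 50, "B": 30, "C": 15, "D": 4, "E": 1}
top = select_top_barcodes(toy_counts, thresh=0.1, step_size=2)
capped = select_top_barcodes(toy_counts, thresh=0.1, step_size=2, max_barcode_cnt=3)
```

In the toy example the third batch (barcode "E") covers only 1% of the reads, so it is dropped and selection stops.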
This is an alternative to the second step that avoids using the short-reads altogether and instead builds a whitelist of cellular barcodes directly from the long-reads.
This is done by looking for exact matches of the 10x Chromium list of cellular barcodes on the long-read barcode segments.
The barcodes are sorted by frequency and the most frequent barcodes are kept using the same strategy as the extract_sr_bc module.
./scTagger.py extract_sr_bc_from_lr -i "path/to/long-read-segments" -wl "/path/to/10x-barcode-list.txt" -o "path/to/output.txt"
Arguments
- -i: Input TSV file containing the long-read segments generated by the extract_lr_bc step
- -o: Path to output file
- -wl: Path to the 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz). Accepts both .txt.gz files and .txt files.
- --thresh: Percentage threshold required per step to continue adding read barcodes (Optional, Default: 0.005)
- --step-size: Number of barcodes processed at a time and whose sum is used to check against the threshold (Optional, Default: 1000)
- --max-barcode-cnt: Max number of barcodes to keep (Optional, Default: 25000)
Input
- The output file of the extract_lr_bc step
- 10x Genomics cellular barcode whitelist (e.g. 3M-february-2018.txt.gz)
Output
- A TSV file:
  - First column is barcodes
  - Second column is the number of appearances of the barcode
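The exact-match counting that this alternative step relies on can be illustrated with a sliding window over each long-read segment. This is only a sketch under stated assumptions: barcodes are 16 bases long, each barcode is counted at most once per segment, and reverse complements are ignored. scTagger itself may differ on any of these points. The function name count_whitelist_hits and the example sequences are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def count_whitelist_hits(
    segments: Iterable[str],
    whitelist: Set[str],
    barcode_length: int = 16,
) -> Counter:
    """Count, per whitelist barcode, the segments containing it exactly."""
    hits: Counter = Counter()
    for segment in segments:
        seen = set()
        # Slide a window of barcode_length bases over the segment.
        for i in range(len(segment) - barcode_length + 1):
            kmer = segment[i:i + barcode_length]
            if kmer in whitelist:
                seen.add(kmer)
        # Count each barcode once per segment (an assumption of this sketch).
        hits.update(seen)
    return hits


# Made-up whitelist and segments for illustration.
toy_whitelist = {"ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA"}
toy_segments = [
    "GGACGTACGTACGTACGTCC",  # contains the first barcode at offset 2
    "ACGTACGTACGTACGTAA",    # contains the first barcode at offset 0
    "TTTTTTTTTTTTTTTTTTTT",  # contains no whitelist barcode
]
hits = count_whitelist_hits(toy_segments, toy_whitelist)
```

The resulting counts can then be ranked with hits.most_common() before applying the same thresholding strategy as in the short-read step.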
The last step is to match the long-read segments with the barcodes selected from the short-reads.
./scTagger.py match_trie -lr "path/to/output/extract/long-read/segment" -sr "path/to/output/extract/top/short-read" -o "path/to/output/file" -t "number of threads"
Arguments
- -lr: Long-read segments TSV file
- -sr: Short-read barcode list TSV file
- -mr: Maximum number of errors allowed for barcode matching (Optional, Default: 2)
- -m: Maximum number of GB of RAM to be used (Optional, Default: 16.0)
- -bl: Length of barcodes (Optional, Default: 16)
- -t: Number of threads to use for searching (Optional, Default: 16)
- -p: Path of plot file
- -o: Path to output file. Output file is gzipped
Inputs
- The long-read segments file produced by the extract_lr_bc step and the top barcodes file produced by the extract_sr_bc (or extract_sr_bc_from_lr) step
Outputs
- A TSV file:
  - First column is the read id
  - Second column is the minimum edit distance
  - Third column is the number of short-read barcodes that match the long-read
  - Fourth column is the long-read segment
  - Fifth column is a list of all short-read barcodes with the minimum edit distance
- A bar plot that shows the number of long-reads by the minimum edit distance of their matched barcode
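To make the output columns concrete, the matching problem can be illustrated with a brute-force search. Note that this is not the trie-based method that match_trie uses: the sketch below simply computes, by dynamic programming, the smallest edit distance between each barcode and any substring of the segment, and reports the barcodes that reach the minimum. The function names min_edit_distance_in_segment and match_segment, and the example sequences, are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def min_edit_distance_in_segment(barcode: str, segment: str) -> int:
    """Smallest edit distance between barcode and any substring of segment."""
    # prev[j]: best distance between the barcode prefix processed so far and
    # a substring of the segment ending at position j. An empty prefix
    # matches an empty substring anywhere at zero cost.
    prev = [0] * (len(segment) + 1)
    for i, b in enumerate(barcode, start=1):
        curr = [i] + [0] * len(segment)
        for j, s in enumerate(segment, start=1):
            curr[j] = min(
                prev[j - 1] + (b != s),  # match or substitution
                prev[j] + 1,             # barcode base missing from segment
                curr[j - 1] + 1,         # extra base in segment
            )
        prev = curr
    return min(prev)


def match_segment(
    segment: str, barcodes: Iterable[str], max_error: int = 2
) -> Tuple[Optional[int], List[str]]:
    """Return (minimum distance, barcodes at that distance) within max_error."""
    distances = {bc: min_edit_distance_in_segment(bc, segment) for bc in barcodes}
    if not distances:
        return None, []
    best = min(distances.values())
    if best > max_error:
        return None, []
    return best, sorted(bc for bc, d in distances.items() if d == best)


# Made-up barcodes; the segment carries the first one with a single substitution.
toy_barcodes = ["ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA"]
result = match_segment("GGACGTACGTACGAACGTCC", toy_barcodes)
```

This brute-force approach compares every segment against every barcode, so it is only practical for small examples; it is meant to clarify what the minimum edit distance and matched-barcode columns represent.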
scTagger was first accepted to RECOMB-seq 2022 and is now published in iScience:
Ghazal Ebrahimi, Baraa Orabi, Meghan Robinson, Cedric Chauve, Ryan Flannigan, and Faraz Hach. "Fast and accurate matching of cellular barcodes across short-and long-reads of single-cell RNA-seq experiments." iScience (2022). DOI:10.1016/j.isci.2022.104530
Please check the paper branch of this repository for the archived paper experiments and implementation.