Methods
This document provides an overview of nimble's alignment method, detailing how sequencing reads are processed, filtered, and assigned to features.
The nimble alignment pipeline efficiently maps sequencing reads to a reference library while enforcing multiple layers of filtration. The core method leverages pseudoalignment to assign features to reads, with additional filters to improve specificity and accuracy.
nimble accepts single-end or paired-end sequencing reads in FASTQ.GZ or BAM formats. The processing pipeline is optimized based on the input type:
- FASTQ.GZ Mode: Reads are loaded into memory and aligned in bulk.
- BAM Mode: Reads are processed in UMI-based batches, aligning multiple reads per UMI simultaneously.
For BAM inputs, an additional preprocessing step sorts the data by UMI and cell barcode using samtools.
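The BAM-mode batching can be sketched as follows. This is a minimal illustration, assuming reads have already been reduced to (cell barcode, UMI, sequence) records; the `Read` type and `batch_by_umi` helper are hypothetical names, not part of nimble. The in-memory sort stands in for the samtools sorting step.

```python
from itertools import groupby
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical record type for this sketch only.
    cell_barcode: str
    umi: str
    sequence: str


def batch_key(read):
    return (read.cell_barcode, read.umi)


def batch_by_umi(reads):
    """Yield ((cell_barcode, umi), sequences) batches from an unsorted read list."""
    # groupby only merges adjacent records, so the reads are sorted by the same key first.
    for key, group in groupby(sorted(reads, key=batch_key), key=batch_key):
        yield key, [r.sequence for r in group]


reads = [
    Read("AAAC", "TTG", "ACGTACGT"),
    Read("AAAG", "CCA", "GGGTTTAA"),
    Read("AAAC", "TTG", "ACGTACGA"),
]
batches = dict(batch_by_umi(reads))
```

Each batch can then be aligned as a unit, which is what allows per-UMI feature assignments later in the pipeline.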
The reference library defines the features against which reads are aligned. Each feature is stored in the index alongside its reverse complement, allowing nimble to provide optional filters based on alignment orientation. Reads are trimmed before alignment, ensuring they meet length and quality thresholds.
Reads are then aligned to the reference using pseudoalignment, which generates either a single feature call or an equivalence class—a set of features with the same alignment score. A read's alignments then go through several layers of filtration.
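To make the idea of an equivalence class concrete, here is a toy k-mer counting sketch. It is not nimble's actual pseudoalignment algorithm: the scoring (number of shared k-mers), the helper names, and the example sequences are all assumptions chosen for illustration. It does show how storing each feature with its reverse complement lets a call carry a strand, and how ties in score produce a set of features rather than a single call.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(features, k):
    """Map each k-mer to the (feature, strand) entries that contain it."""
    index = defaultdict(set)
    for name, seq in features.items():
        for strand, s in (("+", seq), ("-", reverse_complement(seq))):
            for kmer in kmers(s, k):
                index[kmer].add((name, strand))
    return index


def pseudoalign(read, index, k):
    """Return (equivalence class, score): every entry tied for the top k-mer hit count."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        for entry in index.get(kmer, ()):
            hits[entry] += 1
    if not hits:
        return set(), 0
    best = max(hits.values())
    return {entry for entry, n in hits.items() if n == best}, best


# Two made-up features that share a prefix and differ at the 3' end.
features = {
    "geneA": "ACGTTGCAAGGCTTAC",
    "geneB": "ACGTTGCAAGGCATCG",
}
index = build_index(features, k=5)
shared, shared_score = pseudoalign("ACGTTGCAAGG", index, k=5)  # read from the shared prefix
unique, unique_score = pseudoalign("AAGGCTTAC", index, k=5)    # read from geneA's 3' end
```

In this toy, the read from the shared prefix yields an equivalence class containing both features, while the read covering the distinguishing 3' end yields a single call.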
Before alignment, each read is subjected to several quality checks:
- Minimum Length Requirement: Reads below a defined length threshold are discarded.
- Entropy Filtering: Low-complexity sequences are removed to prevent non-informative alignments.
- Quality-Based Trimming: Reads are trimmed to improve alignment accuracy while maintaining high-confidence bases.
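The three pre-alignment checks could look roughly like the sketch below. The threshold values, the use of Shannon entropy over base composition, the 3'-end trimming rule, and the order of the checks are all assumptions for illustration; nimble's actual definitions and defaults may differ.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the base composition."""
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in Counter(seq).values())


def trim_low_quality_tail(seq, quals, min_qual):
    """Drop trailing bases whose Phred score is below min_qual."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]


def prefilter(seq, quals, min_len=8, min_entropy=1.0, min_qual=20):
    """Return the trimmed read, or None if it fails the length or entropy check."""
    seq, quals = trim_low_quality_tail(seq, quals, min_qual)
    if len(seq) < min_len:
        return None  # too short after trimming
    if shannon_entropy(seq) < min_entropy:
        return None  # low-complexity sequence, e.g. a homopolymer run
    return seq
```

For example, a 10-base homopolymer has entropy 0 and is rejected, while a read whose last two bases are low quality is kept in trimmed form as long as it stays above the minimum length.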
Once aligned, reads undergo additional filtration:
- Mismatch Filtering: Reads with excessive mismatches to reference sequences are discarded.
- Score Thresholds: Alignments must surpass defined absolute and normalized score thresholds.
- Multi-Match Filtering: Reads aligning to many unrelated features may optionally be excluded.
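As a rough sketch, the post-alignment checks amount to a chain of threshold comparisons. The parameter names and default values here are hypothetical, and normalizing the score by read length is an assumption made for this example rather than nimble's documented definition.

```python
def passes_alignment_filters(score, read_len, mismatches, n_features,
                             max_mismatches=2, min_score=25,
                             min_norm_score=0.8, max_features=5):
    """Return True if an alignment survives all post-alignment filters."""
    if mismatches > max_mismatches:
        return False  # mismatch filter
    if score < min_score:
        return False  # absolute score threshold
    if score / read_len < min_norm_score:
        return False  # normalized score threshold (assumed: score per base)
    if n_features > max_features:
        return False  # optional multi-match filter
    return True
```

A 50-base read with score 45, one mismatch, and two candidate features passes under these example thresholds; lowering the score to 30 fails the normalized threshold even though the absolute threshold is met.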
Since the reference library includes both the original sequence and its reverse complement, orientation constraints are enforced to improve specificity:
- Consistent Read-Pair Orientation: For paired-end data, the expected alignment pattern between mates is evaluated.
- Strand-Specific Filtering: Reads aligning to incorrect strand orientations are discarded based on library chemistry, controlled by the `--strand_filter` CLI parameter.
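The orientation checks can be illustrated with the sketch below. The mode names (`unstranded`, `forward_only`, `reverse_only`) are hypothetical placeholders and are not the actual values accepted by `--strand_filter`; the pair check assumes a standard paired-end library in which mates from one fragment align to opposite strands.

```python
# Hypothetical mode names for this sketch; not nimble's actual CLI values.
EXPECTED_STRAND = {"forward_only": "+", "reverse_only": "-"}


def passes_strand_filter(strand, mode="unstranded"):
    """Check a read's strand against the strand expected for the library chemistry."""
    if mode == "unstranded":
        return True
    return strand == EXPECTED_STRAND[mode]


def pair_is_consistent(read_strand, mate_strand):
    """Mates from one fragment are assumed to align to opposite strands."""
    return read_strand != mate_strand


def keep_pair(read_strand, mate_strand, mode="unstranded"):
    return pair_is_consistent(read_strand, mate_strand) and passes_strand_filter(read_strand, mode)
```

Because each feature is indexed with its reverse complement, the strand of a call is known at alignment time, which is what makes both checks possible.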
For ambiguous alignments, nimble determines the final feature call using one of several configurable resolution strategies:
- All Matches: Includes all aligned features.
- Intersection with Fallback: Filters ambiguous alignments by requiring feature overlap, with a fallback to permissive inclusion.
- Strict Intersection: Requires strong agreement between read and mate alignments; otherwise, the read is discarded.
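One plausible reading of the three strategies, expressed as set operations on the read's and mate's equivalence classes, is sketched below. The strategy names and the exact fallback behavior are assumptions for illustration; an empty result stands for a discarded read.

```python
def resolve(read_features, mate_features, strategy):
    """Combine the feature sets of a read and its mate under a resolution strategy."""
    read_features, mate_features = set(read_features), set(mate_features)
    union = read_features | mate_features
    common = read_features & mate_features
    if strategy == "all":
        return union
    if strategy == "intersection_with_fallback":
        # Prefer features both mates agree on; otherwise fall back to everything.
        return common or union
    if strategy == "strict_intersection":
        # No agreement means an empty set, i.e. the read is discarded.
        return common
    raise ValueError(f"unknown strategy: {strategy}")
```

For a read calling {A, B} and a mate calling {B, C}, the three strategies return {A, B, C}, {B}, and {B} respectively; for disjoint calls {A} and {C}, the fallback strategy returns {A, C} while strict intersection returns the empty set.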
After filtering, feature calls are aggregated for output. The results depend on the input format:
- FASTQ.GZ Mode: Produces a dataset-wide alignment file.
- BAM Mode: Outputs per-UMI feature assignments.
For single-cell RNA sequencing (scRNA-seq), nimble provides a reporting step that transforms per-UMI assignments into cell-by-feature matrices. This involves:
- UMI-Level Intersection: Identifying common feature calls across all reads within a UMI.
- Cell-Level Aggregation: Counting unique UMIs per feature to build a cell-feature expression matrix.
- Ambiguity Resolution: Filtering out features that are present in fewer than a configurable fraction of the reads within a given UMI.
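The reporting step can be sketched as below. This is an illustration under stated assumptions, not nimble's implementation: the input layout, the `min_ratio` parameter name, and the choice to report a UMI with several surviving features as one comma-joined label are all hypothetical. With `min_ratio=1.0` the ratio filter reduces to a strict intersection across the UMI's reads.

```python
from collections import Counter, defaultdict


def umi_feature_call(read_calls, min_ratio=1.0):
    """Keep features seen in at least min_ratio of a UMI's reads."""
    seen = Counter(f for call in read_calls for f in set(call))
    n_reads = len(read_calls)
    return sorted(f for f, c in seen.items() if c / n_reads >= min_ratio)


def cell_feature_matrix(umi_calls, min_ratio=1.0):
    """Count UMIs per feature call for each cell.

    umi_calls maps (cell_barcode, umi) -> list of per-read feature sets.
    """
    matrix = defaultdict(Counter)
    for (cell, _umi), read_calls in umi_calls.items():
        features = umi_feature_call(read_calls, min_ratio)
        if features:  # UMIs with no surviving feature contribute nothing
            matrix[cell][",".join(features)] += 1
    return matrix


umi_calls = {
    ("cell1", "U1"): [{"A", "B"}, {"A"}, {"A"}],
    ("cell1", "U2"): [{"A"}],
    ("cell1", "U3"): [{"B"}, {"C"}],
    ("cell2", "U1"): [{"B"}, {"B", "C"}],
}
strict = cell_feature_matrix(umi_calls, min_ratio=1.0)
relaxed = cell_feature_matrix(umi_calls, min_ratio=0.5)
```

Under the strict setting, UMI `U3` in `cell1` has no feature shared by all of its reads and is dropped; relaxing the ratio to 0.5 keeps it as an ambiguous `B,C` call.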
This matrix is structured for integration with downstream tools like Seurat.