From 61e525b7abf94d9d4e52faeeb9467409105121d1 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 30 Dec 2025 01:02:31 +0000
Subject: [PATCH 1/3] Initial plan

From f0b46ab377626a23732613782a81ed96c35be959 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 30 Dec 2025 01:05:56 +0000
Subject: [PATCH 2/3] Add documentation on salmon ISR vs auto mode

Co-authored-by: lhhunghimself <10456425+lhhunghimself@users.noreply.github.com>
---
 salmon_library_types.md | 135 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 135 insertions(+)
 create mode 100644 salmon_library_types.md

diff --git a/salmon_library_types.md b/salmon_library_types.md
new file mode 100644
index 0000000..1c2eb18
--- /dev/null
+++ b/salmon_library_types.md
@@ -0,0 +1,135 @@
+# Salmon Library Type: ISR vs Auto Mode
+
+## Overview
+
+Salmon and related tools (alevin, alevin-fry, piscem) use library type specifications to understand the orientation and strandedness of sequencing reads. This is critical for accurate quantification in RNA-seq experiments.
+
+## Library Type Format
+
+In salmon, a library type string combines up to three parts: relative orientation (`I`, `O`, or `M`), strandedness (`S` or `U`), and strand direction (`F` or `R`), as in `ISR`.
+
+For single-cell RNA-seq with alevin-fry/simpleaf, the commonly used parameter is `--expected-ori`, which reduces this specification to the expected mapping orientation of the biological read (`fw`, `rc`, or `both`).
+
+## ISR Mode (Inward Stranded Reverse)
+
+**ISR** stands for **Inward, Stranded, Reverse-complement**.
+
+### Characteristics:
+- **I (Inward)**: Paired-end reads face inward toward each other (most common for Illumina paired-end sequencing)
+- **S (Stranded)**: The library is strand-specific (preserves information about which DNA strand was transcribed)
+- **R (Reverse)**: The first read (R1) comes from the reverse complement of the transcript, so the second read (R2) matches the transcript's forward strand
+
+### Behavior:
+- Salmon explicitly expects reads to match the ISR orientation pattern
+- Mappings that don't match this pattern are treated as incompatible and are down-weighted or discarded (controlled by salmon's `--incompatPrior` option)
+- This is the standard for many stranded RNA-seq protocols including:
+  - Illumina TruSeq Stranded mRNA
+  - 10x Chromium single-cell RNA-seq (with alevin-fry/simpleaf this is expressed as `--expected-ori fw`, because only the biological read R2 is mapped and it aligns in the forward orientation)
+  - dUTP-based stranded protocols
+
+### Use Case:
+Use ISR when you **know** your library is stranded and follows the reverse-complement orientation. This provides:
+- ✅ More accurate quantification
+- ✅ Strand-specific information
+- ✅ Better handling of overlapping genes on opposite strands
+- ❌ Incorrect results if library prep doesn't match
+
+## Auto Mode (Automatic Detection)
+
+**Auto** (or **A**) enables automatic library type detection.
+
+### Characteristics:
+- Salmon examines a subset of reads at the beginning of quantification
+- Infers the most likely library type from the actual data
+- Falls back to the most probable orientation if uncertain
+
+### Behavior:
+- Analyzes how an initial sample of reads maps (the salmon documentation describes this as the first few thousand reads)
+- Tests different orientation patterns to find the best match
+- Reports the detected library type in the output
+- Can handle uncertain library preparations, but settles on a single library type for the whole run
+
+### Use Case:
+Use auto mode when:
+- ✅ Library type is unknown or uncertain
+- ✅ Testing new protocols or library prep methods
+- ✅ Working with data from external sources
+- ✅ You prefer robustness over strict assumptions
+- ❌ Slightly slower than explicit mode (due to detection phase)
+- ⚠️ May incorrectly detect type with low-quality or unusual data
+
+## Key Differences Summary
+
+| Aspect | ISR Mode | Auto Mode |
+|--------|----------|-----------|
+| **Speed** | Faster (no detection phase) | Slightly slower (detection overhead) |
+| **Accuracy** | Highest when library matches | Good, adapts to actual data |
+| **Robustness** | Fails if library doesn't match | Handles uncertainty |
+| **Use Case** | Known, standard protocols | Unknown or variable protocols |
+| **Error Handling** | Silent incorrect quantification | May detect unexpected patterns |
+
+## Current Usage in This Repository
+
+In `scripts/runMultiAlign.sh`, the repository uses:
+
+```bash
+--expected-ori fw
+```
+
+For single-cell RNA-seq with 10x Chromium:
+- `fw` (forward) is appropriate because the biological read (R2) is in the forward orientation
+- R1 contains the cell barcode + UMI (technical read)
+- R2 contains the actual cDNA sequence (biological read)
+
+This is functionally similar to specifying the strand orientation for the quantification step.
+
+## Recommendations
+
+### Use ISR (or fw for single-cell) when:
+1. Working with **standard 10x Chromium single-cell RNA-seq**
+2. Library prep protocol is **well-documented** and confirmed
+3. Processing **large production datasets** where speed matters
+4. Protocol has been **validated** on test samples
+
+### Use Auto mode when:
+1. **Protocol is uncertain** or from external sources
+2. **Testing new library prep methods**
+3. **Troubleshooting** quantification issues
+4. Working with **mixed samples** or unusual prep
+5. Want to **verify** the library orientation assumption
+
+## Checking Library Type
+
+After running salmon with auto mode, check the log file for a line such as the following (exact wording may vary by version):
+```
+Automatically detected most likely library type as ISR
+```
+
+This confirms what salmon detected and can inform future runs with explicit mode.
+
+## References
+
+- [Salmon Documentation - Library Type](https://salmon.readthedocs.io/en/latest/library_type.html)
+- [Alevin-fry Documentation](https://alevin-fry.readthedocs.io/)
+- [10x Genomics Library Structure](https://www.10xgenomics.com/)
+
+## Impact on This Pipeline
+
+For the `runMultiAlign.sh` script:
+- Current implementation uses `--expected-ori fw` (explicit mode)
+- This is correct for standard 10x single-cell RNA-seq
+- To enable auto-detection, change to `--expected-ori both` or omit the parameter (depending on tool version)
+
+**Example modification for auto mode:**
+```bash
+# Current (explicit):
+--expected-ori fw --unfiltered-pl
+
+# Auto mode option 1:
+--expected-ori both --unfiltered-pl
+
+# Auto mode option 2 (if supported):
+--expected-ori auto --unfiltered-pl
+```
+
+Note: The specific syntax may vary between salmon, piscem, and simpleaf versions. Always consult the tool's documentation for the version you're using.

From ad4008d992d30abdf6dd9601fb4eeae1e1a91f9f Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 30 Dec 2025 01:06:53 +0000
Subject: [PATCH 3/3] Improve salmon documentation with tool-specific guidance

Co-authored-by: lhhunghimself <10456425+lhhunghimself@users.noreply.github.com>
---
 salmon_library_types.md | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/salmon_library_types.md b/salmon_library_types.md
index 1c2eb18..69d6483 100644
--- a/salmon_library_types.md
+++ b/salmon_library_types.md
@@ -113,23 +113,43 @@ This confirms what salmon detected and can inform future runs with explicit mode
 - [Alevin-fry Documentation](https://alevin-fry.readthedocs.io/)
 - [10x Genomics Library Structure](https://www.10xgenomics.com/)
 
-## Impact on This Pipeline
+## Implementation in runMultiAlign.sh
 
 For the `runMultiAlign.sh` script:
 - Current implementation uses `--expected-ori fw` (explicit mode)
 - This is correct for standard 10x single-cell RNA-seq
-- To enable auto-detection, change to `--expected-ori both` or omit the parameter (depending on tool version)
+- To enable auto-detection, the syntax varies by tool
 
-**Example modification for auto mode:**
+**Important:** The specific syntax for auto-detection depends on the tool version. Always consult the documentation for your specific tool version before making changes.
+
+### Tool-Specific Auto Mode Syntax
+
+**For simpleaf (recommended for alevin-fry):**
 ```bash
 # Current (explicit):
 --expected-ori fw --unfiltered-pl
 
-# Auto mode option 1:
+# Accept either orientation (simpleaf; `both` relaxes the orientation filter, it does not detect the library type):
 --expected-ori both --unfiltered-pl
+```
+
+**For direct salmon-alevin usage:**
+```bash
+# Check your salmon version documentation
+# salmon quant: `-l A` requests automatic library type detection
+# salmon alevin: the alevin documentation uses `-l ISR` for 10x Chromium data
+```
 
-# Auto mode option 2 (if supported):
---expected-ori auto --unfiltered-pl
+**For piscem:**
+```bash
+# Piscem typically follows salmon conventions
+# Consult piscem documentation for your version
 ```
 
-Note: The specific syntax may vary between salmon, piscem, and simpleaf versions. Always consult the tool's documentation for the version you're using.
+### Verification Steps
+
+Before changing to auto mode in production:
+1. Test on a small subset of data first
+2. Check the output logs to confirm that the detected library type matches expectations
+3. Compare quantification results between explicit and auto modes
+4. Ensure the detected type is reported consistently across samples
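
Follow-up note on the verification steps: checking that the detected type matches expectations and is consistent across samples could be scripted roughly as below. This is a minimal sketch, not part of the patch. It assumes each sample directory is a `salmon quant` output that contains `lib_format_counts.json` with an `expected_format` field; the function names `detected_libtype` and `check_libtypes` are hypothetical, and simpleaf/alevin-fry outputs may not contain this file at all.

```bash
#!/usr/bin/env bash
# Sketch only: compare the library type salmon recorded per sample with an expected value.
# Assumption: every sample directory holds salmon's lib_format_counts.json.

detected_libtype() {
  # Print the value of the "expected_format" field from one lib_format_counts.json file.
  sed -n 's/.*"expected_format"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

check_libtypes() {
  # Usage: check_libtypes EXPECTED DIR...
  # Prints one status line per directory; returns non-zero if any sample is missing or differs.
  local expected="$1"; shift
  local status=0 dir file found
  for dir in "$@"; do
    file="$dir/lib_format_counts.json"
    if [ ! -f "$file" ]; then
      echo "MISSING  $dir"
      status=1
      continue
    fi
    found="$(detected_libtype "$file")"
    if [ "$found" = "$expected" ]; then
      echo "OK       $dir ($found)"
    else
      echo "MISMATCH $dir (expected $expected, found ${found:-none})"
      status=1
    fi
  done
  return "$status"
}
```

A call such as `check_libtypes ISR results/sample_*` (the `results/` path is illustrative) would then fail as soon as one sample reports a different type, which is a cheap gate to run before comparing quantification results between explicit and auto modes.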