Mitopozzi/RNAsign

RNAsign Nextflow Pipeline

An automated Nextflow pipeline for end-to-end identification of functional miRNA regions from BAM files using coverage analysis, machine learning, and quantification.

Overview

This pipeline runs the complete RNAsign workflow, from sorted BAM files to the identification and quantification of new miRNAs. It works on all types of genomes because it relies solely on expression patterns and ignores sequence-specific features that would make it species-specific. It integrates Bedtools and featureCounts with a new set of custom scripts and models into a reproducible, scalable workflow with automatic resource management through Docker or Singularity.

The main innovation is the miRNA prediction script Prediction.py, which applies an ensemble classifier to the embedding generated by a GNN from the transcriptional signatures. Because no sequence information is used, this approach helps generalize the identification of novel short functional RNAs across species.

Before running the pipeline, remember to unzip the model files in the res/ directory.

Pipeline Architecture

BAM files
    ↓
[1] bedtools genomecov (Total, 3', 5' BedGraph coverage)
    ↓
[2] Filter.py (Extract stable regions)
    ↓
[3] Prediction.py (GNN + ML prediction)
    ↓
[4] featureCounts (Quantify predicted miRNAs)
    ↓
Count matrices

Features

  • Automated workflow from BAM to count matrices
  • Flexible execution with start/stop control at any stage
  • GPU acceleration with automatic switch between CPU and GPU
  • Container support (Docker/Singularity)
  • Parallel processing of multiple samples (depending on system resources)
  • Intermediate file caching for pipeline resumption
  • Conda environment management: a separate conda environment for each tool to avoid conflicts

The combination of a GNN, an ensemble of ML classifiers, and a meta-learner produces a model with almost perfect accuracy:

[Figure: GNN attention example]


Quick Start

Prerequisites

  1. Nextflow (version 21.0 or higher)
curl -s https://get.nextflow.io | bash
  2. Container Engine (Docker or Singularity)
# Docker
docker pull mitopozzi/rnasign:latest

# OR Singularity
singularity pull docker://mitopozzi/rnasign:latest
  3. Model Files - Place in res/ directory
res/
├── best_model_TwoModels.pth
├── TwoModels_RF_classifier.joblib
├── TwoModels_XGB_classifier.joblib
├── MetaLearner_LR_TwoModels.joblib
└── scaler_TwoModels.pkl
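
Since missing model files are a common cause of a failed prediction step, a quick pre-flight check can help. The following is a minimal sketch, not part of the pipeline: the helper name `missing_model_files` is hypothetical, and only the file names are taken from the listing above.

```python
from pathlib import Path

# File names copied from the res/ listing above.
REQUIRED_MODEL_FILES = [
    "best_model_TwoModels.pth",
    "TwoModels_RF_classifier.joblib",
    "TwoModels_XGB_classifier.joblib",
    "MetaLearner_LR_TwoModels.joblib",
    "scaler_TwoModels.pkl",
]


def missing_model_files(res_dir="res"):
    """Return the required model files that are absent from res_dir.

    Hypothetical helper for illustration; the pipeline itself may
    check (or not check) these files in a different way.
    """
    res = Path(res_dir)
    return [name for name in REQUIRED_MODEL_FILES if not (res / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files found in res/")
```

Run it from the repository root after unzipping the models; an empty result means all five files are in place.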

Basic Usage

# Run complete pipeline with Docker
nextflow run main.nf -profile docker

# Run with your own BAM files
nextflow run main.nf \
    -profile docker \
    --input_bams "/path/to/bams/*.bam" \
    --output_dir "/path/to/results"

Configuration

Pipeline Parameters

Edit nextflow.config, supply a new config file with -C, or pass parameters on the command line:

Input/Output Parameters

| Parameter | Default | Description |
|---|---|---|
| --input_bams | ${baseDir}/bams/*.bam | Path pattern for input BAM files |
| --output_dir | ${baseDir}/results | Output directory for all results |

Pipeline Control Parameters

| Parameter | Default | Options | Description |
|---|---|---|---|
| --start_from | bedtools | bedtools, aggregation, filter, prediction, featurecounts | Pipeline entry point |
| --stop_at | featurecounts | Same as --start_from | Pipeline exit point |
| --run_featurecounts | true | true, false | Enable/disable quantification |
| --skip_bedtools | false | true, false | Skip coverage generation |
| --skip_prediction | false | true, false | Skip ML prediction |
| --skip_filter | false | true, false | Skip stable region filtering |
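
To make the interaction of --start_from and --stop_at concrete, here is a small sketch of how a stage window could be resolved. It assumes the stages run in the order in which the options are listed above; the function `stages_to_run` is hypothetical, ignores the --skip_* and --run_featurecounts flags, and is not taken from main.nf.

```python
# Stage names as listed for --start_from; the order is an assumption.
STAGES = ["bedtools", "aggregation", "filter", "prediction", "featurecounts"]


def stages_to_run(start_from="bedtools", stop_at="featurecounts"):
    """Return the stages between start_from and stop_at, inclusive."""
    for name in (start_from, stop_at):
        if name not in STAGES:
            raise ValueError(f"unknown stage: {name!r}")
    first, last = STAGES.index(start_from), STAGES.index(stop_at)
    if first > last:
        raise ValueError("start_from must not come after stop_at")
    return STAGES[first:last + 1]


if __name__ == "__main__":
    print(stages_to_run("filter", "prediction"))
```

With the defaults the whole list is returned, while identical start and stop values select a single stage.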

Execution Profiles

Standard Profile (Local, no containers)

nextflow run main.nf -profile standard

Docker Profile (Recommended)

nextflow run main.nf -profile docker

Singularity Profile

nextflow run main.nf -profile singularity

Test Profile

nextflow run main.nf -profile test,docker

Uses test BAM file: tests/MitoTEST1.bam


Pipeline Stages

Stage 1: Bedtools Coverage Generation

Generates three coverage tracks per sample:

  • Total coverage
  • 3' coverage
  • 5' coverage

Conda Environment: bedtools_env

Outputs: per-track coverage files in results/bedtools/{sample_id}/, combined by the aggregation step into results/aggregated/{sample_id}/{sample_id}_aggregated.csv
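
The three tracks can be illustrated with the corresponding bedtools genomecov invocations. This is a sketch under assumptions: `-ibam`, `-bg`, `-3`, and `-5` are standard genomecov options, but the exact flags used inside the pipeline are not documented here, and the helper `genomecov_commands` is hypothetical.

```python
import shlex


def genomecov_commands(bam_path):
    """Build one bedtools genomecov command per coverage track.

    The track names follow the file names in the output directory tree;
    the flag selection is an assumption, not the pipeline's actual call.
    """
    base = ["bedtools", "genomecov", "-ibam", bam_path, "-bg"]
    return {
        "Total": base,            # coverage over the full read interval
        "3prime": base + ["-3"],  # coverage of 3' end positions only
        "5prime": base + ["-5"],  # coverage of 5' end positions only
    }


if __name__ == "__main__":
    for track, cmd in genomecov_commands("sample1.bam").items():
        print(track, "->", shlex.join(cmd))
```

Each command writes BedGraph to standard output, so in practice the output would be redirected to one file per track.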

Stage 3: Stable Region Filtering

Extracts stable coverage regions using Filter.py with default parameters:

  • Minimum sequence length: 70 nt
  • Padding size: 50 nt
  • Minimum coverage: 100

Conda Environment: python_env_gpu (works on both CPU and GPU)

Outputs: results/filtered/{sample_id}/{sample_id}_filtered.csv
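
To show how the three defaults could interact, the sketch below extracts candidate regions from a per-base coverage vector. It is an illustration only: it assumes "stable" means a contiguous run at or above the minimum coverage, which may differ from what Filter.py actually does, and `candidate_regions` is a hypothetical name.

```python
def candidate_regions(coverage, min_len=70, pad=50, min_cov=100):
    """Return padded (start, end) half-open intervals of well-covered runs.

    coverage: list of per-base depths for one contig (index = 0-based position).
    A run qualifies if every base has depth >= min_cov and the run spans
    at least min_len bases; it is then extended by pad bases on each side,
    clipped to the contig boundaries.
    """
    regions = []
    start = None
    # A trailing 0 acts as a sentinel that closes a run reaching the contig end.
    for pos, depth in enumerate(list(coverage) + [0]):
        if depth >= min_cov:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_len:
                regions.append((max(0, start - pad), min(len(coverage), pos + pad)))
            start = None
    return regions


if __name__ == "__main__":
    depth = [0] * 100 + [150] * 80 + [0] * 100
    print(candidate_regions(depth))
```

In this toy input the 80-base run passes the 70 nt length filter and is padded by 50 nt on both sides; a 60-base run would be discarded.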

Stage 4: ML Prediction

Applies GNN + ensemble learning using Prediction.py:

  • Model type: TM (Two Models - RF + XGBoost + meta-learner)
  • GPU acceleration if available

Conda Environment: python_env_cpu or python_env_gpu

Outputs: results/prediction_predictions/{sample_id}/

  • PRED_*_Predictions_TM.csv - Detailed predictions
  • PRED_*_Predictions_TM_Summary.csv - Summary
  • PRED_*_ClustAnnotation.saf - Clustered miRNA regions
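
The "RF + XGBoost + meta-learner" arrangement is a stacking ensemble: two base classifiers each emit a probability, and a meta-learner combines them into the final score. The sketch below illustrates that idea in plain Python. It assumes the meta-learner is a logistic regression over the two base probabilities (suggested by the file name MetaLearner_LR_TwoModels.joblib but not confirmed), and the weights and bias are invented for illustration, not the trained values.

```python
import math


def meta_learner_probability(p_rf, p_xgb, weights=(2.0, 2.0), bias=-2.0):
    """Combine two base-model probabilities with a logistic function.

    p_rf, p_xgb: miRNA probabilities from the random forest and XGBoost models.
    weights, bias: hypothetical coefficients; the real values live in the
    trained meta-learner file and are not reproduced here.
    """
    z = weights[0] * p_rf + weights[1] * p_xgb + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for p_rf, p_xgb in [(0.9, 0.8), (0.5, 0.5), (0.1, 0.2)]:
        print(p_rf, p_xgb, round(meta_learner_probability(p_rf, p_xgb), 3))
```

In the real pipeline the inputs to the base classifiers are the GNN embeddings, and all coefficients come from the files in res/.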

Stage 5: Feature Quantification

Counts reads mapping to predicted miRNA regions using featureCounts.

Conda Environment: featurecounts_env

Outputs: results/featurecounts/

  • {sample_id}_featurecounts.txt - Count matrix
  • {sample_id}_featurecounts.txt.summary - Mapping statistics
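
For reference, quantifying against a SAF annotation with featureCounts can be sketched as follows. The options `-F SAF`, `-a`, and `-o` are standard featureCounts flags; any further options the pipeline passes (threads, strandedness, multi-mapping) are unknown, and `featurecounts_command` is a hypothetical helper.

```python
def featurecounts_command(saf_path, bam_path, sample_id,
                          out_dir="results/featurecounts"):
    """Build a featureCounts call for one sample and one SAF annotation.

    The output file name follows the pattern listed above; the flag
    selection is a minimal assumption, not the pipeline's exact command.
    """
    out_file = f"{out_dir}/{sample_id}_featurecounts.txt"
    return [
        "featureCounts",
        "-F", "SAF",      # annotation is in SAF format rather than GTF
        "-a", saf_path,   # predicted miRNA regions from the prediction stage
        "-o", out_file,   # count table; featureCounts adds a .summary file
        bam_path,
    ]


if __name__ == "__main__":
    print(" ".join(featurecounts_command(
        "PRED_sample1_filtered_ClustAnnotation.saf", "sample1.bam", "sample1")))
```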

Advanced Usage

Running Partial Pipelines

Run only prediction stage

nextflow run main.nf \
    -profile docker \
    --start_from prediction \
    --stop_at prediction \
    --filtered_file "results/filtered/sample1/sample1_filtered.csv"

Skip prediction, run quantification only

nextflow run main.nf \
    -profile docker \
    --start_from featurecounts \
    --prediction_output "results/prediction_predictions/*/PRED_*_ClustAnnotation.saf"

Multiple Sample Processing

The pipeline automatically processes all BAM files matching the input pattern in parallel:

nextflow run main.nf \
    -profile docker \
    --input_bams "/data/rnaseq/samples/*.bam" \
    --output_dir "/data/rnaseq/results"
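
Before launching a large run, it can be useful to preview which files a glob pattern matches. The sketch below assumes the sample ID is the BAM file name without its extension, which matches the examples in this README but is not confirmed against main.nf; `discover_samples` is a hypothetical helper.

```python
import glob
from pathlib import Path


def discover_samples(pattern):
    """Map sample IDs to BAM paths for every file matching pattern.

    The sample ID is assumed to be the file name without the .bam suffix.
    """
    return {Path(path).stem: path for path in sorted(glob.glob(pattern))}


if __name__ == "__main__":
    for sample_id, path in discover_samples("/data/rnaseq/samples/*.bam").items():
        print(sample_id, path)
```

An empty result here usually points to the same problem as a "No bam_file found" error: the pattern does not match any file.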

GPU Acceleration

The pipeline automatically detects GPU availability:

  • If NVIDIA GPU detected: uses python_env_gpu
  • Otherwise: uses python_env_cpu
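
The switch described above can be sketched as a simple check for the NVIDIA tooling. How the pipeline actually detects a GPU is not documented here; the version below assumes that finding nvidia-smi on the PATH is a sufficient signal, and `select_python_env` is a hypothetical name.

```python
import shutil


def select_python_env(which=shutil.which):
    """Pick the conda environment name based on nvidia-smi availability.

    which: lookup function, injectable so the choice can be tested
    without a GPU. The detection rule itself is an assumption.
    """
    return "python_env_gpu" if which("nvidia-smi") else "python_env_cpu"


if __name__ == "__main__":
    print("Selected environment:", select_python_env())
```

A stricter check could also run nvidia-smi and inspect its exit code, since the binary can be present on machines where no device is usable.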

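Limit the memory of the Nextflow JVM (NXF_OPTS sets Java VM options for Nextflow itself and does not, on its own, switch between CPU and GPU):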

NXF_OPTS='-Xms1g -Xmx4g' nextflow run main.nf -profile docker

Resume Failed Runs

Nextflow automatically caches completed tasks:

nextflow run main.nf -profile docker -resume

Output Directory Structure (WIP)

results/
├── bedtools/
│   └── sample1/
│       ├── sample1.Total.csv
│       ├── sample1.3prime.csv
│       └── sample1.5prime.csv
├── aggregated/
│   └── sample1/
│       └── sample1_aggregated.csv
├── filtered/
│   └── sample1/
│       └── sample1_filtered.csv
├── prediction_predictions/
│   └── sample1/
│       ├── PRED_sample1_filtered_Predictions_TM.csv
│       ├── PRED_sample1_filtered_Predictions_TM_Summary.csv
│       └── PRED_sample1_filtered_ClustAnnotation.saf
└── featurecounts/
    ├── sample1_featurecounts.txt
    └── sample1_featurecounts.txt.summary
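
When outputs appear to be missing, the tree above can be turned into a checklist. This sketch assumes the layout shown (which is marked as work in progress and may change); the helpers `expected_outputs` and `missing_outputs` are hypothetical.

```python
from pathlib import Path


def expected_outputs(sample_id, results_dir="results"):
    """List the per-sample files shown in the directory tree above."""
    root = Path(results_dir)
    pred = f"PRED_{sample_id}_filtered"
    return [
        root / "bedtools" / sample_id / f"{sample_id}.Total.csv",
        root / "bedtools" / sample_id / f"{sample_id}.3prime.csv",
        root / "bedtools" / sample_id / f"{sample_id}.5prime.csv",
        root / "aggregated" / sample_id / f"{sample_id}_aggregated.csv",
        root / "filtered" / sample_id / f"{sample_id}_filtered.csv",
        root / "prediction_predictions" / sample_id / f"{pred}_Predictions_TM.csv",
        root / "prediction_predictions" / sample_id / f"{pred}_Predictions_TM_Summary.csv",
        root / "prediction_predictions" / sample_id / f"{pred}_ClustAnnotation.saf",
        root / "featurecounts" / f"{sample_id}_featurecounts.txt",
        root / "featurecounts" / f"{sample_id}_featurecounts.txt.summary",
    ]


def missing_outputs(sample_id, results_dir="results"):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(sample_id, results_dir) if not p.exists()]


if __name__ == "__main__":
    for path in missing_outputs("sample1"):
        print("missing:", path)
```

The first missing file in the list indicates the stage at which a run stopped.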

Processing Time

Example for one sample (a ~10 MB sorted BAM file) run on a desktop with 64 GB RAM, a ~4 GHz CPU, and a 12 GB GPU:

| Stage | Time (CPU) | Time (GPU) |
|---|---|---|
| Bedtools | --- | --- |
| Filter | --- | --- |
| Prediction | --- | --- |
| FeatureCounts | --- | --- |
| Total | --- | --- |

Conda Environments

Docker image: https://hub.docker.com/r/mitopozzi/rnasign

The Docker image includes pre-configured conda environments:

bedtools_env

  • bedtools (coverage generation)

featurecounts_env

  • subread (featureCounts quantification)

python_env_cpu

  • Python 3.8+
  • PyTorch (CPU)
  • torch-geometric
  • scikit-learn, XGBoost
  • pandas, numpy, tqdm

python_env_gpu

  • Same as CPU environment
  • PyTorch with CUDA support

Troubleshooting

"Process run_prediction_script failed"

  • Check that model files exist in res/ directory
  • Verify model file permissions

"No bam_file found"

  • Verify --input_bams path pattern
  • Ensure BAM files have .bam extension
  • Check file permissions

"Conda environment not found"

  • Ensure you are using the Docker or Singularity profile
  • If using the standard profile, create the conda environments manually
  • Check that the Docker image was pulled correctly

Out of memory

  • Reduce number of parallel samples

GPU not detected

  • Check nvidia-smi output
  • Verify that CUDA drivers are installed
  • Ensure Docker has GPU access (--gpus flag)

Missing output files

  • Check results/ directory permissions
  • Verify publishDir paths in main.nf
  • Review .nextflow.log for errors

Citation (WIP)

If you use this pipeline in your research, please cite:


About

Tool for the identification of miRNAs using neural networks
