An automated Nextflow pipeline for end-to-end identification of functional miRNA regions from BAM files using coverage analysis, machine learning, and quantification.
This pipeline runs the complete RNAsign workflow, from sorted BAM files to the identification and quantification of novel miRNAs. It works on any genome and relies solely on expression patterns, ignoring sequence-specific features that would make it species-specific. It integrates Bedtools and featureCounts, together with a set of custom scripts and models, into a reproducible, scalable workflow with automatic resource management through Docker or Singularity.
The main innovation is the miRNA prediction script, Prediction.py, which applies an ensemble classifier to embeddings generated by a GNN from transcriptional signatures. Because no sequence information is used, this approach generalizes the identification of novel short functional RNAs across species.
Before using the pipeline, remember to unzip the model files into the `res/` directory.
```
BAM files
  ↓
[1] bedtools genomecov (Total, 3', 5' BedGraph coverage)
  ↓
[2] Filter.py (extract stable regions)
  ↓
[3] Prediction.py (GNN + ML prediction)
  ↓
[4] featureCounts (quantify predicted miRNAs)
  ↓
Count matrices
```
- Automated workflow from BAM to count matrices
- Flexible execution with start/stop control at any stage
- GPU acceleration with automatic switch between CPU and GPU
- Container support (Docker/Singularity)
- Parallel processing of multiple samples (depending on system resources)
- Intermediate file caching for pipeline resumption
- Conda environment management: a separate conda environment for each tool to avoid conflicts
The combination of a GNN, an ensemble of ML classifiers, and a meta-learner yields a model with near-perfect accuracy.
- Nextflow (version 21.0 or higher)

```bash
curl -s https://get.nextflow.io | bash
```

- Container Engine (Docker or Singularity)

```bash
# Docker
docker pull mitopozzi/rnasign:latest
# OR Singularity
singularity pull docker://mitopozzi/rnasign:latest
```

- Model Files: place them in the `res/` directory
```
res/
├── best_model_TwoModels.pth
├── TwoModels_RF_classifier.joblib
├── TwoModels_XGB_classifier.joblib
├── MetaLearner_LR_TwoModels.joblib
└── scaler_TwoModels.pkl
```
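Since the models ship zipped and a missing file is a common source of prediction failures, a quick sanity check could look like the following (a throwaway sketch, not part of the pipeline; file names taken from the tree above):

```python
from pathlib import Path

REQUIRED_MODELS = [
    "best_model_TwoModels.pth",
    "TwoModels_RF_classifier.joblib",
    "TwoModels_XGB_classifier.joblib",
    "MetaLearner_LR_TwoModels.joblib",
    "scaler_TwoModels.pkl",
]

def missing_models(res_dir="res"):
    """Return the required model files that are not present in res_dir."""
    res = Path(res_dir)
    return [name for name in REQUIRED_MODELS if not (res / name).is_file()]

if missing_models():
    print("Missing model files:", missing_models())
```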
```bash
# Run complete pipeline with Docker
nextflow run main.nf -profile docker

# Run with your own BAM files
nextflow run main.nf \
    -profile docker \
    --input_bams "/path/to/bams/*.bam" \
    --output_dir "/path/to/results"
```

Edit `nextflow.config`, generate a new config file to supply with `-C`, or pass parameters on the command line:
| Parameter | Default | Description |
|---|---|---|
| `--input_bams` | `${baseDir}/bams/*.bam` | Path pattern for input BAM files |
| `--output_dir` | `${baseDir}/results` | Output directory for all results |
| Parameter | Default | Options | Description |
|---|---|---|---|
| `--start_from` | `bedtools` | `bedtools`, `aggregation`, `filter`, `prediction`, `featurecounts` | Pipeline entry point |
| `--stop_at` | `featurecounts` | Same as above | Pipeline exit point |
| `--run_featurecounts` | `true` | `true`, `false` | Enable/disable quantification |
| `--skip_bedtools` | `false` | `true`, `false` | Skip coverage generation |
| `--skip_prediction` | `false` | `true`, `false` | Skip ML prediction |
| `--skip_filter` | `false` | `true`, `false` | Skip stable region filtering |
```bash
# Standard (local conda environments)
nextflow run main.nf -profile standard
# Docker
nextflow run main.nf -profile docker
# Singularity
nextflow run main.nf -profile singularity
# Test run, uses the test BAM file tests/MitoTEST1.bam
nextflow run main.nf -profile test,docker
```
Generates three coverage tracks per sample:
- Total coverage
- 3' coverage
- 5' coverage
Conda Environment: bedtools_env
Outputs: results/aggregated/{sample_id}/{sample_id}_aggregated.csv
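Stage [1] amounts to three `bedtools genomecov` calls per sample; a sketch of what these likely look like is shown below. The exact flags are an assumption (`-bga` emits BedGraph including zero-coverage runs; `-3`/`-5` report coverage of read 3'/5' ends), and the commands are printed as a dry run so the snippet does not require bedtools:

```shell
BAM=sample1.bam
# Total coverage first (empty mode), then 3'-end and 5'-end coverage.
# Remove the leading "echo" to actually run the commands.
for mode in "" "-3" "-5"; do
  echo bedtools genomecov -ibam "$BAM" -bga $mode
done
```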
Extracts stable coverage regions using Filter.py with default parameters:
- Minimum sequence length: 70 nt
- Padding size: 50 nt
- Minimum coverage: 100
Conda Environment: python_env_gpu (used for both CPU and GPU runs)
Outputs: results/filtered/{sample_id}/{sample_id}_filtered.csv
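The filtering step can be pictured as a scan over per-base coverage. The sketch below is a hypothetical reimplementation of the thresholds listed above (length ≥ 70 nt, coverage ≥ 100, 50 nt padding), not the actual Filter.py code:

```python
MIN_LEN = 70    # minimum region length (nt)
PADDING = 50    # bases added on each side of a kept region
MIN_COV = 100   # minimum per-base coverage

def stable_regions(coverage, min_len=MIN_LEN, min_cov=MIN_COV, padding=PADDING):
    """Return padded (start, end) intervals where coverage stays >= min_cov
    for at least min_len consecutive bases."""
    regions, start = [], None
    for pos, cov in enumerate(coverage):
        if cov >= min_cov:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_len:
                regions.append((max(0, start - padding),
                                min(len(coverage), pos + padding)))
            start = None
    if start is not None and len(coverage) - start >= min_len:
        regions.append((max(0, start - padding), len(coverage)))
    return regions

# Example: a 100 nt high-coverage block inside low background
cov = [5] * 100 + [150] * 100 + [5] * 100
print(stable_regions(cov))  # [(50, 250)]
```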
Applies GNN + ensemble learning using Prediction.py:
- Model type: TM (Two Models - RF + XGBoost + meta-learner)
- GPU acceleration if available
Conda Environment: python_env_cpu or python_env_gpu
Outputs: results/prediction_predictions/{sample_id}/
- `PRED_*_Predictions_TM.csv`: detailed predictions
- `PRED_*_Predictions_TM_Summary.csv`: summary
- `PRED_*_ClustAnnotation.saf`: clustered miRNA regions
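The "Two Models" scheme described above is a stacking ensemble: two base classifiers emit probabilities that a logistic-regression meta-learner combines. The sketch below illustrates the idea on random data; it is not the pipeline's code, which loads pre-trained models from `res/` and classifies GNN embeddings, and it substitutes scikit-learn's GradientBoostingClassifier for XGBoost to stay dependency-light:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))            # stand-in for GNN embeddings
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)  # XGBoost stand-in

# Meta-features: each base model's predicted probability of the positive class
meta_tr = np.column_stack([rf.predict_proba(X_tr)[:, 1], gb.predict_proba(X_tr)[:, 1]])
meta_te = np.column_stack([rf.predict_proba(X_te)[:, 1], gb.predict_proba(X_te)[:, 1]])
meta = LogisticRegression().fit(meta_tr, y_tr)
print("stacked accuracy:", meta.score(meta_te, y_te))
```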
Counts reads mapping to predicted miRNA regions using featureCounts.
Conda Environment: featurecounts_env
Outputs: results/featurecounts/
- `{sample_id}_featurecounts.txt`: count matrix
- `{sample_id}_featurecounts.txt.summary`: mapping statistics
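The `.saf` annotation consumed here is featureCounts' simple SAF format: five tab-separated columns (GeneID, Chr, Start, End, Strand) preceded by a header line. A minimal sketch of writing one (the cluster names and coordinates are invented for illustration):

```python
import csv
import io

# Hypothetical predicted miRNA clusters
regions = [
    ("miR_cluster_1", "chr1", 1050, 1120, "+"),
    ("miR_cluster_2", "chrM", 300, 375, "-"),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["GeneID", "Chr", "Start", "End", "Strand"])  # SAF header
writer.writerows(regions)
saf_text = buf.getvalue()
print(saf_text)
```

featureCounts reads such a file when invoked with `-F SAF -a <file>.saf`.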
```bash
# Run only the prediction stage
nextflow run main.nf \
    -profile docker \
    --start_from prediction \
    --stop_at prediction \
    --filtered_file "results/filtered/sample1/sample1_filtered.csv"

# Resume from quantification
nextflow run main.nf \
    -profile docker \
    --start_from featurecounts \
    --prediction_output "results/prediction/*/PRED_*_ClustAnnotation.saf"
```

The pipeline automatically processes all BAM files matching the input pattern in parallel:
```bash
nextflow run main.nf \
    -profile docker \
    --input_bams "/data/rnaseq/samples/*.bam" \
    --output_dir "/data/rnaseq/results"
```

The pipeline automatically detects GPU availability:
- If an NVIDIA GPU is detected: uses `python_env_gpu`
- Otherwise: uses `python_env_cpu`
Force CPU mode:

```bash
NXF_OPTS='-Xms1g -Xmx4g' nextflow run main.nf -profile docker
```

Nextflow automatically caches completed tasks:

```bash
nextflow run main.nf -profile docker -resume
```

```
results/
├── bedtools/
│   └── sample1/
│       ├── sample1.Total.csv
│       ├── sample1.3prime.csv
│       └── sample1.5prime.csv
├── aggregated/
│   └── sample1/
│       └── sample1_aggregated.csv
├── filtered/
│   └── sample1/
│       └── sample1_filtered.csv
├── prediction_predictions/
│   └── sample1/
│       ├── PRED_sample1_filtered_Predictions_TM.csv
│       ├── PRED_sample1_filtered_Predictions_TM_Summary.csv
│       └── PRED_sample1_filtered_ClustAnnotation.saf
└── featurecounts/
    ├── sample1_featurecounts.txt
    └── sample1_featurecounts.txt.summary
```
Example timings for one sample (a ~10 MB sorted BAM file) run on a desktop with 64 GB RAM, a ~4 GHz CPU, and a 12 GB GPU:
| Stage | Time (CPU) | Time (GPU) |
|---|---|---|
| Bedtools | --- | --- |
| Filter | --- | --- |
| Prediction | --- | --- |
| FeatureCounts | --- | --- |
| Total | --- | --- |
Docker image: https://hub.docker.com/r/mitopozzi/rnasign
The Docker image includes pre-configured conda environments:
- `bedtools_env`: bedtools (coverage generation)
- `featurecounts_env`: subread (featureCounts quantification)
- `python_env_cpu`: Python 3.8+, PyTorch (CPU), torch-geometric, scikit-learn, XGBoost, pandas, numpy, tqdm
- `python_env_gpu`: same as the CPU environment, but with PyTorch built with CUDA support
"Process run_prediction_script failed"
- Check that model files exist in
res/directory - Verify model file permissions
"No bam_file found"
- Verify
--input_bamspath pattern - Ensure BAM files have
.bamextension - Check file permissions
"Conda environment not found"
- Ensure using Docker/Singularity profile
- If using standard profile, manually create conda environments
- Check Docker image is properly pulled
Out of memory
- Reduce number of parallel samples
GPU not detected
- Check nvidia-smi output
- Verify CUDA drivers installed
- Ensure Docker has GPU access (--gpus flag)
Missing output files
- Check `results/` directory permissions
- Verify `publishDir` paths in main.nf
- Review `.nextflow.log` for errors
If you use this pipeline in your research, please cite:

