An automated Nextflow pipeline for end-to-end identification of functional miRNA regions from BAM files using coverage analysis, machine learning, and quantification.
This pipeline runs the complete RNAsign workflow, from sorted BAM files to the identification and quantification of novel miRNAs. It works on any genome and relies solely on expression patterns, ignoring sequence-specific features that would make it species-specific. It integrates Bedtools and featureCounts, together with a set of custom scripts and models, into a reproducible, scalable workflow with automatic resource management through Docker or Singularity.
The main innovation is the miRNA prediction script, Prediction.py, which applies an ensemble classifier to embeddings generated by a GNN from transcriptional signatures. Because no sequence information is used, this approach generalizes the identification of novel short functional RNAs across species.
Before using the pipeline, remember to unzip the model files into the `res/` directory.
```
BAM files
  ↓
[1] bedtools genomecov (Total, 3', 5' BedGraph coverage)
  ↓
[2] Filter.py (extract stable regions)
  ↓
[3] Prediction.py (GNN + ML prediction)
  ↓
[4] featureCounts (quantify predicted miRNAs)
  ↓
Count matrices
```
- Automated workflow from BAM to count matrices
- Flexible execution with start/stop control at any stage
- GPU acceleration with automatic switch between CPU and GPU
- Container support (Docker/Singularity)
- Parallel processing of multiple samples (depending on system resources)
- Intermediate file caching for pipeline resumption
- Conda environment management: a separate conda environment for each tool to avoid conflicts
The combination of a GNN, an ensemble of ML classifiers, and a meta-learner yields a model with near-perfect accuracy.
- Nextflow (version 21.0 or higher)

```bash
curl -s https://get.nextflow.io | bash
```

- Container Engine (Docker or Singularity)

```bash
# Docker
docker pull mitopozzi/rnasign:latest
# OR Singularity
singularity pull docker://mitopozzi/rnasign:latest
```

- Model Files: place them in the `res/` directory
```
res/
├── best_model_TwoModels.pth
├── TwoModels_RF_classifier.joblib
├── TwoModels_XGB_classifier.joblib
├── MetaLearner_LR_TwoModels.joblib
└── scaler_TwoModels.pkl
```
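Since the models ship zipped and a missing file is a common source of prediction failures, a quick sanity check could look like the following (a throwaway sketch, not part of the pipeline; file names taken from the tree above):

```python
from pathlib import Path

REQUIRED_MODELS = [
    "best_model_TwoModels.pth",
    "TwoModels_RF_classifier.joblib",
    "TwoModels_XGB_classifier.joblib",
    "MetaLearner_LR_TwoModels.joblib",
    "scaler_TwoModels.pkl",
]

def missing_models(res_dir="res"):
    """Return the required model files that are not present in res_dir."""
    res = Path(res_dir)
    return [name for name in REQUIRED_MODELS if not (res / name).is_file()]

if missing_models():
    print("Missing model files:", missing_models())
```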
```bash
# Run complete pipeline with Docker
nextflow run main.nf -profile docker

# Run with your own BAM files
nextflow run main.nf \
    -profile docker \
    --input_bams "/path/to/bams/*.bam" \
    --output_dir "/path/to/results"
```

Edit `nextflow.config`, generate a new config file to supply with `-C`, or pass parameters on the command line:
| Parameter | Default | Description |
|---|---|---|
| `--input_bams` | `${baseDir}/bams/*.bam` | Path pattern for input BAM files |
| `--output_dir` | `${baseDir}/results` | Output directory for all results |
| Parameter | Default | Options | Description |
|---|---|---|---|
| `--start_from` | `bedtools` | `bedtools`, `aggregation`, `filter`, `prediction`, `featurecounts` | Pipeline entry point |
| `--stop_at` | `featurecounts` | Same as above | Pipeline exit point |
| `--run_featurecounts` | `true` | `true`, `false` | Enable/disable quantification |
| `--skip_bedtools` | `false` | `true`, `false` | Skip coverage generation |
| `--skip_prediction` | `false` | `true`, `false` | Skip ML prediction |
| `--skip_filter` | `false` | `true`, `false` | Skip stable region filtering |
```bash
# Standard (local conda environments)
nextflow run main.nf -profile standard
# Docker
nextflow run main.nf -profile docker
# Singularity
nextflow run main.nf -profile singularity
# Test run, uses the test BAM file tests/MitoTEST1.bam
nextflow run main.nf -profile test,docker
```
Generates three coverage tracks per sample:
- Total coverage
- 3' coverage
- 5' coverage
Conda Environment: bedtools_env
Outputs: results/aggregated/{sample_id}/{sample_id}_aggregated.csv
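Stage [1] amounts to three `bedtools genomecov` calls per sample; a sketch of what these likely look like is shown below. The exact flags are an assumption (`-bga` emits BedGraph including zero-coverage runs; `-3`/`-5` report coverage of read 3'/5' ends), and the commands are printed as a dry run so the snippet does not require bedtools:

```shell
BAM=sample1.bam
# Total coverage first (empty mode), then 3'-end and 5'-end coverage.
# Remove the leading "echo" to actually run the commands.
for mode in "" "-3" "-5"; do
  echo bedtools genomecov -ibam "$BAM" -bga $mode
done
```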
Extracts stable coverage regions using Filter.py with default parameters:
- Minimum sequence length: 70 nt
- Padding size: 50 nt
- Minimum coverage: 100
Conda Environment: python_env_gpu (used for both CPU and GPU runs)
Outputs: results/filtered/{sample_id}/{sample_id}_filtered.csv
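The filtering step can be pictured as a scan over per-base coverage. The sketch below is a hypothetical reimplementation of the thresholds listed above (length ≥ 70 nt, coverage ≥ 100, 50 nt padding), not the actual Filter.py code:

```python
MIN_LEN = 70    # minimum region length (nt)
PADDING = 50    # bases added on each side of a kept region
MIN_COV = 100   # minimum per-base coverage

def stable_regions(coverage, min_len=MIN_LEN, min_cov=MIN_COV, padding=PADDING):
    """Return padded (start, end) intervals where coverage stays >= min_cov
    for at least min_len consecutive bases."""
    regions, start = [], None
    for pos, cov in enumerate(coverage):
        if cov >= min_cov:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_len:
                regions.append((max(0, start - padding),
                                min(len(coverage), pos + padding)))
            start = None
    if start is not None and len(coverage) - start >= min_len:
        regions.append((max(0, start - padding), len(coverage)))
    return regions

# Example: a 100 nt high-coverage block inside low background
cov = [5] * 100 + [150] * 100 + [5] * 100
print(stable_regions(cov))  # [(50, 250)]
```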
Applies GNN + ensemble learning using Prediction.py:
- Model type: TM (Two Models - RF + XGBoost + meta-learner)
- GPU acceleration if available
Conda Environment: python_env_cpu or python_env_gpu
Outputs: results/prediction_predictions/{sample_id}/
- `PRED_*_Predictions_TM.csv`: detailed predictions
- `PRED_*_Predictions_TM_Summary.csv`: summary
- `PRED_*_ClustAnnotation.saf`: clustered miRNA regions
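The "Two Models" scheme described above is a stacking ensemble: two base classifiers emit probabilities that a logistic-regression meta-learner combines. The sketch below illustrates the idea on random data; it is not the pipeline's code, which loads pre-trained models from `res/` and classifies GNN embeddings, and it substitutes scikit-learn's GradientBoostingClassifier for XGBoost to stay dependency-light:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))            # stand-in for GNN embeddings
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)  # XGBoost stand-in

# Meta-features: each base model's predicted probability of the positive class
meta_tr = np.column_stack([rf.predict_proba(X_tr)[:, 1], gb.predict_proba(X_tr)[:, 1]])
meta_te = np.column_stack([rf.predict_proba(X_te)[:, 1], gb.predict_proba(X_te)[:, 1]])
meta = LogisticRegression().fit(meta_tr, y_tr)
print("stacked accuracy:", meta.score(meta_te, y_te))
```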
Counts reads mapping to predicted miRNA regions using featureCounts.
Conda Environment: featurecounts_env
Outputs: results/featurecounts/
- `{sample_id}_featurecounts.txt`: count matrix
- `{sample_id}_featurecounts.txt.summary`: mapping statistics
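The `.saf` annotation consumed here is featureCounts' simple SAF format: five tab-separated columns (GeneID, Chr, Start, End, Strand) preceded by a header line. A minimal sketch of writing one (the cluster names and coordinates are invented for illustration):

```python
import csv
import io

# Hypothetical predicted miRNA clusters
regions = [
    ("miR_cluster_1", "chr1", 1050, 1120, "+"),
    ("miR_cluster_2", "chrM", 300, 375, "-"),
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["GeneID", "Chr", "Start", "End", "Strand"])  # SAF header
writer.writerows(regions)
saf_text = buf.getvalue()
print(saf_text)
```

featureCounts reads such a file when invoked with `-F SAF -a <file>.saf`.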
```bash
# Run only the prediction stage
nextflow run main.nf \
    -profile docker \
    --start_from prediction \
    --stop_at prediction \
    --filtered_file "results/filtered/sample1/sample1_filtered.csv"

# Resume from quantification
nextflow run main.nf \
    -profile docker \
    --start_from featurecounts \
    --prediction_output "results/prediction/*/PRED_*_ClustAnnotation.saf"
```

The pipeline automatically processes all BAM files matching the input pattern in parallel:
```bash
nextflow run main.nf \
    -profile docker \
    --input_bams "/data/rnaseq/samples/*.bam" \
    --output_dir "/data/rnaseq/results"
```

The pipeline automatically detects GPU availability:
- If an NVIDIA GPU is detected: uses `python_env_gpu`
- Otherwise: uses `python_env_cpu`
Force CPU mode:

```bash
NXF_OPTS='-Xms1g -Xmx4g' nextflow run main.nf -profile docker
```

Nextflow automatically caches completed tasks:

```bash
nextflow run main.nf -profile docker -resume
```

```
results/
├── bedtools/
│   └── sample1/
│       ├── sample1.Total.csv
│       ├── sample1.3prime.csv
│       └── sample1.5prime.csv
├── aggregated/
│   └── sample1/
│       └── sample1_aggregated.csv
├── filtered/
│   └── sample1/
│       └── sample1_filtered.csv
├── prediction_predictions/
│   └── sample1/
│       ├── PRED_sample1_filtered_Predictions_TM.csv
│       ├── PRED_sample1_filtered_Predictions_TM_Summary.csv
│       └── PRED_sample1_filtered_ClustAnnotation.saf
└── featurecounts/
    ├── sample1_featurecounts.txt
    └── sample1_featurecounts.txt.summary
```
Example timings for one sample (a ~10 MB sorted BAM file) run on a desktop with 64 GB RAM, a ~4 GHz CPU, and a 12 GB GPU:
| Stage | Time (CPU) | Time (GPU) |
|---|---|---|
| Bedtools | --- | --- |
| Filter | --- | --- |
| Prediction | --- | --- |
| FeatureCounts | --- | --- |
| Total | --- | --- |
Docker image: https://hub.docker.com/r/mitopozzi/rnasign
The Docker image includes pre-configured conda environments:
- `bedtools_env`: bedtools (coverage generation)
- `featurecounts_env`: subread (featureCounts quantification)
- `python_env_cpu`: Python 3.8+, PyTorch (CPU), torch-geometric, scikit-learn, XGBoost, pandas, numpy, tqdm
- `python_env_gpu`: same as the CPU environment, but with PyTorch built with CUDA support
"Process run_prediction_script failed"
- Check that model files exist in
res/directory - Verify model file permissions
"No bam_file found"
- Verify
--input_bamspath pattern - Ensure BAM files have
.bamextension - Check file permissions
"Conda environment not found"
- Ensure using Docker/Singularity profile
- If using standard profile, manually create conda environments
- Check Docker image is properly pulled
Out of memory
- Reduce number of parallel samples
GPU not detected
- Check nvidia-smi output
- Verify CUDA drivers installed
- Ensure Docker has GPU access (--gpus flag)
Missing output files
- Check `results/` directory permissions
- Verify `publishDir` paths in main.nf
- Review `.nextflow.log` for errors
If you use this pipeline in your research, please cite:

