Snakemake pipeline to validate mutants made in the lab
This pipeline performs two steps:
- First, it creates synteny dot plots from long reads using an in-house pipeline, cauris_dotplot.
- Second, it runs Snippy, a variant-calling tool that takes paired-end reads and a reference genome.
The workflow generates all of its output in the prefix folder set in config/config.yaml. Each workflow step gets its own folder, as shown below. This structure provides a general view of how outputs are organized, with each tool or workflow step having its own directory. Note that this overview does not capture every possible output from each tool; it only highlights the primary directories and some of their contents.
The synteny plots and variant information can be found in the final_results folder.
results/2025-06-03_Test_Strain_Validation_Pipeline/
├── cauris_dotplot_repo
│   ├── dotplot.py
│   ├── highlight_data.tsv
│   ├── make_plots.R
│   ├── README.md
│   ├── run_dotplot.job
│   └── sloop.py
├── dotplots
│   └── MI_KPC_112_C
│       ├── contig_data
│       │   ├── MI_KPC_112_C_contig_data.csv
│       │   └── MI_KPC_112_flye_medaka_polypolish_contig_data.csv
│       ├── MI_KPC_112_C_debug_log.txt
│       ├── nucmer
│       │   ├── MI_KPC_112_C.coord
│       │   └── MI_KPC_112_C.delta
│       └── plots
├── final_results
│   ├── MI_KPC_112_C.csv
│   └── MI_KPC_112_C.pdf
└── snippy
    └── MI_KPC_112_C
        ├── MI_KPC_112_C.aligned.fa
        ├── MI_KPC_112_C.bam
        ├── MI_KPC_112_C.filt.vcf
        ├── MI_KPC_112_C.gff
        ├── MI_KPC_112_C.html
        ├── MI_KPC_112_C.log
        ├── MI_KPC_112_C.raw.vcf
        ├── MI_KPC_112_C.subs.vcf
        ├── MI_KPC_112_C.tab
        ├── MI_KPC_112_C.txt
        ├── MI_KPC_112_C.vcf
        ├── MI_KPC_112_C.vcf.gz
        ├── MI_KPC_112_C.vcf.gz.csi
        ├── reference
        ├── ref.fa -> reference/ref.fa
        └── ref.fa.fai -> reference/ref.fa.fai
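For example, once the test run finishes, the synteny plot and variant table for the test strain should appear directly under final_results, and a quick listing is enough to confirm the run produced them:

ls results/2025-06-03_Test_Strain_Validation_Pipeline/final_results/
# Expected, per the tree above: MI_KPC_112_C.csv  MI_KPC_112_C.pdf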
If you are using Great Lakes HPC, ensure you are cloning the repository in your scratch directory. Change your_uniqname to your uniqname.
cd /scratch/esnitkin_root/esnitkin1/your_uniqname/
Clone the GitHub repository onto your system.
git clone https://github.com/Snitkin-Lab-Umich/strain_validation.git
Ensure you have successfully cloned strain_validation. Type ls and you should see the newly created directory strain_validation. Move to the newly created directory.
cd strain_validation
To ensure a clean starting environment, deactivate and/or remove any modules/packages loaded.
module purge
conda deactivate
Load the Bioinformatics, snakemake, singularity, and R modules from the Great Lakes module system.
module load Bioinformatics snakemake singularity R
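To confirm the modules loaded without errors, you can list what is currently active:

module list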
If you are just testing this pipeline, the config and sample files are already loaded with test data, so you do not need to make any additional changes to them. However, it is a good idea to change the prefix (name of your output folder) in the config file to give you an idea of what variables need to be modified when running your own samples.
As input, the Snakemake workflow takes a config file where you can set the path to a samples sheet, the paths to ONT long reads and Illumina short reads, etc. Instructions on how to modify config/config.yaml are found in config.yaml itself.
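As a rough, non-authoritative sketch of what the config looks like, it might contain entries along the lines of the following; the key names here are illustrative placeholders, and the comments in the shipped config/config.yaml are the source of truth:

# Hypothetical sketch of config/config.yaml -- key names are illustrative, check the real file
prefix: 2025-06-03_Test_Strain_Validation_Pipeline    # name of the output folder created under results/
samples: config/2025-07-16_Test_sample_sheet.csv      # sample sheet described below
long_reads: /path/to/ONT_long_reads/                  # ONT long-read data
short_reads: /path/to/illumina_short_reads/           # Illumina paired-end data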
This sample file should contain 7 columns: Reference_genome, Sample_name, Illumina_F, Illumina_R, ONT_assembly, ref_genome_path_gbk and ref_genome_path_fasta. An example of how the sample sheet should look can be found here: /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/2025-04-16_EXAMPLE_CURING_EXP/2025-07-16_Test_sample_sheet.csv.
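For illustration only, a minimal sheet with those seven columns could look like the sketch below; the paths are made up and the column order may differ, so use the example sheet above as the authoritative template:

Reference_genome,Sample_name,Illumina_F,Illumina_R,ONT_assembly,ref_genome_path_gbk,ref_genome_path_fasta
MI_KPC_112,MI_KPC_112_C,/path/to/MI_KPC_112_C_R1.fastq.gz,/path/to/MI_KPC_112_C_R2.fastq.gz,/path/to/MI_KPC_112_flye_medaka_polypolish.fasta,/path/to/MI_KPC_112.gbk,/path/to/MI_KPC_112.fasta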
To create the samples file, you need to run generate_sample_sheet.py.
A couple of assumptions before you run the aforementioned Python script:
- If you run generate_sample_sheet.py with all the flags, including --dryrun, you should be able to see which files are being moved, where they are being moved, and whether any sequencing data are missing.
- You will be running generate_sample_sheet.py from the Plasmid_curing directory, i.e. here: /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/
- /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/Plasmidsaurus_data/ contains all ONT & Illumina data (*_results and *_Illumina_fastq, where * is the Plasmidsaurus batch name).
- The files /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/Plasmidsaurus_data/MDHHS_hybrid_gbf_paths.txt and /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/MDHHS_hybrid_genome_assembly_paths.txt must not be altered; otherwise, the script will break.
Once you have read the above expectations, copy and paste the help command below to view all options for creating the samples file.
python3 generate_sample_sheet.py -h
An example of what the command would look like:
python3 generate_sample_sheet.py \
--lookup_file /scratch/esnitkin_root/esnitkin1/dhatrib/test_pipelines/Plasmid_curing/Plasmidsaurus_data/2025-07-11_curing_batch_2_sample_lookup.csv \
--output_dir /scratch/esnitkin_root/esnitkin1/dhatrib/test_pipelines/strain_validation/config/ \
--prefix 2025-07-16_Test \
--plasmid_curing_dir /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/
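Since the script moves sequencing files around, it is worth previewing the run first. Per the --dryrun behaviour described above, appending that flag to the same command should only report which files would be moved and whether anything is missing; the angle-bracketed values below are placeholders for your own paths and prefix:

python3 generate_sample_sheet.py \
    --lookup_file <your_lookup_file.csv> \
    --output_dir <path_to_strain_validation>/config/ \
    --prefix <your_prefix> \
    --plasmid_curing_dir /nfs/turbo/umms-esnitkin/Project_MIDGE_Bac/Analysis/Plasmid_curing/ \
    --dryrun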
In config/cluster.json, change the walltime according to the number of samples you have to ensure the jobs are submitted and complete in a timely manner, and update the email flag to your email address.
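For reference, the keys interpolated by the submission command below (account, partition, nodes, procs, pmem, walltime) come from config/cluster.json. A hypothetical default entry might look like the following; the values and the exact name of the email field are illustrative rather than exact, so edit the shipped file rather than copying this sketch:

{
    "__default__": {
        "account": "esnitkin1",
        "partition": "standard",
        "nodes": "1",
        "procs": "2",
        "pmem": "4g",
        "walltime": "02:00:00",
        "email": "youremail@umich.edu"
    }
}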
Preview the steps in Strain validation by performing a dryrun of the pipeline.
snakemake -s workflow/Snakefile --dryrun
Submit the pipeline as a batch job. Change these SBATCH directives:
- --job-name to a more descriptive name such as run_strainval
- --mail-user to your email address
- --time according to the number of samples you have (it should be more than what you specified in config/cluster.json)
Feel free to make changes to the other flags if you are comfortable doing so. The sbat script can be found in the current directory; it is called StrainValidation.sbat. Don't forget to submit the script to Slurm: sbatch StrainValidation.sbat.
#!/bin/bash
#SBATCH --job-name=strain_val
#SBATCH --mail-user=youremail@umich.edu
#SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE
#SBATCH --export=ALL
#SBATCH --partition=standard
#SBATCH --account=esnitkin1
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=3 --mem=10g --time=08:15:00
# Load necessary modules
module load Bioinformatics snakemake singularity R mummer/4.0.0rc1 Rtidyverse/4.4.3
# Run pipeline
snakemake -s workflow/Snakefile --use-envmodules -j 999 --cluster "sbatch -A {cluster.account} -p {cluster.partition} -N {cluster.nodes} -t {cluster.walltime} -c {cluster.procs} --mem-per-cpu {cluster.pmem} --output=slurm_out/slurm-%j.out" --cluster-config config/cluster.json --configfile config/config.yaml --latency-wait 100 --nolock
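Once submitted, you can track the run with standard Slurm commands. The per-job logs are written to the slurm_out/ directory referenced in the --cluster string above, so it is a good idea to create that directory before submitting if it does not already exist; the job ID below is a placeholder for whatever Slurm assigns:

squeue -u your_uniqname                 # check whether the pipeline job is pending or running
tail -f slurm_out/slurm-<jobid>.out     # follow the log of an individual cluster job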