Nextflow pipeline for polishing an assembly with short reads and freebayes
When using error-prone long reads (e.g., PacBio CLR) to assemble a genome, it is often necessary to use short reads (e.g., Illumina) to error-correct ("polish") the assembly. This repository contains a Nextflow pipeline implementing the Vertebrate Genomes Project best practices for using freebayes to polish an assembly (see their GitHub for more information).
- Short reads
- An assembly created from error-prone long reads
- nextflow, which can be installed with the command:
curl -s https://get.nextflow.io | bash
This pipeline is set up to use mamba to create an environment containing the programs it depends on, but you can always install them yourself or use modules. See the configuration section below.
Nextflow configuration is handled by the file nextflow.config in the directory
where you run the nextflow command. The configuration file in this
repository is for running the pipeline on the Lewis cluster at Mizzou using
SLURM and mamba, but you can adjust it to use any batch or cloud system you
want. Check out the Nextflow docs
for more information.
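As a rough illustration of adjusting the configuration, the sketch below writes a small override file that switches from SLURM to Nextflow's local executor. The file name local.config is a hypothetical choice, and the conda settings assume the pipeline's processes declare conda environments, which this README implies but does not spell out. Settings in the repository's own nextflow.config may still take precedence for individual processes.

```shell
# Sketch only: write a small override file instead of editing nextflow.config.
# The file name local.config is a hypothetical choice.
cat > local.config <<'EOF'
// Run every process on the current machine instead of submitting to SLURM.
process.executor = 'local'

// Assumption: the pipeline's processes declare conda environments.
conda.enabled = true
conda.useMamba = true
EOF

# Then pass the file at launch with Nextflow's -c option, for example:
#   nextflow run WarrenLab/shortread-polish-nf -c local.config \
#       --assembly unpolished_assembly.fa --sra SRX1234567
```

Keeping overrides in a separate file leaves the cluster configuration untouched, so the same checkout can be used on and off the cluster.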
Nextflow needs a filesystem that allows file locking to keep track of which
jobs are running, but the data and temporary files you create can live on a
filesystem without locking. On Lewis, HTC allows locking but is slow, and HPC
is fast but does not allow locking. To get the best of both worlds, run this
pipeline from within a project directory in HTC, but set the environment
variable $NXF_WORK to point to an empty directory on HPC.
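The setup described above might look something like the following sketch. The variables HTC_PROJECT_DIR and HPC_SCRATCH_DIR are hypothetical names introduced for illustration, and their default values exist only so the snippet runs anywhere; substitute your own directories on each filesystem.

```shell
# Hypothetical locations: replace both with your own directories.
# The defaults below only exist so the sketch runs anywhere.
HTC_PROJECT_DIR="${HTC_PROJECT_DIR:-$PWD}"
HPC_SCRATCH_DIR="${HPC_SCRATCH_DIR:-${TMPDIR:-/tmp}/polish-scratch.$$}"

# Point Nextflow's work directory at the fast filesystem.
export NXF_WORK="$HPC_SCRATCH_DIR/work"
mkdir -p "$NXF_WORK"

# The text asks for an empty directory, so warn if it is not.
if [ -n "$(ls -A "$NXF_WORK")" ]; then
    echo "warning: $NXF_WORK is not empty" >&2
fi

# Launch from the filesystem that allows locking.
cd "$HTC_PROJECT_DIR"
```

Run the nextflow command from this same shell session so that it inherits the exported NXF_WORK value.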
To download the pipeline and run it on your assembly, just run the command:
nextflow run WarrenLab/shortread-polish-nf \
--assembly unpolished_assembly.fa \
--sra SRX1234567
This will download short reads from SRA, align all the reads to your assembly,
and then use the alignments to correct the assembly. The output will be in
consensus/polished.fa, and there will be a report of the number of changes
made at consensus/report.txt.
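Once the run finishes, a quick sanity check of the two output files could look like the sketch below. The helper count_fasta_records is a hypothetical function written for this example and is not part of the pipeline. The format of report.txt is not described here, so the sketch only prints it.

```shell
# Count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

polished="consensus/polished.fa"
report="consensus/report.txt"

# Report how many sequences the polished assembly contains, if it exists.
if [ -s "$polished" ]; then
    echo "polished assembly contains $(count_fasta_records "$polished") sequences"
else
    echo "no polished assembly found at $polished" >&2
fi

# The report format is not described here, so just print it if present.
if [ -s "$report" ]; then
    cat "$report"
fi
```

Running the same count on unpolished_assembly.fa gives a number to compare against; polishing would normally be expected to change bases within sequences, not the number of sequences.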
You can also use the option --fastq instead of --sra to polish with local
fastq files instead of reads from SRA.