Metagenomics | shotgun workflow tutorial for Data Science and Bioinformatics (UEA, BIO7051B)


LCrossman/Metagenomics_shotgun_tutorial


Metagenomics_shotgun_tutorial

Metagenomics | shotgun workflow tutorial | Data Science and Bioinformatics (BIO-7051B)


Currently NOT FULLY WORKING: this is a draft only.

Learning Objectives

  • Understand the basics of shotgun metagenomics analysis, going from reads to metagenome bins
  • Investigate bins in terms of 'who is there' and 'how much'
  • Understand some limitations of current analyses

Background

You are provided with a paired-end Illumina short read dataset. In this tutorial we will investigate the quality of the reads and align them to a provided metagenome assembly. The assembly is provided for you because assembling metagenomic data requires significant compute time and resources.
Multiple metagenome assembly programs are available; this assembly was produced with MEGAHIT. Once the reads are aligned, we calculate the coverage of the reads on each contig in the assembly. We then use the assembly and the coverage statistics with different binning software to extract single genomes (bins) from the dataset. Finally, we assess the quality of these bins and assign them to microbial taxa.
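
To make "coverage" concrete: the mean coverage of a contig is the total number of read bases aligned to it divided by the contig length. The sketch below is a minimal illustration in Python with made-up numbers. The function name, contig names and values are hypothetical and are not taken from the tutorial scripts.

```python
def mean_coverage(aligned_bases, contig_lengths):
    """Return mean depth per contig: aligned read bases / contig length."""
    coverage = {}
    for contig, length in contig_lengths.items():
        if length <= 0:
            raise ValueError(f"contig {contig!r} has non-positive length")
        # Contigs with no aligned reads get a coverage of 0.
        coverage[contig] = aligned_bases.get(contig, 0) / length
    return coverage


# Toy values (hypothetical): contig lengths in bp and total aligned read bases.
contig_lengths = {"contig_1": 10_000, "contig_2": 2_500, "contig_3": 4_000}
aligned_bases = {"contig_1": 300_000, "contig_2": 12_500}

coverage = mean_coverage(aligned_bases, contig_lengths)
for contig, depth in sorted(coverage.items()):
    print(f"{contig}\t{depth:.1f}x")
```

Contigs that come from the same genome tend to have similar coverage, which is one of the signals binning programs use to group contigs together.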


Step 1. - Read Alignments

(We recommend starting this step before the session, because aligning the reads takes about 20 minutes.)

Contents

1.1 Raw Data

You will find the metagenome assembly in the GitHub repository (CommunityScaffolds.fasta). The paired-end short read data is in /gpfs/data/BIO-DBS/Session8/MG_workflow_2026/RawData/. You will need the following three files for the next step:

Fasta Assembly file: CommunityScaffolds.fasta

Short Reads R1: EDME200007170-1a_HCYHVDSXY_L2_1.fq.gz

Short Reads R2: EDME200007170-1a_HCYHVDSXY_L2_2.fq.gz

1.2 Read Alignments

You have already carried out read alignments in this module. Because these reads have already been quality checked, you can either use the script below to align them or repeat your previous method with BWA. Either way, the required output of this stage is a sorted BAM file containing all the read alignments to the metagenome assembly.

Script: ReadAligns/readalign_bam_and_sort.sh

Purpose: Align Illumina short reads to the assembly (SAM format), convert the alignments to BAM and sort them

Output: A sorted bam file containing read alignment data
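
For orientation, the sketch below builds (but does not run) a typical BWA and samtools command sequence for this stage: index the assembly, align the paired reads, sort the alignments into a BAM file, and index the result. It is an assumption about what such a workflow commonly looks like, not the contents of ReadAligns/readalign_bam_and_sort.sh. The output file name and thread count are hypothetical.

```python
import shlex


def alignment_commands(assembly, reads_r1, reads_r2, out_bam, threads=4):
    """Build (not run) shell commands for index -> align -> sort -> index."""
    q = shlex.quote  # protect file names containing spaces or shell characters
    return [
        # Build the BWA index for the assembly.
        f"bwa index {q(assembly)}",
        # Align paired-end reads and pipe the SAM stream straight into a sort.
        f"bwa mem -t {threads} {q(assembly)} {q(reads_r1)} {q(reads_r2)}"
        f" | samtools sort -@ {threads} -o {q(out_bam)} -",
        # Index the sorted BAM so downstream tools can access it quickly.
        f"samtools index {q(out_bam)}",
    ]


commands = alignment_commands(
    "CommunityScaffolds.fasta",
    "EDME200007170-1a_HCYHVDSXY_L2_1.fq.gz",
    "EDME200007170-1a_HCYHVDSXY_L2_2.fq.gz",
    "CommunityScaffolds.sorted.bam",  # hypothetical output name
)
for command in commands:
    print(command)
```

Piping the aligner output directly into the sort avoids writing a large intermediate SAM file to disk.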


Step 2. - Binning

  • Try to run the binning pipeline
  • Investigate the questions

It is highly recommended to run this on the HPC, because the binning software and CheckM can be tricky to install!

2.1 Binning with MetaBAT2

Here you can use alternative binning software to investigate the metagenome. We first use MetaBAT2 and some Python scripts to identify some bins. We then use binspreader-R to improve the bins.

Script: Binning/binning_workflow.sh

Purpose: Metagenomic binning to extract single genomes from the mixture

Outputs: For each binning program, summary statistics and the bins themselves as a set of genome FASTA files
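
Once you have bin FASTA files, simple per-bin statistics such as the number of contigs, the total length and the N50 are a useful first look. The sketch below is a minimal pure-Python illustration on a toy FASTA held in memory. The function names and sequences are hypothetical and are not part of Binning/binning_workflow.sh.

```python
def read_fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the sequence ID.
            current = header[0] if header else f"unnamed_{len(lengths) + 1}"
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Toy bin (hypothetical sequences); c1 is split across two lines.
toy_bin = """>c1 example contig
ACGTAC
GTAC
>c2
ACGTACGT
>c3
ACGTAC
>c4
ACGT
""".splitlines()

lengths = read_fasta_lengths(toy_bin)
print("contigs:", len(lengths))
print("total bp:", sum(lengths.values()))
print("N50:", n50(lengths.values()))
```

To use this on a real bin, pass an open file handle instead of the toy list, since iterating over a file yields its lines.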

2.2 Binning with SemiBin2

Next you can use a newer machine learning binning algorithm (SemiBin2) to bin the metagenome assembly. SemiBin2 uses deep neural networks to bin the assembly contigs, instead of rule-based and statistical methods.

Script: Binning/ML_binning_workflow.sh

Purpose: Metagenomic binning to extract single genomes from a mixture using a machine learning method

Outputs: Stats and bins as a set of genome fasta files

Step 3. - Assess Bin Quality

Check Quality of the Bins

You will need to check the quality of all the bins using the program CheckM, which outputs statistics such as completeness and contamination.

Script: checkm_quality.sh

Purpose: To identify the quality of each bin.

Output: A set of CheckM output files
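
One common way to interpret completeness and contamination is to sort bins into quality tiers. The sketch below uses thresholds that follow the widely cited MIMAG draft-genome guidelines (high: more than 90% complete and less than 5% contaminated; medium: at least 50% complete and less than 10% contaminated). Treat them as an assumption, since the tutorial does not set cut-offs, and note that the full MIMAG high-quality definition also requires rRNA and tRNA genes, which are not checked here. Bin names and values are made up.

```python
def quality_tier(completeness, contamination):
    """Classify a bin from its completeness and contamination (both in %)."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    # Everything else, including heavily contaminated bins, is treated as low.
    return "low"


# Hypothetical results: bin name -> (completeness %, contamination %).
toy_bins = {
    "bin.1": (97.3, 1.2),
    "bin.2": (71.0, 4.5),
    "bin.3": (55.0, 12.0),
    "bin.4": (32.1, 0.0),
}

tiers = {name: quality_tier(*values) for name, values in toy_bins.items()}
for name, tier in tiers.items():
    completeness, contamination = toy_bins[name]
    print(f"{name}\t{completeness:.1f}\t{contamination:.1f}\t{tier}")
```

Keeping the thresholds in one small function makes it easy to compare how many bins each binning method places in each tier.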

Step 4. - Identify Taxa

Now you can identify taxa on the best quality bins using Kraken2.

Script: taxa_classify_bins.sh

Purpose: Identify Taxa for each bin fasta file

Output: Bin reports - for each bin there will be a taxa report with abundance from Kraken/Bracken
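
A Kraken2 report in its standard form is a tab-separated table with six columns: percentage of sequences in the clade, number of sequences in the clade, number assigned directly to the taxon, rank code, NCBI taxonomy ID, and an indented scientific name. The sketch below parses such a report and picks the most abundant species. It assumes the standard six-column layout, the counts in the toy report are invented, and the function names are hypothetical rather than part of taxa_classify_bins.sh.

```python
def parse_kraken_report(lines):
    """Parse standard six-column Kraken2 report lines into dictionaries."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        records.append({
            "percent": float(fields[0]),
            "clade_count": int(fields[1]),
            "direct_count": int(fields[2]),
            "rank": fields[3].strip(),
            "taxid": fields[4].strip(),
            "name": fields[5].strip(),  # drop the indentation used for hierarchy
        })
    return records


def top_taxon(records, rank="S"):
    """Return the record with the most sequences at the given rank code."""
    candidates = [r for r in records if r["rank"] == rank]
    return max(candidates, key=lambda r: r["clade_count"]) if candidates else None


# Toy report for one bin (invented counts).
toy_report = [
    "  5.00\t5\t5\tU\t0\tunclassified",
    " 95.00\t95\t0\tR\t1\troot",
    " 95.00\t95\t2\tD\t2\t  Bacteria",
    " 80.00\t80\t0\tG\t561\t    Escherichia",
    " 80.00\t80\t80\tS\t562\t      Escherichia coli",
    " 13.00\t13\t0\tG\t1386\t    Bacillus",
    " 13.00\t13\t13\tS\t1423\t      Bacillus subtilis",
]

records = parse_kraken_report(toy_report)
best = top_taxon(records)
print(f"Top species: {best['name']} ({best['percent']:.1f}% of sequences)")
```

A bin in which a second species holds a sizeable share, as in this toy example, may be worth cross-checking against its CheckM contamination estimate.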

Step 5. - Conclusions

Draw some conclusions about your results, both technical and biological.

Step 6. Optional - Plotting scripts

Scripts: XX

Data: Your output from steps 4 or 5.

Purpose: Starting material for Group Project

Output: Result figures


Group Project Ideas

Take the analyses further with the following suggestions:

  • Binning parameters: MetaBAT2 has a large number of parameters that can be used to fine-tune your results. Read the short MetaBAT2 software manual to identify these parameters, alter some of them, and use CheckM to determine whether the changes improved bin quality. Try to draw some conclusions about why.

  • Consider the composition of the provided dataset. You may be able to work out what it is by searching online for the organisms you identified. There are particular features of this dataset that make it both (a) easier and (b) more difficult to bin than some other datasets. What are these features?

  • Data Visualization (use R): Plot graphs of the binning output from Step 2 (e.g. number of bins, length in bp of each FASTA, N50) and/or the CheckM output from Step 3 (e.g. contamination, heterogeneity, completeness for the different binning methods) using R ggplot2 or similar. Do the results show any particular trends?

  • Long Reads: You can find a dataset of long reads for the same data (Oxford Nanopore) in [..] You can use these in the MetaBAT2 binning program together with the short read assembly CommunityScaffolds.fasta, as if they were metagenomic contigs. HINT: in Linux, join the files together using:

cat CommunityScaffolds.fasta LongReads.fasta > TotalDataset.fasta

Use CheckM to determine whether adding the long reads improved the results.

  • Real world Garden Data: Ask me for three separate short read metagenomic datasets from soil in a Professor's garden. Is the microbial composition in the three samples identical/similar or different?
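
For comparing the composition of samples, such as the three garden datasets, one option is a pairwise dissimilarity measure. The sketch below uses Bray-Curtis dissimilarity, which is 0 for identical abundance profiles and 1 for profiles with no taxa in common. The choice of measure is an assumption made here for illustration, not something the tutorial prescribes, and the sample names, taxon names and counts are invented.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two {taxon: count} profiles."""
    taxa = set(sample_a) | set(sample_b)
    # Sum of the smaller count for every taxon (the shared abundance).
    shared = sum(min(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 1 - 2 * shared / total


# Invented taxon counts for three hypothetical samples.
samples = {
    "garden_1": {"taxon_A": 50, "taxon_B": 30, "taxon_C": 20},
    "garden_2": {"taxon_A": 40, "taxon_B": 40, "taxon_C": 20},
    "garden_3": {"taxon_D": 100},
}

names = sorted(samples)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        value = bray_curtis(samples[first], samples[second])
        print(f"{first} vs {second}: {value:.2f}")
```

Because the measure depends on total counts, samples sequenced to very different depths are usually converted to relative abundances before being compared.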
