

Rob's manual for the computational genomics and bioinformatics class.

The metagenome assignment

For this assignment we are going to complete some metagenomics analyses. I have provided several metagenomics data sets. Please do not use the Drinking Water data set: it is a 16S sequencing data set and will not work for these analyses. Also, as noted below, if you use the Algae data set you will get the minimum marks possible, because that is the example we’ve worked through and you could just copy and paste the commands without thinking.

Extra credit: If you want to use your own data set, you are welcome to do so.

Extra credit: If you want to find another data set to use, you can search the SRA for a metagenomic data set and use that instead. You should choose a metagenome that has at least three runs associated with it, as later in the assignment we will use those to create metagenome assembled genomes.

Part 1. Describe your metagenomes

Part 2. Annotating the organisms present in the metagenome

First, we are going to identify the organisms present in the metagenome. There are several ways to do that, but I recommend focus, as it is installed on the AWS instances.

There are several other ways to analyse the metagenomes, including mg-rast, MGnify, CLARK, MetaPhlAn, GenomePeek, and plenty of others. In fact, you can read about a host of different software for analysing metagenomes in the CAMI paper: Sczyrba A, Hofmann P, Belmann P, Koslicki D. 2017. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature.

Part 3. Annotating the functions present in the metagenome

Just as with annotating the organisms present in the metagenome, there are several different methods to annotate the functions present in the metagenome.

One approach is to use real time metagenomics, either the web version or the standalone version. Pro tip: the web version is limited in how many queries it can make at once, while the standalone version is not. If there is a class running, you probably want to use the standalone version!

You can also use super-focus. It is installed on the AWS instance, though before you start you will need to download the appropriate databases with this command:

superfocus_downloadDB -a diamond

You can also use mg-rast, MGnify, and of course other software described in the CAMI paper: Sczyrba A, Hofmann P, Belmann P, Koslicki D. 2017. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature.

Extra credit: How do you summarize the functions? We always see stacked bar charts (gross) or pie charts (grosser) for functions present in metagenomes. I always give additional credit for novel, unique, and different data visualization and presentation approaches (although this is not a data visualization class!)

Part 4. Metagenome Assembled Genomes

As noted above, the metagenome needs to have more than one run associated with it for this step to work. Ideally you will have a metagenome with 4 or more runs.
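One way to see why more runs help, assuming (as in Part 5) that contigs are grouped by how well their coverage correlates across runs: with only two runs, any two contigs whose coverage differs between the runs have a Pearson correlation of exactly +1 or -1, so the correlation tells you nothing. The sketch below illustrates this. The coverage values are made up, and `pearson` is a hypothetical helper written for this example, not part of any tool used in class.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat coverage profile has no defined correlation; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)

# Invented coverage of two unrelated contigs across only two runs.
contig_a_two_runs = [3.0, 17.0]
contig_b_two_runs = [250.0, 251.0]
r_two = pearson(contig_a_two_runs, contig_b_two_runs)

# The same two contigs with two more (invented) runs added.
contig_a_four_runs = [3.0, 17.0, 8.0, 12.0]
contig_b_four_runs = [250.0, 251.0, 300.0, 120.0]
r_four = pearson(contig_a_four_runs, contig_b_four_runs)

print("two runs: ", round(r_two, 3))
print("four runs:", round(r_four, 3))
```

With two runs the two contigs look perfectly correlated even though their coverage has nothing in common. With four runs the correlation drops well below any sensible binning threshold.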

For this aim, we’re going to use cross-assembly to analyze the metagenomes and try to identify complete genomes present in the data.

As you work through the steps associated with cross assembly, here are some questions to answer:

Part 5. Checking the metagenome assembled genomes with CheckM

Once we have assembled the genomes, we want to check the completeness and contamination of the bins. Take a set of highly correlated contigs and place them in their own directory. Together they form a metagenome bin.
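A minimal sketch of how such a bin could be picked, assuming you already have a table of per-contig coverage across your runs. The contig names and coverage values are invented, and `pearson` and `select_bin` are hypothetical helpers for illustration only. The sketch keeps every contig whose coverage profile has a Pearson correlation above a threshold with a chosen seed contig:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A flat coverage profile has no defined correlation; report 0.0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)

def select_bin(coverage, seed, threshold=0.95):
    """Return the sorted contig IDs whose coverage correlates with the seed above threshold."""
    seed_profile = coverage[seed]
    return sorted(
        contig
        for contig, profile in coverage.items()
        if pearson(seed_profile, profile) > threshold
    )

# Invented example: coverage of four contigs across four runs.
coverage = {
    "contig_1": [10.0, 20.0, 30.0, 40.0],
    "contig_2": [11.0, 19.0, 31.0, 42.0],  # tracks contig_1 closely
    "contig_3": [40.0, 30.0, 20.0, 10.0],  # opposite trend
    "contig_4": [5.0, 5.0, 5.0, 5.0],      # flat coverage
}

bin_contigs = select_bin(coverage, seed="contig_1")
print(bin_contigs)
```

The resulting list of contig IDs is what you would use to pull the matching sequences out of your assembly and write them into the directory for the bin. Changing `threshold` is an easy way to explore looser or tighter definitions of "highly correlated".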

Next, we’re going to run CheckM on those contigs, using the description here.

Extra credit: “Highly correlated contigs” is not a well-defined term! You probably used a Pearson correlation > 0.95, since that was in the previous question. What happens to completeness and contamination as you decrease the correlation coefficient of the contigs in your bin?

Part 6. Visualization with anvi’o.

Finally, we’re going to work through the anvi’o workflow to create a visualization of our bins.

The workflow is described here and is also thoroughly described on the anvi’o website.

Extra credit: Can you use anvi’o to refine the quality of your bins? If you do that, what happens to the CheckM scores?

Part 7. Discussion

In the beginning you described the metagenome and why it was sequenced.