View on GitHub

ComputationalGenomicsManual

Robs manual for the computational genomics and bioinformatics class.

Annotation Pipelines

There are a few robust and well designed microbial genome annotation pipelines that you can use to analyze your genome sequences. Each has its own benefits and drawbacks, and these may dictate which pipeline you end up using.

Creating an assembled genome to annotate

The same approach that we have talked about in other modules was used to generate a test dataset, namely, downloading fastq data from SRA and then assembling the data with spades. To summarize, these are the commands that were used.

I downloaded the reads from ERS012013, which is part of Kat Holt’s Klebsiella dataset (Project ID PRJEB2111), and assembled them using spades.

fastq-dump --outdir fastq --gzip --skip-technical  --readids --read-filter pass --dumpbase --split-3 --clip ERS012013
spades.py -o assembly -1 fastq/ERS012013_pass_1.fastq.gz -2 fastq/ERS012013_pass_2.fastq.gz

Statistic	Value
Number of sequences	4271
Total length	8,716,579
Shortest contig	56
Longest contig	86,652
N₅₀	11,152
N₇₅	28,825

In all the cases below we use the scaffolds.fasta output from spades for subsequent analysis.

Example annotation using RAST

Start at the RAST website and from Your Jobs choose Upload a new Job. This opens up the file chooser page, and at the file chooser

file chooser

select your scaffolds.fasta file. After that file is uploaded, you are presented with a summary of the contigs. Note that RAST may split some of the scaffolds that spades generated, and thus you may have slightly more contigs and slightly shorter sequence size, as shown here. The split happens on runs of N bases that spades inserts where it can estimate gaps between contigs based on sequence overlap.

RAST summary statistics

The bottom of this page asks for information about the organism you have sequenced. If you enter the taxid, as shown here, the form should populate with information from NCBI.

RAST Job Information

There a series of questions about the annotation pipeline. Two recommended options are to build metabolic models and fix frameshifts, especially if you have a draft genome. Fixing frameshifts is controversial because some genomes (notably Salmonella enterica serovar Typhi) have a large number of frameshifts that are an evolutionary trait!

RAST Pipeline

Note: at this stage you can also choose to customize some of the options for the RAST pipeline.

Example annotation using PROKKA

Note: The PROKKA GiutHub Site contains many other recipes and advances options for annotating the scaffolds.fasta file using PROKKA.

Example submission using PATRIC

To annotate the contigs using PATRIC, I first go to the PATRIC website and log in. If you don’t have an account you will need to create one.

Create a new workspace called Klebsiella by clicking on the Workspaces menu and going to your home directory, and then clicking on the new folder icon on the top right.

Then use the p3 commands to submit the scaffolds.fasta file for annotation as a genome. You will need to follow these installation instructions to install the p3 commands, and at the moment they do not provide a CentOS version so it they are not included on the AWS instance.

Once you have installed p3, you will need to login:

p3-login

and provide the same credentials that you use for the website.

For the command, we need to provide several variables:

Variable	Definition
–contigs-file	the source of the contigs (probably scaffolds.fasta from spades output)
-n	the name we want to use for our genome
-t	the NCBI Taxonomy ID. For Klebsiella pneumoniae this is 573. This is used to ensure that the correct parameters are used for the annotation processes.
-d	the domain (Bacteria, Archaea, Eukarya, or Virus)

Then we provide the workspace and the file name to call it in the workspace.

p3-submit-genome-annotation --contigs-file scaffolds.fasta -n "Klebsiella pneumoniae NT211489B" -t 573 -d Bacteria /redwards@patricbrc.org/home/Klebsiella "Klebsiella pneumoniae NT211489B"