There are a few robust and well-designed microbial genome annotation pipelines that you can use to analyze your genome sequences. Each has its own benefits and drawbacks, and these may dictate which pipeline you end up using.
Creating an assembled genome to annotate
The same approach that we have talked about in other modules was used to generate a test dataset, namely, downloading fastq data from SRA and then assembling the data with spades. To summarize, these are the commands that were used.
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ERS012013
spades.py -o assembly -1 fastq/ERS012013_pass_1.fastq.gz -2 fastq/ERS012013_pass_2.fastq.gz
|Number of sequences||4271|
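The sequence count above can be checked directly from the assembly output. The following is a minimal sketch using `awk`; the function name `fasta_stats` is made up for illustration, and it assumes a plain (uncompressed) FASTA file such as `assembly/scaffolds.fasta`.

```shell
# Count sequences and total bases in a FASTA file.
# Usage: fasta_stats assembly/scaffolds.fasta
# (fasta_stats is a hypothetical helper name, not part of spades.)
fasta_stats() {
    awk '
        /^>/ { n++; next }                 # header line: one more sequence
        {
            gsub(/[ \t\r]/, "")            # ignore stray whitespace
            total += length($0)            # sequence line: add its length
        }
        END { printf "sequences\t%d\nbases\t%d\n", n, total }
    ' "$1"
}
```

Running `fasta_stats assembly/scaffolds.fasta` prints two tab-separated lines, one with the number of sequences and one with the total number of bases.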
In all the cases below we use the scaffolds.fasta output from spades for subsequent analysis.
Example annotation using RAST
Start at the RAST website and from Your Jobs choose Upload a new Job. This opens up the file chooser page, and at the file chooser select the scaffolds.fasta file. After that file is uploaded, you are presented with a summary of the contigs. Note that RAST may split some of the scaffolds that spades generated, and thus you may have slightly more contigs and a slightly shorter total sequence length, as shown here. The split happens on runs of N bases that spades inserts where it can estimate gaps between contigs based on sequence overlap.
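To see why the contig count goes up and the total length goes down, it can help to reproduce this kind of split locally. The sketch below is not RAST's implementation: the function name `split_on_n`, the `_partN` naming of the pieces, and the default minimum run length of 10 are all assumptions made for illustration. Runs of N shorter than the threshold are kept as part of the sequence.

```shell
# Split each scaffold at runs of N that are at least min_n bases long.
# Usage: split_on_n scaffolds.fasta [min_n] > contigs.fasta
# (split_on_n and the default min_n=10 are assumptions, not RAST's actual rule.)
split_on_n() {
    awk -v min_n="${2:-10}" '
        function emit(    rest, piece, k) {
            if (name == "") return
            rest = seq; piece = ""; k = 0
            while (match(rest, /[Nn]+/)) {
                piece = piece substr(rest, 1, RSTART - 1)
                if (RLENGTH >= min_n) {
                    # long run of N: close the current piece and drop the Ns
                    if (piece != "") printf ">%s_part%d\n%s\n", name, ++k, piece
                    piece = ""
                } else {
                    # short run of N: keep it as ambiguous bases
                    piece = piece substr(rest, RSTART, RLENGTH)
                }
                rest = substr(rest, RSTART + RLENGTH)
            }
            piece = piece rest
            if (piece != "") printf ">%s_part%d\n%s\n", name, ++k, piece
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }   # new record
        { seq = seq $0 }                                        # join wrapped lines
        END { emit() }
    ' "$1"
}
```

Because the N runs themselves are discarded, the output has more sequences but fewer total bases than the input, which matches the difference you see in the RAST summary.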
The bottom of this page asks for information about the organism you have sequenced. If you enter the taxid, as shown here, the form should populate with information from NCBI.
There are a series of questions about the annotation pipeline. Two recommended options are to build metabolic models and fix frameshifts, especially if you have a draft genome. Fixing frameshifts is controversial because some genomes (notably Salmonella enterica serovar Typhi) have a large number of frameshifts that are an evolutionary trait!
Note: at this stage you can also choose to customize some of the options for the RAST pipeline.
Example annotation using PROKKA
Note: The PROKKA GitHub site contains many other recipes and advanced options for annotating the scaffolds.fasta file using PROKKA.
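As a starting point before exploring those recipes, a basic PROKKA run on the spades scaffolds might look like the sketch below. The wrapper name `annotate_with_prokka`, the output directory `prokka_annotation`, and the use of the run accession as the file prefix are assumptions for illustration; the genus and species come from the Klebsiella pneumoniae example used in this module.

```shell
# Minimal PROKKA run on the spades scaffolds (a sketch, not the only way to run it).
# Usage: annotate_with_prokka [scaffolds.fasta] [output_dir]
annotate_with_prokka() {
    scaffolds="${1:-assembly/scaffolds.fasta}"   # spades output from the assembly step
    outdir="${2:-prokka_annotation}"             # hypothetical output directory name
    prokka --outdir "$outdir" \
           --prefix ERS012013 \
           --kingdom Bacteria \
           --genus Klebsiella --species pneumoniae \
           "$scaffolds"
}
```

If PROKKA complains about long contig names in the spades output, the recipes on the PROKKA site describe ways to handle that.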
Example submission using PATRIC
To annotate the contigs using PATRIC, I first go to the PATRIC website and log in. If you don’t have an account you will need to create one.
Create a new workspace called Klebsiella by clicking on the Workspaces menu and going to your home directory, and then clicking on the new folder icon on the top right.
Then use the p3 commands to submit the scaffolds.fasta file for annotation as a genome. You will need to follow these installation instructions to install the p3 commands. At the moment they do not provide a CentOS version, so they are not included on the AWS instance.
Once you have installed p3, you will need to log in and provide the same credentials that you use for the website.
For the command, we need to provide several variables:
|--contigs-file||the source of the contigs (probably scaffolds.fasta from spades output)|
|-n||the name we want to use for our genome|
|-t||the NCBI Taxonomy ID. For Klebsiella pneumoniae this is 573. This is used to ensure that the correct parameters are used for the annotation processes.|
|-d||the domain (Bacteria, Archaea, Eukarya, or Virus)|
Then we provide the workspace path and the name to give the genome in that workspace.
p3-submit-genome-annotation --contigs-file scaffolds.fasta -n "Klebsiella pneumoniae NT211489B" -t 573 -d Bacteria /email@example.com/home/Klebsiella "Klebsiella pneumoniae NT211489B"