
ComputationalGenomicsManual

Rob's manual for the computational genomics and bioinformatics class.

Sequence Assembly

The essential problem that sequence assembly tries to overcome is that the average microbial genome is about 2,000,000 bp, while the typical sequence read length is 150-300 bp for Illumina sequences and up to 50,000 bp for PacBio or Nanopore sequences.

With such (relatively) short sequences, how can we assemble a whole, or nearly whole, genome?

The answer is to sequence the same genome many times over! If each read starts at a random location, and we have enough reads, eventually we can join overlapping reads together to form what we call contigs.
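How many reads count as "enough" is usually expressed as coverage: the number of reads times the read length, divided by the genome size. The sketch below applies that arithmetic to the genome size and read lengths quoted above. The function names and the 50x target are illustrative assumptions, not recommendations from this manual.

```python
import math


def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is sequenced."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(genome_size_bp * target_coverage / read_length_bp)


# A 2,000,000 bp genome sequenced with 150 bp reads.
# The 50x target is an assumed value chosen only for illustration.
print(reads_needed(2_000_000, 150, 50))  # 666667
```

Because read start positions are random, this is an average: some regions will be covered more deeply than others, which is one reason to sequence well beyond 1x.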

There are four types of sequence assembly algorithms:

  1. Naive assemblers, which just try to find all matching pairs of reads.
  2. Greedy assemblers, which start with one read and keep adding reads until they cannot find any more matches, and then start again with the next read.
  3. Overlap-layout-consensus assemblers, which lay out the reads by looking for overlaps between them. The overlaps are usually refined by a Smith-Waterman search, and then a consensus is constructed.
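  4. de Bruijn graph assemblers, which break the reads into overlapping k-mers, link k-mers that overlap by k-1 bases into a graph, and read contigs from paths through that graph.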
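As a toy illustration of the greedy strategy in item 2, the sketch below seeds a contig with one read, repeatedly attaches the remaining read with the longest exact overlap at either end, and starts a new contig when nothing else matches. It is a minimal sketch under strong assumptions: reads are error-free, all on the same strand, and overlaps are exact. The function names and the `min_length` parameter are made up for this example, and real assemblers are far more sophisticated.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_length.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Grow contigs one at a time by attaching the best-overlapping read."""
    remaining = list(reads)
    contigs = []
    while remaining:
        contig = remaining.pop(0)  # seed a new contig with the next read
        while True:
            best_len, best_idx, best_side = 0, None, None
            for idx, read in enumerate(remaining):
                right = overlap(contig, read, min_length)  # read extends the 3' end
                left = overlap(read, contig, min_length)   # read extends the 5' end
                if right > best_len:
                    best_len, best_idx, best_side = right, idx, "right"
                if left > best_len:
                    best_len, best_idx, best_side = left, idx, "left"
            if best_idx is None:
                break  # no remaining read overlaps this contig
            read = remaining.pop(best_idx)
            if best_side == "right":
                contig = contig + read[best_len:]
            else:
                contig = read + contig[best_len:]
        contigs.append(contig)
    return contigs


print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Taking the longest overlap at every step is what makes the approach greedy. It is simple, but repeated sequences can lead it to join reads that do not belong together.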

This table describes some of the common sequence assemblers that you will run across.

| Name | Type | Sequencing Tech | Citation | Documentation | Homepage |
| --- | --- | --- | --- | --- | --- |
| SPAdes | genomes, single-cell, metagenomes, ESTs | Illumina, Solexa, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore | Nurk et al. 2013 | version 3.12 manual | SPAdes |
| Velvet | genomes | Sanger, 454, Solexa, SOLiD | Zerbino and Birney, 2008 | version 1.12 manual | EBI |
| Canu | genomes | PacBio/Oxford Nanopore reads | Koren et al. 2017 | manual for all versions | Git repo |
| MaSuRCA | Any size, haploid/diploid genomes | Illumina and PacBio/Oxford Nanopore data, legacy 454 and Sanger data | Zimin A, et al. 2017 | Git repo | Git repo |
| megahit | Ultra-fast and memory efficient NGS assembler | Illumina | Li et al. | Git repo | |
| Hinge | Small microbial genomes | PacBio/Oxford Nanopore reads | Kamath et al. 2017 | Jupyter notebook | Git repo |
| Unicycler | Illumina-only data, and can optimize SPAdes | Short reads | Wick et al. | Git repo | |
| Flye | De novo assembly of long reads (PacBio/Oxford Nanopore), but can also combine other assemblies | Long reads, other assemblies | Kolmogorov et al. | Git repo | |
| miniasm + minipolish | Long read assembler and polishing together | Long reads | Wick and Holt | Git repo | |
| raven | Assembler for long, uncorrected reads | Long reads | TBD | Git repo | |
| Trycycler | Not really an assembler, per se, but more an approach to merging assemblies | Other assemblies | DOI:10.5281/zenodo.3965017 | Git repo | |

For the most comprehensive comparison of sequence assemblers, we encourage you to review Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2019;8(2138).

We use SPAdes, the St. Petersburg genome assembler. The version installed on the AWS instances is 3.12.0, and the details are in the SPAdes 3.12 manual.

For Nanopore reads we typically use the Canu assembler.

Running SPAdes

SPAdes is easy to run! The basic command is

spades.py

The program takes a couple of inputs. The first is your fastq files, for example the ones that you download from ../Databases/SRA.

If you have paired-end reads, you need to add -1 for the left reads (the file called xxx_1.fastq) and -2 for the right reads (the file called xxx_2.fastq). Note that SPAdes handles gzip-compressed files, so you do not need to decompress them!

If you have unpaired reads, you can specify them with the -s flag.

You also need to provide an output directory name where the results will be written using the -o flag.

Your final command might look something like:

spades.py -1 fastq/ERS011900_pass_1.fastq.gz -2 fastq/ERS011900_pass_2.fastq.gz -o assembly
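If you are assembling many samples, it can help to build the command programmatically instead of typing it each time. The sketch below assembles the argument list from the four flags described above (-1, -2, -s and -o) and prints it without running anything. The helper name `build_spades_command` and its parameters are hypothetical, not part of SPAdes.

```python
import shlex


def build_spades_command(output_dir, left=None, right=None, single=None):
    """Return a spades.py argument list using only -1, -2, -s and -o.

    This is a hypothetical helper for illustration; it does not run SPAdes.
    """
    if (left is None) != (right is None):
        raise ValueError("paired-end reads need both a left and a right file")
    if left is None and single is None:
        raise ValueError("provide paired-end files, an unpaired file, or both")
    command = ["spades.py"]
    if left is not None:
        command += ["-1", left, "-2", right]
    if single is not None:
        command += ["-s", single]
    command += ["-o", output_dir]
    return command


command = build_spades_command(
    "assembly",
    left="fastq/ERS011900_pass_1.fastq.gz",
    right="fastq/ERS011900_pass_2.fastq.gz",
)
# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
print(shlex.join(command))
```

On a machine where SPAdes is installed, the returned list could be handed to `subprocess.run(command, check=True)` to start the assembly.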

SPAdes output files

SPAdes makes a lot of files and directories in the output directory, and this section summarizes what they are. Of course, more details can be found in the SPAdes manual.