Sequence Assembly
The essential problem that sequence assembly tries to overcome is that the average microbial genome is about 2,000,000 bp, while the typical sequence read length is 150-300 bp for Illumina sequences and up to 50,000 bp for PacBio or Nanopore sequences.
With such (relatively) short sequences, how can we assemble a whole, or nearly whole, genome?
The answer is by sequencing the same thing over and over again! If we start each sequence at a random location, and we have enough sequences, eventually the sequences overlap one another and we can join them together to form what we call contigs.
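To get a feel for how many sequences "enough" is, here is a minimal Python sketch of the usual coverage arithmetic: average coverage is the number of reads times the read length, divided by the genome size. The function names and the 50x target are illustrative assumptions, not values from this tutorial, and the uncovered-fraction estimate assumes reads start at uniformly random positions (a Poisson approximation).

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases hit by no read at all.

    Assumes random read start positions (Poisson approximation, e^-C).
    """
    return math.exp(-coverage)


genome_size = 2_000_000  # average microbial genome, as quoted above
read_length = 150        # short Illumina read
target = 50              # hypothetical target coverage, chosen for illustration

print(reads_needed(target, read_length, genome_size))
print(fraction_uncovered(target))
```

Even at modest coverage the expected uncovered fraction drops quickly, which is why random starting points plus many reads are enough to span the genome.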
There are four types of sequence assembly algorithms:
- Naive assemblers, which just try to find all matching pairs of reads.
- Greedy assemblers, which start with one read and keep adding reads until they cannot find any more matches, and then start again with the next read.
- Overlap-layout-consensus assemblers, which lay out the reads by looking for overlaps between them. The overlaps are usually refined by a Smith-Waterman search, and then a consensus is constructed.
- de Bruijn graph assemblers, which break the reads into short subsequences of length k (k-mers), build a graph linking overlapping k-mers, and read contigs out of paths through that graph.
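The greedy strategy in the list above can be sketched in a few lines of Python. This is a toy illustration only, not how any of the assemblers listed here is implemented: it uses exact suffix-prefix matches, ignores sequencing errors and reverse complements, and the function names and the min_len parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Start with one read, keep adding overlapping reads, then start again."""
    remaining = list(reads)
    contigs = []
    while remaining:
        contig = remaining.pop(0)  # seed the contig with the next unused read
        while True:
            best_size, best_idx, best_side = 0, None, None
            for idx, read in enumerate(remaining):
                right = overlap(contig, read, min_len)  # read extends the right end
                left = overlap(read, contig, min_len)   # read extends the left end
                if right > best_size:
                    best_size, best_idx, best_side = right, idx, "right"
                if left > best_size:
                    best_size, best_idx, best_side = left, idx, "left"
            if best_idx is None:
                break  # no more matches: this contig is finished
            read = remaining.pop(best_idx)
            if best_side == "right":
                contig = contig + read[best_size:]
            else:
                contig = read + contig[best_size:]
        contigs.append(contig)
    return contigs


# Three overlapping reads, deliberately out of order.
reads = ["TAGGCATC", "CATCCGTA", "AGCTTAGG"]
print(greedy_assemble(reads))
```

Real genomes contain repeats, where the longest overlap is not always the correct one, which is one reason the graph-based approaches are preferred in practice.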
This table describes some of the common sequence assemblers that you will run across.
Name | Type | Sequencing Tech | Citation | Documentation | Homepage |
---|---|---|---|---|---|
SPAdes | genomes, single-cell, metagenomes, ESTs | Illumina, Solexa, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore | Nurk et al. 2013 | version 3.12 manual | SPAdes |
Velvet | genomes | Sanger, 454, Solexa, SOLiD | Zerbino and Birney, 2008 | version 1.12 manual | EBI |
Canu | genomes | PacBio/Oxford Nanopore reads | Koren et al. 2017 | manual for all versions | Git repo |
MaSuRCA | Any size, haploid/diploid genomes | Illumina and PacBio/Oxford Nanopore data, legacy 454 and Sanger data | Zimin A, et al. 2017 | Git Repo | Git Repo |
megahit | Ultra-fast and memory efficient NGS assembler | Illumina | Li et al. | Git repo | |
Hinge | Small microbial genomes | PacBio/Oxford Nanopore reads | Kamath et al. 2017 | jupyter notebook | Git repo |
Unicycler | Illumina-only data, and can optimize SPAdes | Short reads | Wick et al. | Git Repo | |
Flye | De novo assembly of long reads (PacBio/Oxford Nanopore), but can also combine other assemblies | Long reads, other assemblies | Kolmogorov et al. | Git Repo | |
miniasm + minipolish | Long read assembler and polishing together | Long reads | Wick and Holt | Git repo | |
raven | Assembler for long, uncorrected reads | Long reads | TBD | Git repo | |
Trycycler | Not really an assembler, per se, but more an approach to merging assemblies. | Other assemblies | DOI:10.5281/zenodo.3965017 | Git repo | |
For the most comprehensive comparison of sequence assemblers, we encourage you to review Wick RR, Holt KE. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research. 2019;8(2138).
We use SPAdes, the St. Petersburg genome assembler. The version installed on the AWS instances is 3.12.0, so refer to the version 3.12 manual listed in the table above.
For Nanopore reads we typically use the Canu assembler.
Running SPAdes
SPAdes is easy to run! The basic command is
spades.py
The program takes a couple of inputs: your fastq files, for example those that you download from ../Databases/SRA.
If you have paired-end reads, you need to add -1 for the left pairs (the file called xxx_1.fastq) and -2 for the right pairs (the file called xxx_2.fastq). Note that SPAdes handles gzip-compressed files, so you do not need to decompress them!
If you have unpaired reads, you can specify them with the -s flag.
You also need to provide the name of an output directory where the results will be written, using the -o flag.
Your final command might look something like:
spades.py -1 fastq/ERS011900_pass_1.fastq.gz -2 fastq/ERS011900_pass_2.fastq.gz -o assembly
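If you want to run the same command for many samples, one option is to assemble the argument list in Python first. The sketch below is only an illustration: build_spades_command is a hypothetical helper, it uses just the -1, -2, -s and -o flags described above, and the actual call to spades.py is left commented out so the script can be inspected on a machine where SPAdes is not installed.

```python
import shlex
import subprocess  # only needed if you uncomment the run() call below


def build_spades_command(output_dir, left=None, right=None, single=None):
    """Return the spades.py argument list for paired and/or unpaired reads."""
    if bool(left) != bool(right):
        raise ValueError("paired-end reads need both a left (-1) and a right (-2) file")
    if not left and not single:
        raise ValueError("provide paired reads, unpaired reads, or both")
    cmd = ["spades.py"]
    if left:
        cmd += ["-1", str(left), "-2", str(right)]
    if single:
        cmd += ["-s", str(single)]
    cmd += ["-o", str(output_dir)]
    return cmd


cmd = build_spades_command(
    "assembly",
    left="fastq/ERS011900_pass_1.fastq.gz",
    right="fastq/ERS011900_pass_2.fastq.gz",
)
print(shlex.join(cmd))  # show the command line before running anything

# Uncomment on a machine where spades.py is on your PATH:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than one long string, avoids quoting problems if a file name contains spaces.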
SPAdes output files
SPAdes makes a lot of files and directories in the output, and this summarizes what those files are. Of course, more details can be found in the SPAdes manual.
- scaffolds.fasta contains the scaffolds generated by SPAdes and is the output file you want to use.
- the directory /corrected/ contains reads corrected by BayesHammer in compressed fastq format.
- contigs.fasta contains the contigs before they are scaffolded into scaffolds. Often this is similar to scaffolds.fasta, depending on how much scaffolding information there is.
- assembly_graph.gfa contains the assembly graph and scaffolds paths in GFA 1.0 format.
- assembly_graph.fastg contains the assembly graph in FASTG format.
- contigs.paths contains paths in the assembly graph corresponding to contigs.fasta. This is how the graph is resolved into contigs.
- scaffolds.paths contains paths in the assembly graph corresponding to scaffolds.fasta.
- K21, K33, K55, etc. are directories containing the de Bruijn graph assemblies for different lengths of k.
- before_rr.fasta contains the assembled contigs before repeat resolution has been applied.
- dataset.info and input_dataset.yaml contain information about the sequence read files that were supplied.
- params.txt is a summary of all the SPAdes parameters.
- spades.log is the log that was printed to the screen while SPAdes was running. This contains lots of information about the assembly process.
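
Once the assembly has finished, a quick sanity check is to count the sequences in scaffolds.fasta and look at their lengths. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the assembly directory name is taken from the example command above, and N50 (the length of the scaffold at which the running total of lengths, longest first, reaches half of the assembly) is a commonly used summary statistic rather than something SPAdes writes to these files.

```python
from pathlib import Path


def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0  # a new record starts at each header line
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length at which the longest sequences cover half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(path):
    """Basic statistics for one FASTA file."""
    lengths = read_fasta_lengths(path)
    return {
        "sequences": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


fasta = Path("assembly") / "scaffolds.fasta"  # output directory from the example command
if fasta.exists():
    print(summarize(fasta))
```

Running the same summary on contigs.fasta gives a rough idea of how much the scaffolding step joined contigs together.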