Kraken2 uses k-mers to identify the taxonomy of the microbes in your sample. In essence, its developers took all complete genomes and identified the k-mers that are unique to each taxonomic level. Through some nifty computing and special data structures, they worked out how to search those k-mers very efficiently.
There is a wide range of pre-built Kraken2 databases that you can download, so you do not need to go to the effort of building them yourself.
When installing Kraken2, I recommend setting the KRAKEN2_DEFAULT_DB environment variable, and then you do not need to specify the database on the command line.
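As a minimal sketch of that setup, the line below could go in your shell startup file (for example ~/.bashrc). The database path is a hypothetical location chosen for illustration, so point it at wherever you unpacked a pre-built database.

```shell
# Hypothetical location: replace with the directory that holds your
# downloaded Kraken2 database files.
export KRAKEN2_DEFAULT_DB="$HOME/kraken2_db/standard"
```

If you prefer not to set the variable, you can pass the same directory with the --db option each time you run kraken2.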
To run Kraken2, use this incantation:
kraken2 --paired --threads 4 --report kraken_taxonomy.txt --output kraken_output.txt \
    fastq/reads_1.fastq fastq/reads_2.fastq
This will output two files:
kraken_output.txt contains the standard kraken output:
- A code (C or U) indicating whether the read was classified or not
- The read ID from the fastq file
- The taxonomy ID assigned to the read if it is classified, or 0 if it is not classified
- The length of the sequence in base pairs. Because we are using paired end reads, there are two lengths (R1|R2)
- A space-separated list of taxonomy ID:count pairs giving the lowest common ancestor (LCA) assigned to each k-mer in the sequence, which indicates how many k-mers map to which taxonomic IDs. Because we have paired end information, there is a |:| separator between the R1 and R2 information
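The fields listed above are tab-separated, one read (or read pair) per line, so a quick summary can be sketched with awk. The two function names below are hypothetical helpers for illustration, not part of Kraken2.

```shell
# Hypothetical helper: tally classified (C) and unclassified (U) reads
# using the first tab-separated column of the per-read output.
count_classified() {
    awk -F'\t' '
        $1 == "C" { classified++ }
        $1 == "U" { unclassified++ }
        END { printf "classified\t%d\nunclassified\t%d\n", classified, unclassified }
    ' "$1"
}

# Hypothetical helper: list the most frequently assigned taxonomy IDs
# (third column) among classified reads as "count<TAB>taxid" lines.
# The optional second argument limits the number of lines (default 10).
top_taxids() {
    awk -F'\t' '$1 == "C" { print $3 }' "$1" \
        | sort | uniq -c | sort -k1,1nr | head -n "${2:-10}" \
        | awk '{ print $1 "\t" $2 }'
}
```

For example, `count_classified kraken_output.txt` prints the two totals, and `top_taxids kraken_output.txt 5` prints the five taxonomy IDs with the most reads.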
kraken_taxonomy.txt contains the standard kraken report:
- Percent of fragments at that taxonomic level
- Number of fragments at that taxonomic level (the sum of fragments at this level and all those below this level)
- Number of fragments exactly at that taxonomic level
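- A taxonomic level code: (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. If the taxonomy is not one of these, the code is that of the closest ancestor with one of these ranks followed by a number indicating the levels between this node and that ancestor (for example, G2). See the docs for more information.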
- NCBI taxonomy ID
- Scientific name
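Because the report is a plain tab-separated table, pulling out the species-level rows is a short awk filter. This is a sketch that assumes the default six-column report described above (extra report options add columns and would shift the field numbers), and the function name is a hypothetical helper.

```shell
# Hypothetical helper: print species-level rows (rank code "S") from a
# six-column Kraken2 report as "percent<TAB>fragments<TAB>name",
# sorted by the number of fragments, highest first.
top_species() {
    awk -F'\t' '
        $4 == "S" {
            pct = $1;  gsub(/ /, "", pct)     # percentages are space-padded
            name = $6; sub(/^ +/, "", name)   # names are indented by depth
            printf "%s\t%s\t%s\n", pct, $2, name
        }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}
```

Running `top_species kraken_taxonomy.txt | head` shows the most abundant species first. Rows with codes such as S1 (below species) are deliberately skipped by the exact match on "S".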
For more information about Kraken2, see the wiki page