Removing host genome contamination
We have writen several tools to remove host genomes like Deconseq and written blog posts to describe how to remove host genomes. It’s critical that you do this before you analyse your data, because otherwise you will get spurious results when you do your other analyses.
Step 1. Find your genome
You need to find a host genome that you want to remove. If you are using a non-model organism, take a look at the NCBI Genomes collection that has a lot of curated genomes.
If you are using the human genome, the NCBI has some specific resources:
There are several human genome versions specifically designed for inclusion in pipelines.
A. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
A gzipped file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit.
Note: the two PAR regions on chrY have been hard-masked with Ns.
The chromosome Y sequence provided therefore has the same
coordinates as the GenBank sequence but it is not identical to the
GenBank sequence. Similarly, duplicate copies of centromeric arrays
and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked
with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
Note: The EBV sequence is not part of the genome assembly but is
included in the analysis set as a sink for alignment of reads that
are often present in sequencing samples.
B. GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
6. alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units.
C. GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_full_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
D. GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
We usually use GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz which contains everything, but you do need to remember that it contains adenovirus sequences and if you are interested in those they may get removed.
If you want to download it, you can use wget
to get the file:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GRCh38_major_release_seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
Use minimap2 and samtools to filter the host sequences
You can use any HTS mapper, like bowtie2 or BWA, but nowadays we prefer minimap2 because it is a lot faster.
In this example, we’re using the human genome from above. But if you have downloaded a different genome, just replace your fasta file. Note that another advantage of minimap2
is that it handles gzip
compressed files natively, you don’t need to uncompress them!
minimap2 --split-prefix=tmp$$ -a -xsr GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz R1.fastq.gz R2.fastq.gz | samtools view -bh | samtools sort -o output.bam
samtools index output.bam
Filter mapped reads
Now, we use samtools
flags to filter out the host and non host sequences. The reads that ARE mapped are host, while the reads that ARE NOT mapped are non-host.
We use samtools to get fastq output from the .bam
format files.
You can find out what the flags mean using the samtools flag explainer
Here is the samtools specification, and the description of the columns is on page 6.
host sequences
mkdir host not_host
samtools fastq -F 3588 -f 65 output.bam | gzip -c > host/output_S_R1.fastq.gz
echo "R2 matching host genome:"
samtools fastq -F 3588 -f 129 output.bam | gzip -c > host/output_S_R2.fastq.gz
sequences that are not host
samtools fastq -F 3584 -f 77 output.bam | gzip -c > not_host/output_S_R1.fastq.gz
samtools fastq -F 3584 -f 141 output.bam | gzip -c > not_host/output_S_R2.fastq.gz
samtools fastq -f 4 -F 1 output.bam | gzip -c > not_host/output_S_Singletons.fastq.gz