COMBINE Metagenomics Workshop, 2024
Pawsey, November 11th and 12th, 2024.
Workshop Location
The workshop will be held at the Pawsey Supercomputing Centre, 1 Bryce Avenue, Kensington WA 6151. The workshop is for registered attendees only.
Workshop Schedule
Please note: The workshop schedule is subject to change depending on how quickly or slowly we progress.
Day 1: Monday November 11th
Time | Topic |
---|---|
0900-0915 | Welcome and Introductions. |
0915-1000 | Introduction to metagenomics, Linux, and Bash |
1000-1030 | Bash hands on and practice: using conda |
1030-1100 | Morning coffee |
1100-1200 | Downloading data and filtering human genomes using minimap2 |
1200-1300 | Lunch |
1300-1330 | Introduction to methods for identifying species: focus and SingleM |
1330-1500 | Hands-on with SingleM and focus |
1500-1530 | Afternoon tea |
1530-1600 | Introduction to methods for functional analysis |
1600-1700 | SUPER-FOCUS hands-on |
Day 2: Tuesday November 12th
Time | Topic |
---|---|
0900-0930 | Recap of Day 1 |
0930-1000 | Introduction to binning |
1000-1030 | Megahit assembly |
1030-1100 | Morning coffee |
1100-1200 | Binning by hand |
1200-1230 | Viral identification using Hecatomb |
1230-1330 | Lunch |
1400-1500 | Hands-on data visualisation |
1500-1530 | Afternoon tea |
1530-1645 | Hands-on data visualisation |
1645-1700 | Wrap up and summary |
Download
If you are using an MS Windows machine, please download and install MobaXterm before we start. If you are using a Mac, you are good to go!
Metagenomics
We are going to jump right in with metagenomics, but here is a brief introduction if you want to read something while Rob is talking.
We have created accounts for you on Pawsey, and we will share the usernames and passwords with you at the workshop. These are temporary accounts and will be deleted at the end of the workshop.
Installing software
Our first exercise is installing software using mamba.
Before we begin, we are going to make our lives slightly easier by making an alias, or symbolic link:
ln -s /software/projects/courses01/$USER software
ln -s /scratch/courses01/$USER/miniforge3 software/miniforge3
This will create a symbolic link called software in your home directory, which you can use like a directory.
Important: When you install mamba, it will ask you for a location. Use /home/$USER/software/miniforge3 as the location.
Install conda, fastp, minimap2, and samtools using conda. We will use all of these programs today.
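For example, once mamba is installed, something like this should pull in the other tools (assuming the conda-forge and bioconda channels are available):
mamba install -c conda-forge -c bioconda fastp minimap2 samtools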
You can check that they installed by using the command:
which fastp
If that works, it will tell you!
Downloading Data
We are going to use the CF data that Rob talked about. To start we are just going to download two files, an R1 and an R2 file to work with:
mkdir fastq
cd fastq
curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/refs/heads/master/Datasets/CF/788707_20180129_S_R1.fastq.gz
curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/refs/heads/master/Datasets/CF/788707_20180129_S_R2.fastq.gz
cd
ls
Use fastp to trim bad sequences and remove the adapters.
We are going to use the Illumina Adapters, and trim out:
- Sequences that are shorter than 100 bp
- Sequences that contain more than one N
- Trim the adapters off the 3’ and 5’ ends of the sequences
Do you remember how to download the Illumina Adapters? The URL is https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/SequenceQC/IlluminaAdapters.fa
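For example, using the same curl pattern as above:
curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/SequenceQC/IlluminaAdapters.fa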
Once you have downloaded the adapters, we are going to make a slurm script to run the command on the cluster
Use nano
(or vi
or emacs
) to edit a file, and copy this text:
#!/bin/bash
#SBATCH --job-name=fastp
#SBATCH -o fastp-%j.out
#SBATCH -e fastp-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
mkdir fastp
fastp -n 1 -l 100 -i fastq/788707_20180129_S_R1.fastq.gz -I fastq/788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
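Save the file (a name like fastp.slurm works) and submit it to the queue with:
sbatch fastp.slurm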
When fastp runs, you will get an HTML output file called fastp.html. This shows some statistics about the run.
Filter out the human and non-human sequences.
The NCBI has several human genome versions specifically designed for inclusion in pipelines like this.
A. GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
A gzipped file that contains FASTA format sequences for the following:
1. chromosomes from the GRCh38 Primary Assembly unit.
Note: the two PAR regions on chrY have been hard-masked with Ns.
The chromosome Y sequence provided therefore has the same
coordinates as the GenBank sequence but it is not identical to the
GenBank sequence. Similarly, duplicate copies of centromeric arrays
and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked
with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
Note: The EBV sequence is not part of the genome assembly but is
included in the analysis set as a sink for alignment of reads that
are often present in sequencing samples.
B. GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
6. alt-scaffolds from the GRCh38 ALT_REF_LOCI_* assembly units.
C. GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_full_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
D. GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
A gzipped file that contains all the same FASTA formatted sequences as
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz, plus:
7. human decoy sequences from hs38d1 (GCA_000786075.2)
For this work, we are going to use GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz, which contains the no-alt analysis set plus the hs38d1 decoy sequences.
We have made this data available for you in /scratch/
Use minimap2 and samtools to filter the human sequences
We are going to make another slurm script, called mapping.slurm:
nano mapping.slurm
And copy and paste these contents:
#!/bin/bash
#SBATCH --job-name=mapping
#SBATCH -o mapping-%j.out
#SBATCH -e mapping-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
mkdir -p bam/
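# -a: output SAM; -xsr: short-read preset; --split-prefix gives a prefix for temporary split-index files ($$ is this shell's process ID)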
minimap2 --split-prefix=tmp$$ -t 16 -a -xsr /scratch/courses01/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz \
| samtools view -bh | samtools sort -o bam/788707_20180129.bam
samtools index bam/788707_20180129.bam
Here is the SAM file specification, and the description of the columns is on page 6.
Now, we use samtools flags to filter out the human and not human sequences. You can find out what the flags mean using the samtools flag explainer.
Again, edit a file, this time called filtering.slurm
nano filtering.slurm
And paste these contents:
#!/bin/bash
#SBATCH --job-name=filtering
#SBATCH -o filtering-%j.out
#SBATCH -e filtering-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
mkdir human not_human
samtools fastq -F 3588 -f 65 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R1.fastq.gz
samtools fastq -F 3588 -f 129 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R2.fastq.gz
# sequences that are not human
samtools fastq -F 3584 -f 77 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R1.fastq.gz
samtools fastq -F 3584 -f 141 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R2.fastq.gz
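As a rough guide to those flag values (double-check them with the samtools flag explainer):
# -f keeps reads where ALL of the listed bits are set; -F drops reads where ANY of them are set
# 65   = paired (1) + first in pair (64)                                      -> mapped R1 reads (human)
# 129  = paired (1) + second in pair (128)                                    -> mapped R2 reads (human)
# 77   = paired (1) + read unmapped (4) + mate unmapped (8) + first (64)      -> unmapped R1 reads (not human)
# 141  = paired (1) + read unmapped (4) + mate unmapped (8) + second (128)    -> unmapped R2 reads (not human)
# 3584 = QC fail (512) + duplicate (1024) + supplementary (2048)
# 3588 = 3584 + read unmapped (4)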
Using snakemake
In the above example, we started with two fastq files, removed adapter sequences, mapped them to the human genome, and then separated out the human and not human sequences.
We can combine all of that into a single snakemake file, and it will do all of the steps for us.
See the Snakemake section for details on how to run these two commands in a single pipeline.
Read-based annotations
In metagenomics, there are two fundamental approaches: read-based annotations and assembly-based approaches. We are going to start with read-based annotations.
Predicting the species that are in the sample
SingleM
Take a look at the manual for detailed singlem instructions.
Install singleM with mamba. Note: Here, we introduce named mamba environments. What is the advantage of creating a named environment?
mamba create -n singlem -c bioconda singlem
mamba activate singlem
After installing singlem, you will get a warning from krona. DO NOT run the ktUpdate.sh script. Instead, create a new symlink like so:
rm -rf /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy
ln -s /scratch/courses01/krona/taxonomy /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy
Next, before you use singlem, make sure you add this command:
export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'
Now run singleM on the CF data:
#!/bin/bash
#SBATCH --job-name=singlem
#SBATCH -o singlem-%j.out
#SBATCH -e singlem-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate singlem
export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'
singlem pipe -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -p output_profile.tsv --taxonomic-profile-krona krona.html --threads 16
Unzip the data
Note: Before we carry on, both focus and super-focus require that we unzip the data.
cd not_human
find . -name \*gz -exec gunzip {} \;
cd ..
Focus
Another way to identify the species present is to use FOCUS
We create a mamba environment just for focus:
mamba create -n focus -c bioconda focus
Now, we need to unpack the database. Here is a trick, since we don’t know exactly where the database is:
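# find's -printf "%h\n" prints just the directory that contains each match, so $FOCUS ends up holding the directory with db.zip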
FOCUS=$(find software/miniforge3/envs/ -name db.zip -printf "%h\n")
unzip $FOCUS/db.zip -d $FOCUS
This should create the directory software/miniforge3/envs/focus/lib/python3.13/site-packages/focus_app/db/ with two files inside of it.
Now we can run focus on our data:
#!/bin/bash
#SBATCH --job-name=focus
#SBATCH -o focus-%j.out
#SBATCH -e focus-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate focus
focus -q not_human/ -o focus -t 16
SUPER-FOCUS
We are going to assess the functions using SUPER-FOCUS
We are going to make another mamba environment for super-focus:
mamba create -n superfocus -c bioconda super-focus mmseqs2
Now we can run super-focus on our data. Note: Superfocus creates a lot of data, and you will likely get an error if you just output the results to your home directory. In this command, we put the results somewhere else!
#!/bin/bash
#SBATCH --job-name=superfocus
#SBATCH -o superfocus-%j.out
#SBATCH -e superfocus-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate superfocus
export SUPERFOCUS_DB=/scratch/courses01/superfocus/
superfocus -q not_human/ -dir /scratch/courses01/$USER/superfocus -a mmseqs -t 16 -db DB_95
Recompressing the files.
Now that we are done with focus and superfocus, we can recompress the files. Question: Why should we compress the files (or not compress them)?
cd not_human
find . -type f -exec gzip {} \;
cd ..
Assembling the sequences
Note: Assembling may take a while, and for the workshops, Rob has already assembled the sequences. We may, however, assemble some of them depending on computational resources!
We will assemble with megahit.
We need to create a mamba environment for megahit … how are you going to do that?
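One way to do it, following the same pattern as before (the environment name megahit matches the conda activate line in the script below):
mamba create -n megahit -c bioconda megahit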
#!/bin/bash
#SBATCH --job-name=megahit
#SBATCH -o megahit-%j.out
#SBATCH -e megahit-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate megahit
mkdir -p megahit_assembled/
megahit -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o megahit_assembled/788707_20180129_S -t 16
This generates a contig file called final.contigs.fa in the output directory.
Questions:
- How long is the longest sequence?
- How long is the shortest sequence?
- What are the N50, N75, and auN?
Counting the lengths of the sequences
We can use countfasta.py to count the lengths of the sequences:
python countfasta.py -f megahit_assembled/788707_20180129_S/final.contigs.fa
Making Sankey Plots
We can use the countfasta.py and the related countfastq.py scripts to make beautiful Sankey plots.
Open a text file, and by counting the sequences and the contigs, create some data that looks like:
fastq [1176881906] fastp
fastq [1425302] low quality
fastp [672903626] human
fastp [515964090] not human
not human [429542979] sequence similarity
not human [86421111] unknown
sequence similarity [7100783] Eukaryote
sequence similarity [330717020] Bacteria
sequence similarity [71361] Archaea
sequence similarity [1029932] Virus
sequence similarity [89215792] Multiclass
Then, we use the awesome sankeymatic to make our Sankey plots!
Mapping the reads to the contigs
For this patient, we sequenced several different samples. You can click on the links to get each of the samples
Sample | R1 or R2 | Number of sequences | Total length | Shortest | Longest | N50 | N75 | auN |
---|---|---|---|---|---|---|---|---|
788707_20171213_S | R1 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20171213_S | R2 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20180129_S | R1 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20180129_S | R2 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20180313_S | R2 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20180313_S | R1 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20181126_S | R1 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
788707_20181126_S | R2 | 125,000 | 37,500,000 | 300 | 300 | 300 | 300 | 300 |
BEFORE WE GO ON!
We are running out of space on /home, where we logged in, so now we are going to work on /scratch.
Here’s the hack to make life easy for you:
ln -s /scratch/courses01/$USER scratch
Now you will see a directory called scratch. Moving forwards, we are going to do all our work in there.
cd scratch
(Note the difference between /scratch and scratch.)
Cross-assembly
We are going to run a cross assembly on this data to get more contigs. I’ve staged the data on /scratch/courses01/cf_data.
Here is the code that we need to run this assembly
ALLR1=""; ALLR2="";
for R1 in $(find /scratch/courses01/cf_data/ -name \*R1\*); do
R2=${R1/R1/R2};
ALLR1+="$R1,";
ALLR2+="$R2,";
done;
ALLR1=$(echo $ALLR1 | sed -e 's/,$//');
ALLR2=$(echo $ALLR2 | sed -e 's/,$//');
megahit -1 $ALLR1 -2 $ALLR2 -o megahit_assembly -t 16
Note:
- Let’s walk through this code and see what it does!
- You need some slurm “directives” before you can start that code running.
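For example, a header along the same lines as the earlier scripts (the job name here is just a suggestion), followed by activating the megahit environment, should work:
#!/bin/bash
#SBATCH --job-name=cross_assembly
#SBATCH -o cross_assembly-%j.out
#SBATCH -e cross_assembly-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate megahit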
Map the reads to all the contigs
We are going to use minimap2, like we did before. However, here is a little bit of code that can run minimap2 on all of the samples!
READDIR=/scratch/courses01/cf_data/
mkdir -p bam_contigs
for R1 in $(find $READDIR -name \*R1\* -printf "%f\n"); do
R2=${R1/R1/R2};
BAM=${R1/_R1.fastq.gz/.contigs.bam};
minimap2 --split-prefix=tmp$$ -t 8 -a -xsr megahit_assembled/cross_assembly/final.contigs.fa $READDIR/$R1 $READDIR/$R2 | samtools view -bh | samtools sort -o bam_contigs/$BAM;
done
find bam_contigs -type f -exec samtools index {} \;
Generating a depth profile
Now we can generate a depth profile for each contig in each of the four samples. Before we do this, take a look at the output from samtools coverage from one .bam file!
samtools coverage bam_contigs/788707_20171213_S.contigs.bam | less
Now we iterate over all the files and get the first column, the contig name, and the 7th column which has the mean depth for that contig.
mkdir bam_contigs_tsv
for BAM in $(find bam_contigs -type f -name \*bam -printf "%f\n"); do
OUT=${BAM/.contigs.bam/.tsv};
samtools coverage bam_contigs/$BAM | cut -f 1,7 > bam_contigs_tsv/$OUT;
done
(we might need a join step here! panic!! … here is the code to join lists)
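If you need it, here is a minimal sketch in Python (pandas) that joins the per-sample TSVs into a single contig-by-sample table; the output name 788707_20180129_S_coverage.tsv is just an assumption to match the Colab example below:
# join_coverage.py -- a sketch: merge the per-sample mean-depth TSVs into one table
import glob
import os
import pandas as pd

frames = []
for tsv in sorted(glob.glob("bam_contigs_tsv/*.tsv")):
    sample = os.path.basename(tsv)                      # e.g. 788707_20171213_S.tsv
    depth = pd.read_csv(tsv, sep="\t", index_col=0)     # columns: #rname, meandepth
    depth.columns = [sample]                            # name the depth column after the sample file
    frames.append(depth)

coverage = pd.concat(frames, axis=1).fillna(0)          # join on contig name
coverage.index.name = "contigs"
coverage.to_csv("788707_20180129_S_coverage.tsv", sep="\t")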
Count the lengths of each of the contigs in the assembly.
Later, we are going to use the lengths of the contigs in the assembly. We can use our countfasta.py program to do that.
python ~/countfasta.py -n -l -f megahit_assembly/final.contigs.fa > final.contigs.lengths.tsv
Identifying contigs that belong to the same genome
We are going to move the data to Google Colab to analyse the data and identify contigs that co-occur across multiple samples.
You can copy and paste these commands into Google Colab, and run this notebook to identify which contigs might belong together, e.g. come from the same genome.
Step 1. Import some libraries
import os
import sys
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
If you have the data file locally, you can upload it to colab. Otherwise, you can connect Google drive and read the file from there!
# df = pd.read_csv('788707_20180129_S_coverage.tsv', sep="\t", index_col=0)
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('drive/MyDrive/Workshops/788707_20180129_S_coverage.tsv', sep="\t", index_col=0)
df
If Rob forgets the part where you change the name of the contigs and remove the .tsv from the samples (again), these two lines will help:
df.index.name = 'contigs'
df.columns = df.columns.str.replace('.tsv', '', regex=False)
Read the lengths of the sequences
seqlengths = pd.read_csv('drive/MyDrive/Workshops/final.contigs.lengths.tsv', sep="\t", index_col=0, header=None, names=['contig', 'length'])
seqlengths
Filter the reads
In this example, we filter this data set to ensure that the sample 788707_20171213_S has at least one hit.
dfs = df[df['788707_20171213_S.tsv'] > 0]
dfs
We are going to reshape this data frame so we can plot the samples on the x-axis and the depth on the y-axis.
melted_df = dfs.reset_index().melt(id_vars='contigs', var_name='Sample', value_name='Depth')
melted_df
And now we plot all the raw data.
sns.lineplot(data=melted_df, x='Sample', y='Depth', hue='contigs', legend=False)
Filter for longer contigs
We filter our data frame so that we only consider contigs > 1000 bp. You can change this limit if you wish.
minlength = 1000
longcontigs = seqlengths[seqlengths['length'] > minlength].index
dfs = dfs[dfs.index.isin(longcontigs)]
Calculate the correlations
Now that we have the contigs and their average depth across the samples, we calculate a pairwise correlation between all contigs and all other contigs.
Note that the correlation can be either between contigs or between samples. Here, we use .T to transpose the data frame because we want to correlate contigs.
correlation_matrix = dfs.T.corr()
correlation_matrix
Find the highest correlations
We filter all the correlations to find the highest correlations. I chose the 0.99 cutoff somewhat at random, and adjusting the cutoff may adjust the number of contigs in each correlation.
We just print the first ten correlations so we can see what we have!
threshold = 0.99
high_corr = []
for i in range(len(correlation_matrix)):
    for j in range(i + 1, len(correlation_matrix)):  # Avoid duplicate pairs
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            high_corr.append((correlation_matrix.index[i], correlation_matrix.index[j], correlation_matrix.iloc[i, j]))
high_corr[0:10]
Find the longest contigs
We find the two longest contigs that are in our correlation matrix so we can plot their data.
The high_corr result is a list of tuples, but we want to get a set of IDs so that we only have unique IDs to check.
Next, we make an array of those sequence lengths that are in the set, and we sort the set by length, so we can see how long they are.
hc = set()
for i in high_corr:
    hc.add(i[0])
seqlengths[seqlengths.index.isin(hc)].sort_values(by='length')
Find the contig that strongly negatively correlates with our long contig.
When we calculated the correlations we used the abs() function, and so we have both positive and negative correlations.
- What do negative correlations mean?
- Are negative correlations ecologically meaningful?
Let's print the list of correlations from lowest to highest:
subset_corr = correlation_matrix.loc['k141_12474', list(hc)]
subset_corr.sort_values(ascending=True)
Now that we have two contigs, our long contig of interest and one that it is negatively related to, let's find their friends. In this code, adjust the 0.99 or 0.9 so that both positively_related and negatively_related have a “few” contigs each.
positively_related = correlation_matrix[correlation_matrix['k141_84018'] > 0.99].index
print(positively_related)
negatively_related = correlation_matrix[correlation_matrix['k141_37391'] > 0.9].index
print(negatively_related)
Now, we plot the data just like we did before. Here, I have only plotted the lines, but you can use hue to plot the individual lines in the data.
dfsubset1 = df[df.index.isin(positively_related)]
dfsubset2 = df[df.index.isin(negatively_related)]
melted_df1 = dfsubset1.reset_index().melt(id_vars='contigs', var_name='Sample', value_name='Depth')
melted_df2 = dfsubset2.reset_index().melt(id_vars='contigs', var_name='Sample', value_name='Depth')
fig, ax = plt.subplots(figsize=(8, 6))
sns.lineplot(data=melted_df1, x='Sample', y='Depth', c='b', estimator=None, units='contigs', alpha=0.5, legend=False, ax=ax)
sns.lineplot(data=melted_df2, x='Sample', y='Depth', c='r', estimator=None, units='contigs', alpha=0.5, legend=False, ax=ax)
Questions:
- What other analyses should we do on these contigs?
- This is only analysing a subset of the reads from one of the patients. What do you think will happen when we try and analyse all the reads from all the patients?
Viral Analysis With Hecatomb
We have already run the hecatomb pipeline for you, but you need to download the data to your computer before we can analyse the files.
There are three files: bigtable.tsv.gz, CF_Metadata_Table-2023-03-23.tsv.gz, and VMR_MSL39_v1.ascii.tsv.gz.
Start a Google Colab Notebook
Connect to Google Colab
Load the data into pandas dataframes:
import os
import sys
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.stats.api as sms
import numpy as np
import pandas as pd
import seaborn as sns
Copy the files into your Jupyter notebooks and then find the files
os.chdir('Data')
os.listdir()
Read the data into pandas data frames
data = pd.read_csv('bigtable.tsv.gz',compression='gzip',header=0,sep='\t')
# without downloading - copy link address from above
# data = pd.read_csv('https://github.com/linsalrob/CF_Data_Analysis/raw/main/hecatomb/bigtable.tsv.gz?download=',compression='gzip',header=0,sep='\t')
metadata = pd.read_csv('CF_Metadata_Table-2023-03-23.tsv.gz',compression='gzip',header=0,sep='\t')
vmr = pd.read_csv('VMR_MSL39_v1.ascii.tsv.gz', compression='gzip',header=0,sep='\t')
Take some time to look at the tables.
How many columns are there in each? How many rows? What are the data types in the rows and columns?
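For example, you could start with something like:
data.shape
data.dtypes
metadata.head()
vmr.head()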
First, filter the bigtable for just the Viruses
viruses = data[(data.kingdom == "Viruses")]
Group them by family, alignment type, alignment length, and percent identity
virusesGroup = viruses.groupby(by=['family','alnType','alnlen','pident'], as_index=False).count()
Now use seaborn to make some pretty plots!
First, we start by creating a style that we like.
#styling
sizeScatter = 10 * virusesGroup['count']
sns.set_style("darkgrid")
sns.set_palette("colorblind")
sns.set(rc={'figure.figsize':(12,8)})
# and now plot the data
g = sns.FacetGrid(virusesGroup, col="family", col_wrap=10)
g.map_dataframe(sns.scatterplot, "alnlen", "pident", alpha=.1, hue="alnType", sizes=(100,500), size=sizeScatter)
plt.legend(bbox_to_anchor=(5.0,1), loc=0, borderaxespad=2,ncol=6, shadow=True, labelspacing=1.5, borderpad=1.5)
plt.show()
That’s a lot of data!
Restrict this plot to just E value < 1e-20
virusesFiltered = viruses[viruses.evalue<1e-20]
virusesGroup = virusesFiltered.groupby(by=['family','alnType','alnlen','pident'], as_index=False).count()
sizeScatter = 10 * virusesGroup['count']  # recompute the point sizes for the filtered groups
g = sns.FacetGrid(virusesGroup, col="family", col_wrap=10)
g.map_dataframe(sns.scatterplot, "alnlen", "pident", alpha=.1, hue="alnType", sizes=(100,500), size=sizeScatter)
for ax in g.axes.flat:
    ax.tick_params(axis='both', labelleft=True, labelbottom=True)
    ax.axhline(y=75, c='red', linestyle='dashed', label="_horizontal")
    ax.axvline(x=150, c='red', linestyle='dashed', label="_vertical")
plt.legend(bbox_to_anchor=(5.0,1), loc=0, borderaxespad=2,ncol=6, shadow=True, labelspacing=1.5, borderpad=1.5)
plt.show()
More detailed analysis
There are two proteins that are bogus. We don’t know why, so we just ignore them. For the next steps, we also only look at the amino acid hits, not the nucleotide hits.
data = pd.read_csv('bigtable.tsv.gz',compression='gzip',header=0,sep='\t')
blacklist = ['A0A097ZRK1', 'G0W2I5']
virusesFiltered = data[(data.alnType == "aa") & (data.kingdom == "Viruses") & (~data.targetID.isin(blacklist) & (data.evalue < 1e-20))]
Patient and Sample Date
Now we create new columns for the patient and the sample date
virusesFiltered[['patient', 'date', 'Sputum or BAL']] = virusesFiltered['sampleID'].str.split('_', expand=True)
Merge the real data and the host data
virusesFiltHost = pd.merge(virusesFiltered, vmr[['Family', 'Host source']], left_on="family", right_on="Family", how='left')
Take a look at this new table. What have we accomplished? What was done? How did it work?
Correct more errors
There are two families that are incorrectly annotated in this table. Let’s just manually correct them:
really_bacterial = ['unclassified Caudoviricetes family', 'unclassified Crassvirales family']
virusesFiltHost.loc[(virusesFiltHost['family'].isin(really_bacterial)), 'Host source'] = 'bacteria'
Group-wise data
Now let's look at some data, and put that into a group. We'll also remember what we've looked at.
to_remove = []
bacterial_viruses = virusesFiltHost[(virusesFiltHost['Host source'] == 'bacteria')]
to_remove.append('bacteria')
Can you make a plot of just the bacterial viruses like we did?
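One possible approach, reusing the grouping and FacetGrid pattern from above:
bacterialGroup = bacterial_viruses.groupby(by=['family','alnlen','pident'], as_index=False).count()
sizeScatter = 10 * bacterialGroup['count']
g = sns.FacetGrid(bacterialGroup, col="family", col_wrap=6)
g.map_dataframe(sns.scatterplot, "alnlen", "pident", alpha=.5, sizes=(100,500), size=sizeScatter)
plt.show()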
How many reads map per group, and what do we know about them
host_source = "archaea"
rds = virusesFiltHost[(virusesFiltHost['Host source'] == host_source)].shape[0]
sps = virusesFiltHost[(virusesFiltHost['Host source'] == host_source)].species.unique()
fams = virusesFiltHost[(virusesFiltHost['Host source'] == host_source)].family.unique()
print(f"There are {rds} reads that map to {host_source} viruses, and they belong to {len(sps)} species ", end="")
if len(sps) < 5:
    spsstr = "; ".join(sps)
    print(f"({spsstr}) ", end="")
print(f"and {len(fams)} families: {fams}")
to_remove.append(host_source)
Can you repeat this for each of plants (or plants (S)), algae, protists, invertebrates, and fungi?
Filter out things we don’t want
Once we remove the above, what’s left?
virusesFiltHost = virusesFiltHost[~virusesFiltHost['Host source'].isin(to_remove)]
virusesFiltHost['Host source'].unique()
Now look at what we have
#filter
virusesGroup = virusesFiltHost.groupby(by=['family','alnlen','pident'], as_index=False).count()
#styling
sizeScatter = 10 * virusesGroup['count']
sns.set_style("darkgrid")
sns.set_palette("colorblind")
sns.set(rc={'figure.figsize':(20,10)})
g = sns.FacetGrid(virusesGroup, col="family", col_wrap=6)
g.map_dataframe(sns.scatterplot, "alnlen", "pident", alpha=.8, sizes=(100,500), size=sizeScatter)
for ax in g.axes.flat:
    ax.tick_params(axis='both', labelleft=True, labelbottom=True)
    ax.axhline(y=80, c='red', linestyle='dashed', label="_horizontal")
    ax.axvline(x=150, c='red', linestyle='dashed', label="_vertical")
g.fig.subplots_adjust(hspace=0.4)
g.set_axis_labels("Alignment length", "Percent Identity")
plt.legend(bbox_to_anchor=(5.0,1), loc=0, borderaxespad=2,ncol=6, shadow=True, labelspacing=1.5, borderpad=1.5)
plt.show()
Now look at each family separately
virusesFiltHost[virusesFiltHost.family == 'Retroviridae']