
ComputationalGenomicsManual

Rob's manual for the computational genomics and bioinformatics class.

Hands-on with Bin Chicken

Bin Chicken installation and setup

See https://aroneys.github.io/binchicken for more details.

mamba create -n binchicken -c bioconda -c conda-forge 'binchicken>=0.12.5'
conda activate binchicken
binchicken build \
  --conda-prefix /storage/data/.conda \
  --singlem-metapackage /storage/data/metapackage \
  --checkm2-db /storage/data/checkm2

Single-sample assembly with multi-sample binning

Start by running binchicken single to prepare the data to assemble each sample individually.

binchicken single \
    --forward 788707_20171213_S_R1.fastq.gz 788707_20180129_S_R1.fastq.gz 788707_20180313_S_R1.fastq.gz 788707_20181126_S_R1.fastq.gz \
    --reverse 788707_20171213_S_R2.fastq.gz 788707_20180129_S_R2.fastq.gz 788707_20180313_S_R2.fastq.gz 788707_20181126_S_R2.fastq.gz \
    --output single_assembly
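Because the forward and reverse read files are listed by hand, a quick check that every forward file has a matching reverse file can catch typos before anything is run. The function below is a minimal sketch of such a check. `check_pairs` is a hypothetical helper, not part of Bin Chicken, and it assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming used in this tutorial.

```shell
# Hypothetical helper (not part of Bin Chicken): confirm that every
# *_R1.fastq.gz file in a directory has a matching *_R2.fastq.gz file.
check_pairs() {
    dir="${1:-.}"
    missing=0
    for r1 in "$dir"/*_R1.fastq.gz; do
        # If the glob matched nothing, the unexpanded pattern is returned
        [ -e "$r1" ] || { echo "no *_R1.fastq.gz files in $dir" >&2; return 1; }
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ -e "$r2" ]; then
            echo "paired: $(basename "${r1%_R1.fastq.gz}")"
        else
            echo "missing reverse reads for: $r1" >&2
            missing=1
        fi
    done
    # Exit status 0 only if every forward file had a partner
    return "$missing"
}
```

Running `check_pairs .` in the directory holding the reads prints one `paired:` line per sample and returns a non-zero status if any reverse file is missing.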

The suggested assemblies with their respective binning samples can be found at single_assembly/coassemble/target/elusive_clusters.tsv. In this case, only two of the samples are considered likely to recover genomes. These samples are 788707_20180313_S and 788707_20180129_S. The other samples are probably too small (they are heavily subsampled) to recover genomes.
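Wide tab-separated files such as elusive_clusters.tsv can be hard to read in a terminal. As a small sketch, the function below prints each row as a block of `column: value` lines. `show_tsv` is a hypothetical helper name, and it deliberately makes no assumption about which columns the file contains; it only assumes the first line is a header.

```shell
# Hypothetical helper: print a tab-separated file as one block per row,
# labelling each value with the column name taken from the header line.
show_tsv() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
        {
            printf "row %d\n", NR - 1
            for (i = 1; i <= NF; i++) printf "  %s: %s\n", header[i], $i
        }
    ' "$1"
}
```

For example, `show_tsv single_assembly/coassemble/target/elusive_clusters.tsv` lists each suggested assembly with its fields on separate lines.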

The actual assembly and binning can be run by adding --run-aviary. Note that with 1 core, the assemblies will take ~30 minutes each.

binchicken single \
    --forward 788707_20171213_S_R1.fastq.gz 788707_20180129_S_R1.fastq.gz 788707_20180313_S_R1.fastq.gz 788707_20181126_S_R1.fastq.gz \
    --reverse 788707_20171213_S_R2.fastq.gz 788707_20180129_S_R2.fastq.gz 788707_20180313_S_R2.fastq.gz 788707_20181126_S_R2.fastq.gz \
    --output single_assembly --run-aviary --cores 5

The assembly and binning outputs for each sample are found in single_assembly/coassemble/coassemble/. Each sample should have a folder containing an assemble directory for the assembly and a recover directory for the binning. The bins for each sample are found in recover/bins, with genome info at recover/bins/bin_info.tsv.

The recovered bins are likely only ~40-60% complete, with fairly high contamination. This is probably due to the small sample size, but the genomes could still be analysed further, e.g. with GTDB-Tk to determine their taxonomy.
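To see at a glance which samples produced bins, the layout described above can be walked with a short loop. This is a sketch only: `summarise_bins` is a hypothetical helper, and it assumes that bin_info.tsv has a single header line followed by one row per bin.

```shell
# Hypothetical helper: for each sample folder under the given directory,
# print the sample name and the number of data rows in recover/bins/bin_info.tsv.
summarise_bins() {
    base="${1:-single_assembly/coassemble/coassemble}"
    for sample_dir in "$base"/*/; do
        [ -d "$sample_dir" ] || continue
        name=$(basename "$sample_dir")
        info="${sample_dir}recover/bins/bin_info.tsv"
        if [ -f "$info" ]; then
            # Assumption: one header line, then one row per bin
            rows=$(($(wc -l < "$info") - 1))
            printf '%s\t%s\n' "$name" "$rows"
        else
            # Binning has not run (or has not finished) for this sample
            printf '%s\tNA\n' "$name"
        fi
    done
}
```

Called with no arguments, `summarise_bins` reads from single_assembly/coassemble/coassemble; pass a different directory as the first argument to inspect another run.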

Coassembly with multi-sample binning

Now that we have run single-sample assembly for the decent samples, we can run coassembly across the dataset.

binchicken coassemble \
    --forward 788707_20171213_S_R1.fastq.gz 788707_20180129_S_R1.fastq.gz 788707_20180313_S_R1.fastq.gz 788707_20181126_S_R1.fastq.gz \
    --reverse 788707_20171213_S_R2.fastq.gz 788707_20180129_S_R2.fastq.gz 788707_20180313_S_R2.fastq.gz 788707_20181126_S_R2.fastq.gz \
    --output coassembly --max-coassembly-samples 5

The suggested coassemblies with their respective binning samples can be found at coassembly/coassemble/target/elusive_clusters.tsv. Since they share single-copy marker genes, the samples 788707_20180129_S and 788707_20180313_S are suggested for coassembly.

We can run the actual coassembly and binning as before by adding --run-aviary. Note that with 1 core, the coassembly will take ~1 hour.

binchicken coassemble \
    --forward 788707_20171213_S_R1.fastq.gz 788707_20180129_S_R1.fastq.gz 788707_20180313_S_R1.fastq.gz 788707_20181126_S_R1.fastq.gz \
    --reverse 788707_20171213_S_R2.fastq.gz 788707_20180129_S_R2.fastq.gz 788707_20180313_S_R2.fastq.gz 788707_20181126_S_R2.fastq.gz \
    --output coassembly --max-coassembly-samples 5 --run-aviary --cores 5
