Using NCBI edirect to access data at NCBI

NCBI has developed the edirect command line application for interfacing with their system. We can use that to query NCBI from the command line.

For many of these examples, check out the EDirect Cookbook and the NCBI EDirect book

For all of the eutils, you can add the option -help to the command to get more information about that command. For example, efetch -help will give you lots of information about the efetch command.

For each of these commands you need to tell NCBI what your email address is. The AWS instances have a generic address of ec2-user@ip-xxx-xxx-xxx-xxx.us-east-2.compute.internal but obviously that is not meaningful. NCBI requires an email address so they can contact you if you write a script that goes off the rails and starts DOS attacking them (also, as I know from personal experience, they will simply block your IP so you can no longer access their services!)

NCBI Databases

Before we start, we need to know which databases are available. You can find a list of databases using einfo:

einfo -email xxx@sdsu.edu -dbs | sort

Note: you need to specify your own email address here!


annotinfo	gapplus	nlmcatalog	probe
assembly	gds	nuccore	protein
biocollections	gencoll	nucest	proteinclusters
bioproject	gene	nucgss	pubmed
biosample	genome	nucleotide	pubmedhealth
biosystems	geoprofiles	omim	seqannot
blastdbinfo	grasp	orgtrack	snp
books	gtr	pcassay	sparcle
cdd	homologene	pccompound	sra
clinvar	medgen	pcsubstance	structure
clone	mesh	pmc	taxonomy
dbvar	ncbisearch	popset	unigene
gap

We can find out more about any of these databases using the einfo command again:

einfo -email xxx@sdsu.edu -db assembly | less

However, this prints out XML, which is not very readable. Instead, eutils from NCBI comes with a program called xtract that converts XML into text. We can use that to extract the name and description associated with the fields in a database:

einfo -email xxx@sdsu.edu -db assembly -fields

einfo -email xxx@sdsu.edu -db assembly | xtract -pattern Field -element Name Description

You can do this with any of the databases listed above to see what fields those databases have. This is the output from the assembly database.

Name	Description
ALL	All terms from all searchable fields
UID	Unique number assigned to publication
FILT	Limits the records
ACCN	Chromosome accessions
ASAC	Space delimited assembly accessions w/ & w/o versions
ASLV	How assembled is this assembly. ‘Contig’ to ‘Chromosome’
TXID	Taxonomy ID
ORGN	Exploded organism names
RUID	Id of RefSeq Assembly.
GUID	Id of GenBank synonym of this Assembly.
UIDS	Pair-id, GB-id, and RS-id of this Assembly.
PROJ	Uid and accessions of this assembly’s projects
SAMP	BioSample Accession and Id
NAME	Assembly name
ALLN	All names, space separated
DESC	Assembly description
COV	Sequencing coverage
TYPE	Type of the assembly
SRDT	Date the most recent sequence went live in ID
UPDT	Date the assembly was last updated
LEN	Total length of chromosome/genome including bases and gaps divided by 1,000,000.
REPL	Number of chromosomes in assembly
PLAC	Number of placed scaffolds
UNLO	Number of unordered(unlocalized) scaffolds belonging to chromosomes
UNPL	Number of unplaced scaffolds which do not belong to any chromosome, ie ChrUn
CN50	Contig length at which 50% of total bases in assembly are in contigs of that length or greater
SN50	Scaffold length at which 50% of total bases in assembly are in contigs of that length or greater
CL50	Number of contigs that are greater than or equal to the N50 length.
SL50	Number of scaffolds that are greater than or equal to the N50 length.
CNTG	Number of contigs
UNGL	Total length excluding gaps in chromosome/genome divided by 1,000,000
PROP	Properties
SUBO	Organization that submitted this assembly
INFR	Infraspecific Name: breed, cultivar, strain, ecotype
ISOL	Isolate name
SEX	Sex
ASMM	Assembly Method
GCOV	Genome Coverage
TECH	Sequencing Technology
EXFV	Expected Final Version
RGAS	Reference Guided Assembly
SCAM	Single Cell Amplification
RCAT	RefSeq Category
FTYP	From Type Material
NFRS	Reasons assembly was excluded from RefSeq
GRLS	Date the GenBankassembly was first released
RRLS	Date the RefSeq assembly was first released
RTYP	Release Type
RLEN	Total length of chromosome/genome including bases and gaps

Downloading Genomes

Let’s start with a simple search. The command esearch allows us to search the databases. For example to search through the assembly database, we can use:

esearch -db assembly -query "Faecalibacterium prausnitzii[ORGN]"

This should give you an answer like:

<ENTREZ_DIRECT>
  <Db>assembly</Db>
  <WebEnv>NCID_1_27641186_130.14.18.34_9001_1536599806_648492441_0MetA0_S_MegaStore</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>49</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

The key field in this response is the <Count>49</Count> field – this shows you how many things match your query,

We can use efetch to get those matches to the query, and we use the document summary format of efetch to summarize the document.

esearch -db assembly -query "Faecalibacterium prausnitzii[ORGN]" | efetch -format docsum | less

This is still XML format, and so to extract specific elements from that output, we can use xtract again. Here, we are looking for elements that contain a link to the NCBI ftp site:

<FtpPath_GenBank>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/467/805/GCA_003467805.1_ASM346780v1</FtpPath_GenBank>
<FtpPath_RefSeq>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1</FtpPath_RefSeq>
<FtpPath_Assembly_rpt>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1/GCF_003467805.1_ASM346780v1_assembly_report.txt</FtpPath_Assembly_rpt>
<FtpPath_Stats_rpt>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1/GCF_003467805.1_ASM346780v1_assembly_stats.txt</FtpPath_Stats_rpt>

This is the path to different formats of the file We want the annotated sequences in [RefSeq](, and you can view that path in your browser by looking at this link: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1/.

For these entries, the RefSeq file will be called the same as the name of the directory, with “_genomic.fna.gz” appended.

Thus, for the above entry, the RefSeq nucleotide sequence file is: ` GCF_003467805.1_ASM346780v1_genomic.fna.gz

This xtract syntax will get just the URLs for all the entries:

xtract -pattern DocumentSummary -element FtpPath_RefSeq

Which gives us a list like:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/465/525/GCF_003465525.1_ASM346552v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/434/165/GCF_003434165.1_ASM343416v1
...

Note: I just trimmed this to the first three elements in the list

We can write a simple program using awk to append the directory name and _genomic.fna.gz on the end of these URLs:

esearch -db assembly -query "Faecalibacterium prausnitzii[ORGN]" | efetch -format docsum | \
xtract -pattern DocumentSummary -element FtpPath_RefSeq | \
awk -F"/" '{print $0"/"$NF"_genomic.fna.gz"}'

In this awk command, $0 is the thing that was piped to the command. $NF is the last element in the record and we have said to split with a / using the option -F"/".

So that command gives us this list (again trimmed to the first three entries):

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/467/805/GCF_003467805.1_ASM346780v1/GCF_003467805.1_ASM346780v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/465/525/GCF_003465525.1_ASM346552v1/GCF_003465525.1_ASM346552v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/434/165/GCF_003434165.1_ASM343416v1/GCF_003434165.1_ASM343416v1_genomic.fna.gz

There are several programs that you can use to download these files. One I recommend is wget as it is very straightforward. However, I also recommend using curl as it has a lot more options!

For example, we can modify the above command to use curl to get the files and save them with their original names:

esearch -db assembly -query "Faecalibacterium prausnitzii[ORGN]" | efetch -format docsum | \
xtract -pattern DocumentSummary -element FtpPath_RefSeq | \
awk -F"/" '{print "curl -o "$NF"_genomic.fna.gz " $0"/"$NF"_genomic.fna.gz"}'

Downloading Reference Databases

If you are working with shotgun metagenomic data you will eventually have to make sense of that data in a taxonomic and/or functional context. Therefore knowing how to access large amounts of reference databases quickly and efficiently is a must. Below are some examples of how you can accomplish this.

Downloading the NCBI Virus RefSeq Project Database

esearch -db bioproject -query "PRJNA485481" | elink -target nuccore | efetch -format fasta  > refseq_virus.fna

If you click on the NCBI virus link above you will see that accession number will be the query name you use to download the viral database which in this case is *PRJNA485481.

You could also download this same database using the database id:

esearch -db bioproject -query "485481[id]" | elink -target nuccore | efetch -format fasta > reseq_virus.fna

Also note that I wanted to get the nucleotide database so my link target is nuccore but you can easily change that to protein if that is your desired database like the example below.

Downloading the Bacterial Antimicrobial Resistance Reference Gene Database from NCBI

esearch -db bioproject -query "PRJNA313047" | elink -target protein | efetch -format fasta > ncbi_antibiotic_database.faa

Downloading all Bacterial genomes in RefSeq Your query can also be a list of filters allowing you to download only certain datasets from NCBI - in this case it is all bacterial nucleotide genomes found in the RefSeq database.

esearch  -db "nucleotide" -query "Bacteria[Organism] OR bacteria[All Fields] AND Refseq[Filter]" | efetch -format fasta 

ComputationalGenomicsManual

Robs manual for the computational genomics and bioinformatics class.

Using NCBI edirect to access data at NCBI

NCBI Databases

Downloading Genomes

Downloading Reference Databases