Bioinformatics Programming with Biopython: Accessing NCBI Nucleotide Databases
Bee Guan Teo
Feb 5

The National Center for Biotechnology Information (NCBI) maintains a series of databases which store molecular and bibliographic data.
Since its inception in 1988, NCBI has attracted many researchers to access its molecular data (e.g. DNA sequence, RNA sequence, protein sequence) and the relevant bibliographic data.
One can access the data using Entrez, a data retrieval system that provides users access to NCBI’s databases.
Alternatively, one can use the Entrez Programming Utilities (known as E-utilities) to query the databases programmatically.
This programmatic approach can easily be done by using Biopython.
Biopython enables researchers to search the NCBI databases and download the records in a Python script.
In the following sections, I will offer a step-by-step guide to using Biopython to query an NCBI database.
Setting Up Biopython

Ensure either Python 2 or Python 3 is ready on your computer (Python 3 is recommended for following the steps in this guide).
If not, you can easily get the installer and follow the installation steps from here.
There are several ways to install Biopython.
However, I would recommend pip.
Just open Terminal (or Command Prompt in Windows) and type

    pip install biopython

Wait a few moments until the installation completes, and Biopython is set up on your machine.
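If you want to confirm that the installation worked before moving on, a quick check along these lines may help. This is only a minimal sketch: `is_installed` is a hypothetical helper name of my own, and the check relies on the fact that Biopython is imported under the package name `Bio`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Biopython is imported under the package name "Bio".
print("Biopython found:", is_installed("Bio"))
```

If this prints False, the pip command above most likely ran against a different Python installation than the one you are using now.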
Programming Editor

To start writing your Python code, Jupyter Lab is highly recommended, as it offers an interactive coding environment where you can do live coding and visualize the output of each section of your code.
For more info, you could check this out.
To install Jupyter Lab, open Terminal (or Command Prompt in Windows) and type

    pip install jupyterlab

In the same Terminal (or Command Prompt in Windows), type

    jupyter lab

The command above will launch Jupyter Lab in your default browser.

[Figure: Jupyter Lab Interface]

Next, choose Python 3 Notebook.

[Figure: Choose Python 3 Notebook]

[Figure: Rename the notebook file]

You are ready to start writing Python code in Jupyter Lab.
Start coding

In the first cell of the notebook, import the Entrez and SeqIO modules from Biopython:

    from Bio import Entrez
    from Bio import SeqIO

Next, create a new cell in the notebook and set an email parameter using one of your accounts.
This step is optional but recommended, as it enables NCBI to contact you if there is an issue with your usage of Entrez.
Biopython passes the address along with each request when it is assigned to Entrez.email.

    Entrez.email = "teobguan2004@gmail.com"

Now, let us presume we are looking for the “accD” gene from E. coli.
We can use Entrez.esearch for this purpose.
You have to set the “db” and “term” parameters of the esearch function.
“db” refers to the specific NCBI database where the query is made, whereas “term” denotes the query text, which is composed of the gene name and the organism.
We can also cap the number of returned record references with the optional “retmax” parameter.
In this case, we set it to 20.
We store the retrieved info in the “result_list” variable.
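To make the roles of “db”, “term” and “retmax” more concrete, here is a rough sketch of the kind of E-utilities request such a search corresponds to. It only builds the URL with the standard library and sends nothing. `build_esearch_url` is a hypothetical helper of my own, not part of Biopython, and the request Biopython actually sends carries additional fields (such as the email address), so treat this as an illustration rather than a description of the library's internals.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes the spaces, quotes and brackets in the query text.
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url(
    db="nucleotide",
    term='accD[Gene Name] AND "E. coli"[Organism]',
)
print(url)
```

With Biopython, the same search is a single call to Entrez.esearch, which also sends the request and hands back the response: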
    handle = Entrez.esearch(db='nucleotide', term='accD[Gene Name] AND "E. coli"[Organism]', retmax=20)
    result_list = Entrez.read(handle)
    handle.close()

You can view the retrieved sequence ids by accessing the “IdList” key of the returned result.
Only twenty sequence ids are returned.
The total number of matching sequence ids in the NCBI Nucleotide database is available under the “Count” key, which returned 6296 at the time of writing.
This means only 20 of the 6296 ids are stored in id_list.
If you wish to have more ids, you can adjust the “retmax” parameter in Entrez.esearch.
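Raising “retmax” is the simplest option. Another possibility, sketched below under my own assumptions, is to fetch the ids in several smaller requests using the “retstart” parameter of esearch, which sets the offset of the first id to return. `batch_starts` is a hypothetical helper name, and the count is passed as a string because that is how “Count” comes back from Entrez.read.

```python
def batch_starts(total_count, batch_size):
    """Return the retstart offsets needed to cover total_count ids."""
    total = int(total_count)  # 'Count' is returned as a string
    return list(range(0, total, batch_size))


starts = batch_starts("6296", 20)
print(len(starts))              # number of requests needed
print(starts[:3], starts[-1])   # first few offsets and the last one

# Each offset would then go into its own search, for example:
#   Entrez.esearch(db="nucleotide", term=query, retstart=start, retmax=20)
# where `query` stands for the same search text used earlier.
```

For the single batch retrieved here, reading the ids and the total count looks like this: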
    id_list = result_list['IdList']
    count = result_list['Count']
    print(id_list)
    print(".")
    print(count)

Now, it’s time to download the full sequence records from the NCBI database using Entrez.efetch.
I simply use one of the sequence ids, “1563118780”, as the id parameter of the efetch function.
You can also set the file type of the sequence record using the “rettype” parameter.
In my case, I set it to “gb”, which requests a GenBank file.
    handle2 = Entrez.efetch(db='nucleotide', id='1563118780', rettype='gb')

Finally, use the SeqIO.read function to read the sequence record from the file handle and assign it to a variable (e.g. seq_record).
You are then ready to access the sequence info through its attributes, such as id, name, description or seq.
    seq_record = SeqIO.read(handle2, 'gb')
    print(seq_record.id)
    print(seq_record.name)
    print(seq_record.description)
    print(repr(seq_record.seq))

[Figure: Sample Output]

This is a simple guide showing one example of how we can use Biopython for a very simple bioinformatics task.
Biopython offers many more functionalities, which I will demonstrate one by one in future posts.