Know Thyself: Using Data Science to Explore Your Own Genome

DNA analysis with pandas and Selenium

Lora Johns, May 29

“Nosce te ipsum” (“know thyself”), a well-known ancient maxim frequently associated with anatomical knowledge.

Image from the University of Cambridge

23andMe once offered me a free DNA and ancestry test kit if I participated in one of their clinical studies.

In exchange for a cheek swab and baring my guts and soul in a score of questionnaires, I got my genome sequenced and gained access to myriad reports on where my ancestors were likely from, whom else on the site I might be related to, and what health conditions and traits I probably have inherited.

Seriously?

23andMe already provides an overwhelming number of consumer-ready infographics and tools, but I knew I could do more with the data.

The intrepid may download their raw genetic data if they dare, so of course I poured it into pandas to see what I could make of it.

Looking at the .txt file, I could see that I was missing some genotype values, denoted with '--'.

Most of the chromosomes are ints, but three are X, Y, and MT (for ‘mitochondrial’).

I needed to specify the data type properly so that pandas wouldn’t throw an error when it found mixed data in the input.

The other columns were fairly straightforward.

I also wanted pandas to ignore the prefatory comments at the beginning of the file that consisted of lines beginning with an octothorpe.

The arguments I needed to pass, therefore, were:

separator (tab-delimited)
dtype (as a dict)
na_values ('--') (n.b.: I decided against this in the end to avoid dealing with more NaNs)
comment ('#')

A quick note on the column names: rsid stands for Reference SNP cluster ID.

It identifies unique SNPs.

SNPs are Single Nucleotide Polymorphisms (‘snips’), locations in the genome that vary between individuals.

They can influence disease risk and drug effects, tell you about your ancestry, and predict aspects of how you look and act.

All humans have almost the same sequence of 3 billion DNA bases (A, C, G, or T) distributed among their 23 pairs of chromosomes.

But at certain locations, some differences exist that researchers have declared meaningful, for medical or other reasons (like genealogy).
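The loading step described above might look something like the sketch below. It is not the original notebook code: the column names other than rsid, the invented sample rows, and the load_genome helper are all assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the raw export: invented rows, not real genetic data.
SAMPLE = (
    "# prefatory comment line\n"
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\tX\t2000\t--\n"
    "rs0000003\tMT\t3000\tG\n"
)

# Assumed column names; only "rsid" is named in the text.
COLUMNS = ["rsid", "chromosome", "position", "genotype"]


def load_genome(source) -> pd.DataFrame:
    """Read a tab-delimited raw-data file, skipping '#' comment lines."""
    return pd.read_csv(
        source,
        sep="\t",
        comment="#",        # ignore the prefatory comment lines
        header=None,
        names=COLUMNS,
        # Chromosome is read as text because it mixes numbers with X, Y, MT.
        dtype={"rsid": str, "chromosome": str, "position": int, "genotype": str},
        # na_values="--",   # the option that was considered and then dropped
    )


df = load_genome(io.StringIO(SAMPLE))
print(df)
```

Reading the chromosome column as text up front avoids the mixed-type problem; the missing genotypes simply stay as the literal string '--'.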

I started to navigate my new DataFrame with basic exploratory data analysis and data cleaning.

I converted the letter chromosomes to numbers, cast them to ints, and created a dictionary to translate them back later so that I could better manipulate the data.
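As a rough sketch of that conversion: the numeric codes chosen for X, Y, and MT (23, 24, 25) and the helper names are assumptions, since the text does not say which numbers were used.

```python
import pandas as pd

# Invented example rows; chromosome labels arrive as text.
df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4", "rs5"],
        "chromosome": ["1", "22", "X", "Y", "MT"],
    }
)

# Assumed numeric codes for the letter chromosomes.
LETTER_TO_NUMBER = {"X": 23, "Y": 24, "MT": 25}
# Reverse lookup for translating back later.
NUMBER_TO_LETTER = {number: letter for letter, number in LETTER_TO_NUMBER.items()}


def chromosome_to_int(label: str) -> int:
    """Map 'X', 'Y', 'MT' to their codes and parse everything else as an int."""
    return LETTER_TO_NUMBER[label] if label in LETTER_TO_NUMBER else int(label)


def chromosome_label(number: int) -> str:
    """Translate a numeric code back to its original label."""
    return NUMBER_TO_LETTER.get(number, str(number))


df["chromosome"] = df["chromosome"].map(chromosome_to_int).astype(int)
print(df["chromosome"].tolist())
print([chromosome_label(n) for n in df["chromosome"]])
```

With an integer column, sorting and grouping by chromosome behave numerically, and chromosome_label restores the familiar names for display.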

Some visualizations

RSIDs per chromosome

Getting data on SNPs from SNPedia

To acquire more information about my DNA, I pulled files from SNPedia, a wiki investigating human genetics that gathers extensive data and cites peer-reviewed scientific publications.

SNPedia catalogues common, reproducible SNPs (or ones found in meta-analyses or studies of at least 500 patients), or those with other historic or medical significance.

The columns are:

Unnamed: 0 (actually the SNP name)
Magnitude (a subjective measure of interest)
Repute (a subjective measure of whether the genotype is “good” or “bad” to have based on research, and blank for things like ancestry and eye color)
Summary (a narrative description)

Fun with regular expressions

To align with my original DataFrame, I created a genotype column and used regex to separate out the genotype, which was stitched onto the end of the SNP name.

For consistency’s sake, I renamed the columns to match my original DataFrame and made sure the rsids were all lower-case.

I used regex to clean up the rsid a little more, too (because I will take any excuse to use more regex).

I overwrote the null reputes and summaries.
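The cleanup steps above could be sketched roughly as follows. The name format "Rs123(A;G)", the lower-cased column names, and the placeholder strings used for null values are assumptions for illustration; the sample rows are invented.

```python
import pandas as pd

# Invented rows shaped like the columns described above.
snp_df = pd.DataFrame(
    {
        "Unnamed: 0": ["Rs0000001(A;A)", "Rs0000002(C;T)", "rs0000003(G;G)"],
        "Magnitude": [0.0, 2.5, None],
        "Repute": ["Good", "Bad", None],
        "Summary": ["made-up summary", None, None],
    }
)

# Assumed pattern: SNP name, then the two alleles in parentheses split by ';'.
PATTERN = r"^(?P<rsid>[^(]+)\((?P<first>[^;)]*);(?P<second>[^)]*)\)$"

parts = snp_df["Unnamed: 0"].str.extract(PATTERN)
snp_df["genotype"] = parts["first"] + parts["second"]    # "(C;T)" becomes "CT"
snp_df["rsid"] = parts["rsid"].str.strip().str.lower()   # "Rs..." becomes "rs..."

# Drop the raw name column and lower-case the remaining names.
snp_df = snp_df.drop(columns="Unnamed: 0").rename(columns=str.lower)

# Overwrite null reputes and summaries with hypothetical placeholder values.
snp_df = snp_df.fillna({"repute": "Neutral", "summary": "No summary"})
print(snp_df)
```

Named groups keep the extraction readable: str.extract returns one column per group, so the alleles can be concatenated into the same two-letter form used in the raw data.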

Merging my data with SNPedia

Appropriately enough, I did an inner join of the SNPedia DataFrame on my DNA to see what data, if any, it had on my particular genotypes.
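A minimal sketch of such a join, assuming both frames share rsid and genotype columns (the frames and their contents here are invented):

```python
import pandas as pd

# Invented stand-ins for the two DataFrames.
my_dna = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3"],
        "chromosome": [1, 2, 23],
        "genotype": ["AA", "CT", "GG"],
    }
)
snpedia = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs2", "rs9"],
        "genotype": ["AA", "CC", "CT", "TT"],
        "magnitude": [0.0, 1.5, 2.1, 3.0],
        "repute": ["Good", "Good", "Bad", "Bad"],
    }
)

# Keep only rows where both the SNP and the genotype match.
merged = snpedia.merge(my_dna, how="inner", on=["rsid", "genotype"])
print(merged)
```

Joining on both keys matters: matching on rsid alone would also pull in SNPedia entries for genotypes the person does not carry.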

What’s hiding in there?

I have plenty of “good” genotypes, but none with a nonzero magnitude.

I have three “bad” genotypes with a nonzero magnitude.

Sadly, I had no “interesting” genotypes above the threshold of 4, although hearteningly I did possess some slightly interesting bad ones.
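Queries like these come down to boolean masks on the merged DataFrame. The sketch below uses invented rows and assumes the repute labels are spelled "Good" and "Bad" and that the columns are named repute and magnitude.

```python
import pandas as pd

# Invented merged data for illustration.
merged = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "repute": ["Good", "Good", "Bad", "Neutral"],
        "magnitude": [0.0, 0.0, 2.1, 1.0],
    }
)

good = merged[merged["repute"] == "Good"]
good_nonzero = good[good["magnitude"] > 0]
bad_nonzero = merged[(merged["repute"] == "Bad") & (merged["magnitude"] > 0)]
interesting = merged[merged["magnitude"] > 4]   # threshold of 4 from the text

print(len(good), len(good_nonzero), len(bad_nonzero), len(interesting))
```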

Scrape relevant articles with Selenium

I decided I might like to read up on my bad genetics, so I used Selenium to scrape the abstracts of some scientific papers from PubMed.
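One way such a scraper could be put together is sketched below. It is an illustration under stated assumptions rather than the original code: the search URL format, the two CSS selectors, and both helper functions are hypothetical, and PubMed's markup may differ or change. Running fetch_first_abstract needs Selenium and a local Chrome installation.

```python
from urllib.parse import quote_plus

PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/"

# Assumed selectors for PubMed's page markup; they may need updating.
RESULT_LINK = "a.docsum-title"
ABSTRACT_BOX = "div.abstract-content"


def pubmed_search_url(term: str) -> str:
    """Build a PubMed search URL for a term such as an rsid."""
    return f"{PUBMED_URL}?term={quote_plus(term)}"


def fetch_first_abstract(term: str, timeout: int = 10) -> dict:
    """Open the first PubMed hit for `term`; return its URL and abstract text."""
    # Imported lazily so pubmed_search_url() works without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")   # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(pubmed_search_url(term))
        links = driver.find_elements(By.CSS_SELECTOR, RESULT_LINK)
        if links:  # a result list was shown, so follow the first hit
            driver.get(links[0].get_attribute("href"))
        try:
            box = WebDriverWait(driver, timeout).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, ABSTRACT_BOX))
            )
            abstract = box.text.strip()
        except TimeoutException:
            abstract = ""  # no abstract found, or the selector is out of date
        return {"term": term, "url": driver.current_url, "abstract": abstract}
    finally:
        driver.quit()  # always release the browser


print(pubmed_search_url("rs0000002"))
```

Calling fetch_first_abstract once per rsid and collecting the returned dicts in a list gives something that pandas can turn straight into a DataFrame. An explicit wait is used instead of a fixed sleep so the scraper moves on as soon as the abstract appears.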

For later hypochondriacal perusal, I exported my findings, complete with abstracts and hyperlinks, to a CSV file using the pandas DataFrame.to_csv method.
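For reference, a small sketch of that export. The column names, the placeholder row, and the filename in the comment are hypothetical.

```python
import pandas as pd

# Invented findings table; the abstract is placeholder text, not a real paper.
results = pd.DataFrame(
    {
        "rsid": ["rs0000002"],
        "genotype": ["CT"],
        "abstract": ["Placeholder abstract text, not a real paper."],
        "url": ["https://pubmed.ncbi.nlm.nih.gov/"],
    }
)

# With no path given, to_csv returns the CSV as a string.
csv_text = results.to_csv(index=False)
print(csv_text)

# Writing to disk instead (hypothetical filename):
# results.to_csv("my_findings.csv", index=False)
```

Passing index=False leaves out the row index, and fields that contain commas (as abstracts usually do) are quoted automatically.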

Reading up on the medical literature

Now I have a handy CSV file, nicely formatted, with citations to scientific articles analyzing and describing my probably un-problematic, but probationally proditory genotypes.

Python provides prodigious tools to engage in literal introspection that the sawbones of old could never have imagined.

Originally published at https://www.


