Comparing and Matching Column Values in Different Excel Files using Pandas

Comparing and Matching Column Values in Different Excel Files using PandasStephen FordhamBlockedUnblockFollowFollowingMay 3Pandas for column matchingOften, we may want to compare column values in different Excel files against one another to search for matches and/or similarity.

Using the Pandas library from Python, this is made an easy task.

To demonstrate how this is possible, this tutorial will focus on a simple genetic example.

No genetic knowledge is required!Consider the following situation; I have some unknown samples of DNA sequences (simple strings composed of the letters A,T,G and C.

For each of these unknown DNA sequences I assign a unique identifier.

To make things simple, the first unknown sequence will be identified as 1, the second as 2 and so forth, as depicted below.

In this somewhat contrived example, I want to compare my unknown sequences against known references sequences.

In short, my question is, do any of the ‘Unknown_sample_no’ match against my ‘Reference_sequences_ID’?.If they do, which sequence(s) do they match up with.

To begin answering this question, I first import the pandas module, and read my corresponding files as csv files.

Note: I saved these files in Excel as comma separated value files (csv files), and used the read_csv() function to parse them.

This is not strictly necessary, but a working habit I prefer.

You could alternatively leave your Excel files with the native .

xlsx extension, and use the pandas.

read_excel() function to save a step here.

The query sequences DataFrame were assigned the variable A, and the sequence references DataFrame were assigned the variable B.

To validate things are working, I parse the DataFrames checking specifically for their columns.

I then convert both columns in both DataFrames into python lists.

To do this I use the .

tolist() method on a specified column of a particular DataFrame.

By way of an example, the column ‘Unknown_sample_no’ in DataFrame A is converted to a list.

I perform this step for each of the other three columns spread across the two Excel files as shown in the code snippet below.

I want to keep the association between the ‘Unknown_sample_no’ and its corresponding ‘Unknown_sample_seq’.

Furthermore, I also want to keep the association between the ‘Reference_sequences_ID’ and their corresponding ‘Reference_sequences’.

Python offers an easy way to maintain this association.

I convert my lists into 2 dictionaries.

I use the zip function to join the lists and then use the dict function to convert them in appropriately assigned dictionaries.

To confirm the dictionaries have been made correctly, I run the script in my terminal.

A brief validation check informs me that things are workings as intended.

From the Ref_dict for example, the ‘Reference_sequences_ID’ keys match up correctly to their corresponding ‘Reference_sequences’ values: ‘{‘A’: ‘AAAAGCGCGAGGGGGGA’, ‘K’: ‘GGGAGAGAGGG’, ‘Y’: ‘CGGAGCGTTT’….

}I can now compare my 2 dictionaries against one another.

To do so, I use a for loop to iterate though my ‘Unknown_dict’ and a nested for loop to iterate through the Ref_dict.

At each iteration, I want to know whether the sequence in the Unknown_dict matches to any of the sequences in the Ref_dict (some will, I deliberately included 8 matches).

To check for matches, I use the re.

search() function from the re module (regular expression module).

When there is a match, I think it would be useful to know, what sequence matched, where that match appeared in the DNA sequence, and most importantly what ‘Unknown_sample_no’ matches to which ‘Reference_sequences_ID’.

To make things explicitly clear, it would be ideal if all the matches were written to a single csv file which could be opened in Excel.

Ideally, the ambition is to make it clear to anyone who wants to interpret the data, which sequences match, what the matching sequence is and where did it match.

To achieve this goal, I write to a file titled, ‘seq_match_compare.

csv’, and voila, when I open the csv file, all the information I want appears in an interpretable way!I now know for example, that Query_ID: 1 corresponds to Ref_species A, and this match starts at position 4 in the sequence.

There are multiple ways to compare column values in 2 different excel files.

The approach here checks each sequence in the Unknown seq column from the first file with each sequence in the Reference_sequences column from the second file.

This way represents a simple way to match and compare, and offers great scalability if we want to analyse any number of samples!All made possible by using Pandas!.

. More details

Leave a Reply