Exploratory Data Analysis Tutorial in Python

Christopher Bratkovics, Jun 16

Introduction

One of the most important skills every Data Scientist must master is the ability to explore data properly.
Thorough exploratory data analysis (EDA) is essential in order to ensure the integrity of your gathered data and performed analysis.
The example used in this tutorial is an exploratory analysis of historical SAT and ACT data to compare participation and performance between the SAT and ACT exams in different states.
By the end of this tutorial, we will have gained data-driven insight into potential issues regarding standardized testing in the United States.
The focus of this tutorial is to demonstrate the exploratory data analysis process, as well as provide an example for Python programmers who want to practice working with data.
For this analysis, I examined and manipulated available CSV data files containing data about the SAT and ACT for both 2017 and 2018 in a Jupyter Notebook.
Exploring data through well-constructed visualizations and descriptive statistics is a great way to become acquainted with the data you’re working with and formulate hypotheses based on your observations.
Exploratory Data Analysis (EDA) Objectives

1) Quickly describe a dataset: number of rows/columns, missing data, data types, preview.
2) Clean corrupted data: handle missing data, invalid data types, incorrect values.
3) Visualize data distributions: bar charts, histograms, box plots.
4) Calculate and visualize correlations (relationships) between variables: heat map.
Benefits of a Data-Driven Approach

It’s no secret that standardized testing procedures have been a controversial topic for many years now.
By conducting my own preliminary research, I discovered some evident issues with the SAT and ACT exams fairly quickly.
For example, depending on the state, students may be required to take only the SAT, only the ACT, both exams, an alternative standardized exam, or one standardized exam of their choosing.
This state-by-state variability in testing requirements should be considered a significant source of bias in exam records, such as participation rates and average performance, when comparing states.
Qualitative research is important, but taking a data-driven approach to support the claims (hypotheses) it produces is essential.
Taking a data-driven approach allows for validation of previously made claims/hypotheses, as well as development of new insights based on thorough examination and manipulation of the data.
Getting Started

Feel free to download my code and/or data to follow along with the tutorial from my GitHub: https://github.com/cbratkovics/sat_act_analysis

The first step in exploring data using Python is to make sure you import the proper libraries.
For this introduction, the libraries we will need are NumPy, Pandas, Matplotlib, and Seaborn.
When importing a library, you can assign it an alias to reduce the amount of typing required to use each library’s properties.
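The original code cells aren’t preserved in this copy of the article, but the imports under their conventional aliases look like this:

```python
# Conventional aliases for the four libraries used in this tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```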
Using the Pandas library, you can load a data file into a container object known as a data frame.
As the name suggests, this type of container is a frame that holds data read into it, here with the Pandas method pd.read_csv(), which is specific to CSV files.
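Loading a file might look like the sketch below; the file name is an assumption (check the repository for the real ones), and a small in-memory CSV stands in here so the snippet runs on its own:

```python
import pandas as pd
from io import StringIO

# In the tutorial each file is loaded directly, e.g.:
#   sat_17 = pd.read_csv('sat_2017.csv')   # file name assumed
# A small in-memory CSV stands in for one of those files here.
csv_text = (
    "State,Participation,Total\n"
    "Alabama,5%,1165\n"
    "Alaska,38%,1080\n"
)
sat_17 = pd.read_csv(StringIO(csv_text))
print(sat_17.head())
```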
Each CSV file is converted into a Pandas data frame object in the same way.

Exploring the Data & Cleaning Corrupted Data

Learning about the data you’re working with is essential when performing exploratory analysis.
Luckily, data frame objects have many useful attributes that make this easy.
A standard practice when comparing multiple datasets is to check the number of rows and columns in each data frame using the .shape attribute. Note that the number of rows is on the left and the number of columns is on the right: (rows, columns).
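For example, with two toy frames standing in for the real data:

```python
import pandas as pd

# Toy stand-ins for two of the tutorial's data frames
act_17 = pd.DataFrame({'State': ['Alabama', 'Alaska', 'National'],
                       'Participation': ['100%', '65%', '60%']})
sat_17 = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                       'Participation': ['5%', '38%']})

print(act_17.shape)  # (3, 2)
print(sat_17.shape)  # (2, 2)
```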
Our first problem with the data is inconsistent dimensions between the ACT 2017 and ACT 2018 data frames.
Let’s get a better look at the data using the .head() method, which displays the first five rows (the first five index values) of every column in a Pandas data frame.
I’ll use the 2018 ACT as an example.

After previewing the first five rows of the other data frames, we can infer that there is probably an issue with how states were input into the dataset.
Since the data should contain 51 entries (50 states plus the District of Columbia), there are most likely incorrect and/or duplicate values in the ACT 2017 and ACT 2018 ‘State’ columns.
However, we can’t be certain of this kind of inference when working with data.
We need to examine the data in question to identify the exact problem.
First, let’s check the values of the ‘State’ column in the 2018 ACT data using the .value_counts() method, which displays the number of times each value appears in a column, in descending order.

Notice that the value ‘Maine’ appears twice in the 2018 ACT data.
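A toy frame with a duplicated row shows how .value_counts() surfaces the problem:

```python
import pandas as pd

# Toy ACT 2018 frame with 'Maine' entered twice
act_18 = pd.DataFrame({'State': ['Alabama', 'Maine', 'Maine', 'Alaska'],
                       'Participation': ['100%', '7%', '7%', '33%']})

# Counts are sorted in descending order, so duplicates float to the top
print(act_18['State'].value_counts())
```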
The next step is to determine if the values are duplicates or the data was input incorrectly.
We’ll do so with a technique known as masking, which allows us to examine the rows within a data frame that meet specified criteria.
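A mask is a boolean Series; indexing a data frame with it keeps only the rows where the condition holds. A sketch on a toy frame:

```python
import pandas as pd

# Toy ACT 2018 frame with a duplicated 'Maine' row
act_18 = pd.DataFrame({'State': ['Alabama', 'Maine', 'Maine', 'Alaska'],
                       'Participation': ['100%', '7%', '7%', '33%']})

# The comparison produces a boolean Series; indexing with it filters the rows
maine_rows = act_18[act_18['State'] == 'Maine']
print(maine_rows)
```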
For example, let’s use masking to look at all rows in the 2018 ACT data that have a ‘State’ value of ‘Maine’.

Now the corrupted value has been confirmed as a duplicate entry.
Thus, we can simply delete the row using the .drop() method and reset the data frame index using the .reset_index() method to solve the issue.

Notes: the index label passed to .drop() (taken from act_18.index) specifies which row to drop, and inplace=True applies the change to the original data frame object (act_18) without reassigning it.
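A sketch of the deletion on a toy frame (the real code drops the row by the index found with the mask):

```python
import pandas as pd

# Toy frame with the duplicate 'Maine' row at index 2
act_18 = pd.DataFrame({'State': ['Alabama', 'Maine', 'Maine', 'Alaska'],
                       'Participation': ['100%', '7%', '7%', '33%']})

# Drop the duplicate row by its index label, then renumber the index from zero
act_18.drop(2, inplace=True)
act_18.reset_index(drop=True, inplace=True)
print(act_18)
```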
Now that we’ve solved the problem of an inconsistent number of rows between the ACT data frames, there’s still an inconsistent number of rows between the SAT and ACT data frames (52 for ACT and 51 for SAT).
In order to compare SAT and ACT data between states, we need to ensure each state is represented equally in each data frame.
This is your opportunity to get creative and think of a way to retrieve ‘State’ column values between data frames, compare the values, and display the result.
My solution is shown in the function below.

The function, compare_values(), takes a column from two different data frames, stores the values temporarily, and displays any values that appear in only one of the two datasets.
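The original function isn’t preserved in this copy of the article; a reconstruction that matches the description (a set difference of the two columns) might look like this:

```python
import pandas as pd

def compare_values(col_a, col_b):
    """Show values that appear in only one of two columns
    (assumed reconstruction of the article's compare_values())."""
    only_a = set(col_a) - set(col_b)
    only_b = set(col_b) - set(col_a)
    print('Only in first:', sorted(only_a))
    print('Only in second:', sorted(only_b))
    return sorted(only_a | only_b)

# Toy 'State' columns
act_states = pd.Series(['Alabama', 'Alaska', 'National'])
sat_states = pd.Series(['Alabama', 'Alaska'])
diff = compare_values(act_states, sat_states)
```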
Let’s see how it works when comparing the SAT/ACT ‘State’ column values for 2017 and 2018.

All right! Now we know that the ‘National’ value in the ‘State’ columns of both ACT data frames needs to be removed.
This can be accomplished using the same code we used to locate and remove the duplicate ‘Maine’ value in the ACT 2018 data frame.

However, there is another inconsistency: the values ‘Washington, D.C.’ and ‘District of Columbia’ in the SAT 2018 and ACT 2018 data. We need to decide on a single value to represent Washington, D.C./District of Columbia across the four data frames.
Either choice works, but it’s good practice to pick the name that appears most often across the data frames.
Since the only difference between the SAT 2017 and ACT 2017 ‘State’ columns is the ‘National’ value, we know that whichever of ‘Washington, D.C.’ and ‘District of Columbia’ they use is already consistent between those two data frames. Let’s use the masking technique to check which of the two values appears in the ACT 2017 ‘State’ column.

Now we officially have enough evidence to justify replacing the ‘Washington, D.C.’ value with ‘District of Columbia’ in the ACT 2018 data frame.
Using the Pandas data frame .replace() method, we can do just that.
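A sketch of the substitution on a toy frame:

```python
import pandas as pd

# Toy ACT 2018 frame using the 'Washington, D.C.' spelling
act_18 = pd.DataFrame({'State': ['Washington, D.C.', 'Alabama'],
                       'Participation': ['32%', '100%']})

# Standardize on 'District of Columbia'
act_18['State'] = act_18['State'].replace('Washington, D.C.',
                                          'District of Columbia')
print(act_18['State'].tolist())
```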
Then, we can confirm our changes were successful using the compare_values() function.

Success! Values are now consistent across the ‘State’ columns of all four data frames.
Now we can address the issue of inconsistent number of columns between the ACT datasets.
Let’s compare the column names of each data frame using the .columns attribute.

Note that adding a “\n” escape sequence to a print() call inserts a blank line after its output.
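For example:

```python
import pandas as pd

# Toy frames mirroring the SAT/ACT column layouts
sat_18 = pd.DataFrame({'State': [], 'Participation': [], 'Total': []})
act_18 = pd.DataFrame({'State': [], 'Participation': [], 'Composite': []})

# The '\n' adds a blank line between the two outputs
print(sat_18.columns, '\n')
print(act_18.columns)
```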
Since the goal of this analysis is to compare SAT and ACT data, the more similarly we can represent each dataset’s values, the more helpful our analysis will be.
Therefore, the only columns I’ll keep in each of the data frames are ‘State’, ‘Participation’, ‘Total’ (SAT only), and ‘Composite’ (ACT only).
Take note that if your goal for the analysis were different, such as comparing SAT performance between 2017 and 2018, keeping the data specific to each performance category (e.g., Math) would be essential.
Staying aligned with the task at hand, we can remove the excess columns using the .drop() method, like so.

Now the data frames all have the same dimensions! Unfortunately, there is still much to be done.
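The column-dropping step above, sketched on a toy ACT frame (the per-subject column names are assumptions):

```python
import pandas as pd

# Toy ACT frame with extra per-subject columns the comparison doesn't need
act_17 = pd.DataFrame({'State': ['Alabama'], 'Participation': ['100%'],
                       'English': [18.9], 'Math': [18.4], 'Composite': [19.2]})

# Keep only 'State', 'Participation', and 'Composite'
act_17 = act_17.drop(columns=['English', 'Math'])
print(act_17.columns.tolist())
```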
Let’s see if any data is missing and take a look at the data types of all the data frames. Check for missing data using .isnull().sum(), and check data types with .dtypes.

The good news is that no missing values are present in the data.
The bad news is the error in data types, specifically that the ‘Participation’ column in each of the data frames is of object type, meaning it’s treated as a string.
This is problematic because many useful visualizations to observe when exploring data require numeric type variables to function, such as correlation heat maps, box plots, and histograms.
The same problem is seen in the ‘Composite’ columns in the two ACT data frames.
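Both checks, sketched on a toy frame that reproduces the dtype problem:

```python
import pandas as pd

# Toy frame mirroring the issue: percentages stored as strings
sat_18 = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                       'Participation': ['6%', '43%'],
                       'Total': [1166, 1106]})

print(sat_18.isnull().sum())  # zero missing values per column
print(sat_18.dtypes)          # 'Participation' shows up as object (a string)
```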
Let’s look at the first five rows of the SAT 2018 and ACT 2018 data.

First 5 rows of SAT 2018 data; first 5 rows of ACT 2018 data.
You can see that ‘Composite’ and ‘Participation’ should be of float type.
Good practice is keeping the types of numerical data you want to compare consistent, so it’s acceptable to convert ‘Total’ to float type as well without hurting the integrity of the data (integer = 1166, float = 1166.0).
The first step in this type conversion is removing the ‘%’ character from each of the ‘Participation’ columns so they can be converted into floats.
The next step will be converting all data into floats except for the ‘State’ column in each data frame.
This can be tedious, giving us another great opportunity to create a function to save time! My solution is shown in the functions below.

Note: https://stackabuse.com/lambda-functions-in-python/ is a great resource to learn more about using lambda functions in Python 3.
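The original helper functions aren’t preserved here; a reconstruction consistent with the description might be:

```python
import pandas as pd

def fix_participation(df):
    """Strip the '%' character so participation rates can become floats
    (assumed reconstruction of the article's helper)."""
    df['Participation'] = df['Participation'].apply(lambda s: s.replace('%', ''))
    return df

def convert_to_float(df):
    """Convert every column except 'State' to float
    (assumed reconstruction of the article's helper)."""
    for col in df.columns:
        if col != 'State':
            df[col] = df[col].astype(float)
    return df

# Toy frame to exercise the helpers
sat_18 = pd.DataFrame({'State': ['Alabama'], 'Participation': ['6%'],
                       'Total': [1166]})
sat_18 = convert_to_float(fix_participation(sat_18))
print(sat_18.dtypes)
```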
Time to put these functions to work.
First, let’s use the fix_participation() function. Now we can convert the data types of all columns using the convert_to_float() function.

But wait! Running the convert_to_float() function should have thrown an error.
Error messages can be very helpful depending on what IDE you’re using.
In Jupyter Notebook, the error clearly directs you to the ‘Composite’ column in the ACT 2017 data frame.
To look at the values more closely, use the .value_counts() method.

It looks like our culprit is a stray ‘x’ character in the data, most likely the result of incorrect input when the data was entered into the original file.
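On a toy ‘Composite’ column, both the detection and the fix the article describes next look like this:

```python
import pandas as pd

# Toy ACT 2017 'Composite' column containing a stray 'x' on one entry
act_17 = pd.DataFrame({'State': ['Alabama', 'Wyoming'],
                       'Composite': ['19.2', '20.2x']})

# value_counts() surfaces the malformed '20.2x' entry
print(act_17['Composite'].value_counts())

# Strip the stray character from each value, then convert to float
act_17['Composite'] = act_17['Composite'].apply(lambda s: s.strip('x')).astype(float)
print(act_17['Composite'].tolist())
```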
To remove it, use the .strip() method within the .apply() method, like so.

Great! Now try running the conversion again and all the data will be of the correct types.

The final step before we can start visualizing the data is to merge everything into a single data frame.
To merge cleanly, we first need to rename the columns in each data frame to be descriptive of what they each represent.
For example, a good name for the ‘Participation’ column in the SAT 2017 data would be ‘sat_participation_17’.
When the data is merged, this name is much more descriptive.
Note also the underscore notation, which avoids tedious spacing errors when accessing values, and the lowercase convention, which speeds up typing.
Naming conventions for data are up to the developer, but many people consider these to be good practice.
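A rename following these conventions might look like this (the new names are illustrative):

```python
import pandas as pd

sat_17 = pd.DataFrame({'State': ['Alabama'], 'Participation': ['5%'],
                       'Total': [1165]})

# Lowercase, underscore-separated, year-suffixed column names
sat_17 = sat_17.rename(columns={'State': 'state',
                                'Participation': 'sat_participation_17',
                                'Total': 'sat_total_17'})
print(sat_17.columns.tolist())
```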
In order to merge the data without errors, we need the index positions of the ‘state’ values to be consistent between data frames.
We do so by sorting each data frame by its ‘state’ column, then resetting the index values in order starting at zero.

Finally, we can merge the data.
Rather than merging all four data frames at once, I merged two at a time by year and confirmed no errors presented themselves for each merge.
Below is the code for each merge: the 2017 SAT & ACT merged data frame, the 2018 SAT & ACT merged data frame, and the final merge of all the data frames.
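A sketch of the alignment and one of the merges, on toy 2017 frames (column names follow the convention above; the 2018 pair and the final combination would use the same pattern):

```python
import pandas as pd

# Toy 2017 frames with their rows in different orders
sat_17 = pd.DataFrame({'state': ['Wyoming', 'Alabama'],
                       'sat_participation_17': [3.0, 5.0]})
act_17 = pd.DataFrame({'state': ['Alabama', 'Wyoming'],
                       'act_participation_17': [100.0, 100.0]})

# Align row order first: sort by state, then renumber the index from zero
sat_17 = sat_17.sort_values('state').reset_index(drop=True)
act_17 = act_17.sort_values('state').reset_index(drop=True)

# Merge the 2017 pair on 'state'
combined_17 = pd.merge(sat_17, act_17, on='state')
print(combined_17)
```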
Once you’ve cleaned your data, it’s a good idea to save it so you don’t have to go through the process of cleaning it again.
Using the Pandas .to_csv() method, we can write the combined data frame to a new file. Setting index=False saves the data without its index values.
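For example (the output file name is hypothetical):

```python
import os
import tempfile
import pandas as pd

final = pd.DataFrame({'state': ['Alabama'], 'sat_participation_17': [5.0]})

# Write the cleaned data to disk; index=False omits the row numbers
path = os.path.join(tempfile.gettempdir(), 'final.csv')  # hypothetical name
final.to_csv(path, index=False)
print(open(path).read())
```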
Time to visualize the data. Now we can use Matplotlib and Seaborn to take a closer look at our clean, combined data frame.
When examining histograms and box plots, I’ll be focusing on visualizing the distribution of participation rates.
When examining heat maps, the relationships between all data will be considered.
Visualize Data Distributions — Seaborn Histogram

Histograms represent how frequently numeric values fall within specified ranges of a dataset (e.g., how many values in the data fall within the range of 40%-50%).
From the histograms, we can see that there are more states with 90%-100% participation rates for the ACT in 2017 and 2018. Conversely, there are more states with 0%-10% participation rates for the SAT in 2017 and 2018. We can infer that the higher frequency of states with 90%-100% ACT participation rates could be caused by regulations that require taking the ACT.
Visualize Data Distributions — Matplotlib Box Plot

Box plots represent the spread of the data, including the minimum, maximum, and interquartile range (IQR). The IQR spans the first quartile to the third quartile, with the median in between.
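A sketch with made-up participation rates:

```python
import os
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt

# Hypothetical SAT participation rates for 2017 and 2018
sat_17_rates = [5, 38, 30, 11, 53, 100, 100, 3]
sat_18_rates = [6, 43, 29, 25, 60, 100, 100, 4]

fig, ax = plt.subplots()
ax.boxplot([sat_17_rates, sat_18_rates])
ax.set_xticklabels(['sat_participation_17', 'sat_participation_18'])
ax.set_ylabel('Participation rate (%)')
fig.savefig('participation_boxplot.png')
print(os.path.exists('participation_boxplot.png'))
```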
Based on the box plots above, we can see that SAT participation rates increased overall from 2017 to 2018.
Another thing we can notice is the consistency between ACT participation rates from 2017 to 2018.
This raises the question of why SAT participation rates increased overall, even though ACT participation rates didn’t change significantly.
Calculate and Visualize Correlations — Seaborn Heat Map

Stronger relationships are represented by values in the heat map that are closer to negative or positive one.
Weaker relationships are represented by values that are closer to zero.
Positively related variables, values with correlations between zero and positive one, indicate that one variable increases as the other variable increases.
Negatively related variables, values with correlations between negative one and zero, indicate that one variable decreases as the other variable increases.
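A sketch of computing and plotting the correlation matrix on hypothetical merged data:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Hypothetical values; the real frame holds all four years' columns
final = pd.DataFrame({
    'sat_participation_17': [5.0, 38.0, 30.0, 100.0],
    'sat_participation_18': [6.0, 43.0, 29.0, 100.0],
    'act_participation_17': [100.0, 65.0, 62.0, 31.0],
})

# Pearson correlation between every pair of numeric columns
corr = final.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.savefig('correlation_heatmap.png')
```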
Variables with strong relationships that should be examined further include SAT participation in 2017 and SAT participation in 2018, ACT composite score in 2017 and ACT participation in 2017, and ACT participation in 2018 and SAT participation in 2018.
There are many more relationships that should be examined further, but these are good starting points to guide research into why these relationships exist as they are.
Conclusions

Thorough exploratory data analysis ensures your data is clean, usable, consistent, and intuitive to visualize.
Remember, there is no such thing as clean data, so exploring the data before you start working with it is a great way to add integrity and value to your data analysis process before it even starts.
By using strong exploration of your data to guide outside research, you will be able to derive provable insights effectively and efficiently.