Introduction to Exploratory Data Analysis (EDA)

Could engine size predict the price of a car? A great way to visualize this relationship is a scatter plot.

A scatter plot represents the relationship between two continuous variables as individual data points on a 2D graph.

We will use the scatter() method of the matplotlib library to draw the scatter plot.



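    from matplotlib import pyplot as plt

    # Plot engine size on the x-axis against price on the y-axis
    plt.scatter(df['engine-size'], df['price'])
    plt.xlabel('Engine Size')
    plt.ylabel('Price')
    plt.show()

From the above output, we can see that there is a roughly linear, positive relationship between engine size and price.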

Cars with bigger engines tend to be costlier than cars with smaller engines.

This makes sense, right?

Histograms

A histogram shows the frequency distribution of a variable.

It partitions the range of a numeric variable into intervals called “bins” and then counts the number of data points that fall into each bin.

So, the vertical axis represents the number of data points in each bin.

Let’s see an example of this.

We will look at the distribution of “peak-rpm” using a histogram.

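    import numpy as np
    from matplotlib import pyplot as plt

    # Count how many cars fall into each of the (by default 10) equal-width bins
    count, bin_edges = np.histogram(df['peak-rpm'])

    # Draw the histogram with the x-axis ticks placed on the bin edges
    df['peak-rpm'].plot(kind='hist', xticks=bin_edges)
    plt.xlabel('Value of peak rpm')
    plt.ylabel('Number of cars')
    plt.show()

The above output tells us that 10 cars have a peak rpm between 4395 and 4640, around 42 cars have a peak rpm between 4640 and 4885, and so on.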
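To make the bin counting concrete, here is a minimal sketch on a small made-up list of rpm values. The numbers and the choice of three bins are assumptions for illustration and are not taken from the car dataset. np.histogram() returns the count per bin together with the bin edges.

```python
import numpy as np

# Made-up peak-rpm values, for illustration only (not from the car dataset).
peak_rpm = np.array([4200, 4800, 5000, 5000, 5200, 5400, 5500, 5800, 6000, 6600])

# Split the range [min, max] into 3 equal-width bins and count the values in each.
count, bin_edges = np.histogram(peak_rpm, bins=3)

# Each bin is described by its left edge, its right edge and its count.
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], count):
    print(f"{left:.0f} to {right:.0f}: {n} values")
```

Every value is counted exactly once, so the counts add up to the number of values, and there is always one more bin edge than there are bins.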

Grouping of data

Assume that you want to know the average price of different types of vehicles and see how it varies with body style and number of doors.

A nice way to do this is to group the data by “body-style” and “num-of-doors” and then look at the average price in each category.

The groupby() method from the pandas library helps us accomplish this task.

    # Keep only the columns we need, then average the price per group
    df_temp = df[['num-of-doors', 'body-style', 'price']]
    df_group = df_temp.groupby(['num-of-doors', 'body-style'], as_index=False).mean()

The above output tells us that two-door hardtops and two-door convertibles are the most expensive cars, whereas four-door hatchbacks are the cheapest.

A table in this long form is not very easy to read.

So, we can convert it to a pivot table using the pivot() method, which places one grouping variable along the rows and the other along the columns.

    df_pivot = df_group.pivot(index='body-style', columns='num-of-doors')

The price data now becomes a rectangular grid, which is easier to read.
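The grouping and pivoting steps can be tried end to end on a tiny table. This is only a sketch: the six rows below are made up for illustration, and passing values="price" to pivot() is an added choice here so that the resulting columns are plain door counts.

```python
import pandas as pd

# Made-up rows for illustration only (not the real car dataset).
df = pd.DataFrame({
    "num-of-doors": ["two", "two", "four", "four", "two", "four"],
    "body-style": ["hatchback", "hatchback", "sedan", "sedan", "sedan", "hatchback"],
    "price": [6000.0, 8000.0, 12000.0, 14000.0, 10000.0, 7000.0],
})

# Average price for every (num-of-doors, body-style) combination.
# as_index=False keeps the grouping keys as ordinary columns, which pivot() needs.
df_group = df.groupby(["num-of-doors", "body-style"], as_index=False).mean()

# Reshape: one row per body style, one column per door count.
df_pivot = df_group.pivot(index="body-style", columns="num-of-doors", values="price")
print(df_pivot)
```

Each cell of df_pivot holds the average price for one body style and one door count, which is the same information as df_group in a layout that is easier to scan.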

Handling missing values

When no value is stored for a feature in a particular observation, we say that the feature has a missing value.

Examining this is important because missing data can lead to weak or biased analysis.

We can detect missing values by applying the isnull() method to the DataFrame.

The isnull() method returns a rectangular grid of boolean values that tells us whether each cell in the DataFrame is missing or not.
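As a small illustration, the sketch below applies isnull() to a made-up DataFrame. The column names echo the car data, but the values are invented for this example.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; np.nan and None mark missing values.
df = pd.DataFrame({
    "stroke": [3.4, np.nan, 2.8, 3.1],
    "price": [13495.0, 16500.0, np.nan, 13950.0],
    "make": ["audi", "bmw", "honda", None],
})

# Boolean grid with the same shape as df: True where a cell is missing.
missing = df.isnull()
print(missing)

# Summing the booleans column-wise counts the missing cells per column.
print(missing.sum())
```

The per-column sum is a quick numeric summary of the same grid.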

As you can see, scanning a grid like this to find missing values is not very convenient, so we will use a heatmap to detect them visually.

Heatmap

A heatmap takes a rectangular grid of data as input and assigns a color intensity to each cell based on its value.

This is a great way to get visual clues about the data.

We will generate a heatmap of the output of isnull() in order to detect missing values.

    import seaborn as sns
    from matplotlib import pyplot as plt

    # Missing cells (True) are drawn in a different color from present cells (False)
    sns.heatmap(df.isnull())
    plt.show()

This indicates that the “stroke” and “horsepower-binned” columns have a few missing values.

We can handle missing values in several ways:

Delete: You can delete the rows that contain missing values, or delete an entire column that has missing values. The dropna() method from the pandas library can be used for this.

Impute: Deleting data might cause a large loss of information, so replacing the missing values may be a better option than deleting them. One standard technique is to replace missing values with the average value of the column. For example, we can replace the missing values in the “stroke” column with the mean of that column. The fillna() method from the pandas library can be used for this.

Predictive filling: Alternatively, you can fill missing values by estimating them from the surrounding data. By default, the interpolate() method performs a linear interpolation to “guess” the missing values and fills the results into the dataset.
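The three options can be compared side by side on one small column. This is a minimal sketch with made-up “stroke” values, not the article's actual data.

```python
import numpy as np
import pandas as pd

# Made-up stroke values for illustration only; np.nan marks a missing value.
df = pd.DataFrame({"stroke": [2.0, np.nan, 3.0, 4.0]})

# Delete: drop every row that contains a missing value.
dropped = df.dropna()

# Impute: replace missing values with the mean of the column.
imputed = df.fillna({"stroke": df["stroke"].mean()})

# Predictive filling: linear interpolation between the neighbouring rows.
interpolated = df.interpolate()

print(dropped, imputed, interpolated, sep="\n\n")
```

Note that the mean and the interpolated value differ here: the mean uses the whole column, while linear interpolation only looks at the values directly before and after the gap. All three calls return new objects and leave df unchanged.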

ANOVA (Analysis of Variance)

ANOVA is a statistical method for testing whether the mean of a numeric variable differs between groups defined by a categorical variable.

The ANOVA test gives us two measures as a result:

F-test score: The variation between the sample group means divided by the variation within the sample groups.

P value: The probability of obtaining an F-test score at least this large if the group means were actually equal.

In other words, it tells us whether the obtained result is statistically significant: the smaller the p value, the stronger the evidence that the group means differ.
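To see where the F-test score comes from, the sketch below computes it by hand for two made-up groups and compares it with scipy.stats.f_oneway(). The group values are invented for illustration, and the variable names (ss_between, ms_within and so on) are just labels chosen for this example.

```python
import numpy as np
from scipy import stats

# Made-up prices for two groups, for illustration only.
group_a = np.array([10.0, 12.0, 11.0, 13.0])
group_b = np.array([20.0, 22.0, 21.0, 23.0])
groups = [group_a, group_b]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)      # number of groups
n = len(all_values)  # total number of observations

# Variation between the group means (mean square between).
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Variation within the groups (mean square within).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n - k)

f_manual = ms_between / ms_within

# scipy computes the same F score and also returns the p value.
result = stats.f_oneway(group_a, group_b)
print(f_manual, result.statistic, result.pvalue)
```

Because the two group means are far apart relative to the spread inside each group, the F score is large and the p value is small.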

Let’s take an example to understand this better.

The following bar chart shows the average price of different car makes.

Figure: Average price for different makes

We can see that the average prices of “audi” and “volvo” are almost the same.

But the average prices of “jaguar” and “honda” differ significantly.

So, we can say that there is very little variation between the “audi” and “volvo” groups, because their average prices are almost the same.

The variation between “jaguar” and “honda”, on the other hand, is very high.

Let’s verify this using the ANOVA method.

The ANOVA test can be performed using the f_oneway() function from the SciPy library.

    from scipy import stats

    # Group the prices by make, then compare the "audi" and "volvo" groups
    temp_df = df[['make', 'price']].groupby('make')
    stats.f_oneway(temp_df.get_group('audi')['price'], temp_df.get_group('volvo')['price'])

This gives us the following result:

    F_onewayResult(statistic=0.014303241552631388, pvalue=0.9063901597143602)

The result confirms what we guessed at first.

Since the variation between the prices of “audi” and “volvo” is very small, we got a very small F-test score (around 0.01) and a p value around 0.9.


Let’s do this test once more between “jaguar” and “honda” and see the results.




    stats.f_oneway(temp_df.get_group('jaguar')['price'], temp_df.get_group('honda')['price'])

This gives us the following result:

    F_onewayResult(statistic=400.925870564337, pvalue=1.0586193512077862e-11)

Notice that in this case we got a very high F-test score (around 401) with a p value around 1.06 * 10^-11, because the difference between the average prices of “jaguar” and “honda” is huge.

Correlation

Correlation is a statistical measure of the extent to which two variables are related to each other.

In other words, when one variable changes, how does the other variable tend to change?

For example, smoking is known to be correlated with lung cancer, since smoking increases the chances of lung cancer.

Another example is the relationship between the number of hours a student studies and the score that student obtains, because we expect a student who studies more to obtain higher marks in the exam.

We can compute the correlation between different variables using the corr() method.

Then we can plot a heatmap of the output to visualize the results.

    import seaborn as sns
    from matplotlib import pyplot as plt

    # Pairwise correlation of the numeric columns, drawn as an annotated heatmap
    correlation_matrix = df.corr(numeric_only=True)
    sns.heatmap(correlation_matrix, annot=True)
    plt.show()

Figure: Heatmap of correlation matrix

From the above heatmap, we can see that engine size and price are positively correlated (score of 0.87), while highway-mpg and price are negatively correlated (score of -0.7).

In other words, cars with larger engines tend to be costlier than cars with smaller engines.

It also tells us that expensive cars generally have a lower MPG than cheaper cars.
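The same kind of correlation matrix can be reproduced on a toy table. The sketch below uses six made-up cars, so the values are assumptions for illustration and the scores will not match the 0.87 and -0.7 reported for the real dataset.

```python
import pandas as pd

# Made-up values for illustration only (not the real car dataset).
df = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 180, 200],
    "highway-mpg": [40, 36, 33, 30, 25, 22],
    "price": [7000, 9000, 12500, 15000, 21000, 24000],
})

# Pearson correlation between every pair of columns; each score lies in [-1, 1].
correlation_matrix = df.corr()
print(correlation_matrix.round(2))
```

A score close to 1 means the two columns rise together, a score close to -1 means one falls as the other rises, and every column has a correlation of 1 with itself.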

Let’s verify this relationship by plotting regression plots between these variables.


    sns.regplot(x='engine-size', y='price', data=df)
    plt.show()

The above plot shows the positive correlation between engine size and price.

    sns.regplot(x='highway-mpg', y='price', data=df)
    plt.show()

The above plot shows the negative correlation between “highway-mpg” and “price”.
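A regression plot draws a fitted straight line through the scatter of points. As a rough numeric check of the direction of such a line, the sketch below fits a least-squares line with np.polyfit() on made-up values. Both the data and the use of np.polyfit() are assumptions for illustration; this is not necessarily how seaborn computes its fit internally.

```python
import numpy as np

# Made-up values for illustration only (not the real car dataset).
engine_size = np.array([90, 110, 130, 150, 180, 200])
highway_mpg = np.array([40, 36, 33, 30, 25, 22])
price = np.array([7000, 9000, 12500, 15000, 21000, 24000])

# Least-squares straight line: price is approximately slope * x + intercept.
slope_engine, intercept_engine = np.polyfit(engine_size, price, deg=1)
slope_mpg, intercept_mpg = np.polyfit(highway_mpg, price, deg=1)

print(f"price vs engine-size slope: {slope_engine:.1f}")
print(f"price vs highway-mpg slope: {slope_mpg:.1f}")
```

A positive slope corresponds to an upward-sloping regression line (positive correlation) and a negative slope to a downward-sloping one (negative correlation).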

This was a brief introduction to Exploratory Data Analysis.

Follow our publication to get regular updates on this kind of tutorial.

If you enjoyed reading this article, please have a look at our Introduction to Machine Learning course at Code Heroku.

