Hypothesis testing in the Northwind dataset using ANOVALocating the most profitable customersLaura LewisBlockedUnblockFollowFollowingMar 9Project aimAs part of a project on the Northwind database, I needed to come up with some questions to ask of the data in order to derive valuable business insights for the company.

The Northwind database is a sample database from Microsoft for a fictitious company called Northwind Traders, which imports and exports consumables from around the world.

The database contains a range of information on the company’s activities, including on customers, their orders, the products and their suppliers.

I was interested in gaining a better understanding of their customers, and also figuring out which customer groups contributed most to Northwind’s bottom line.

So one of the questions I decided to ask was:Does the average amount spent per order vary between customers from different regions?Stating the hypothesesTo answer this question, I first created a null and alternate hypothesis:Null hypothesis: the average amount spent per order is the same between different customer regionsAlternate hypothesis: the average amount spent per order is different (either higher or lower) between different customer regionsThe alpha level (i.

e.

the probability of rejecting the null hypothesis when it is true) was set at 0.

05.

Obtaining and scrubbing the dataThe first step was to extract the right data using this (only slightly mislabelled) schema:Northwind database schemaTo answer the question under investigation, I needed to extract the customer region, the total quantity of each product ordered, the unit price and the discount level for each order.

To do this I joined the Customer and OrderDetail tables using the Order table:# Importing the required librariesfrom sqlalchemy import create_enginefrom sqlalchemy.

orm import Session, sessionmakerimport pandas as pdimport seaborn as snsimport numpy as npimport matplotlib.

pyplot as pltimport statsmodels.

api as smfrom statsmodels.

formula.

api import ols# Creating an engine and connecting to a database with SQLAlchemyengine = create_engine('sqlite:///Northwind_small.

sqlite', echo=True)Session = sessionmaker(bind=engine)session = Session()con = engine.

connect()# Extracting the data required as a pandas dataframe using a SQL querydf3 = pd.

read_sql_query(''' SELECT c.

Region, od.

OrderId, od.

Quantity, od.

UnitPrice, od.

Discount FROM Customer c JOIN [Order] o ON c.

Id = o.

CustomerId JOIN OrderDetail od ON od.

OrderId = o.

Id ''', engine)The resulting dataframe now showed each order ID, quantity, unit price and discount for each product in each order:But what I actually wanted was the total price per order.

This was calculated using the code in the cell below, which applies the discount to the product of the unit price and the unit quantity and then sums for each order and drops the columns that are no longer needed:# Calculating the price per sub-orderdf3['price_per_order'] = df3.

Quantity * df3.

UnitPrice * (1 – df3.

Discount)# Dropping the columns for quantity, unit price and discount now that we have the total pricedf3.

drop(['Quantity', 'UnitPrice', 'Discount'], axis=1, inplace=True)# Grouping the data by order and summing the price for each orderdf3 = df3.

groupby(['Region', 'OrderId'])['price_per_order'].

sum().

reset_index()# Dropping the OrderId as we no longer need thisdf3.

drop('OrderId', axis=1, inplace=True)This left me with a dataframe that only included the data I need:However, a quick count of the number of orders from each region revealed that customers from some regions have only placed a small number of orders:Because ANOVA will be used to test the hypothesis, it is best to ensure relatively equal sample sizes, particularly to satisfy the assumption of equal variances.

Smaller groups were therefore combined with other groups that make geographical sense, and a minimum group size of 30 was used.

# Combining Eastern and Southern Europedf3.

loc[(df3.

Region == 'Eastern Europe') | (df3.

Region == 'Southern Europe'),'Region'] = 'Southern and Eastern Europe'# Combining Scandinavia and Northern Europedf3.

loc[(df3.

Region == 'Scandinavia') | (df3.

Region == 'Northern Europe'),'Region'] = 'Northern Europe and Scandinavia'# Combining Central and South Americadf3.

loc[(df3.

Region == 'Central America') | (df3.

Region == 'South America'),'Region'] = 'South and Central America'Exploring the dataNow that the data was in the right format and grouped correctly, it was possible to start exploring it.

The following code block produces the graph below, which plots the number of orders, total price per order and average price per order for each region:# Plotting the number of orders, total price per order and average price per order for each regionfig, (ax1, ax2, ax3) = plt.

subplots(3, 1, figsize=(8,8))df3.

groupby(['Region'])['Region'].

count().

plot(kind='barh', ax=ax1)df3.

groupby(['Region'])['price_per_order'].

sum().

plot(kind='barh', ax=ax2)df3.

groupby(['Region'])['price_per_order'].

mean().

plot(kind='barh', ax=ax3)ax1.

set_title('Total number of orders')ax1.

set_ylabel('')ax2.

set_title('Total price of orders ($)')ax2.

set_ylabel('')ax3.

set_title('Average price per order ($)')ax3.

set_ylabel('')fig.

subplots_adjust(hspace=0.

4);The graphs show that Western Europe is the region with the greatest number of orders, and also has the greatest total price of orders.

However, North America has the most expensive order on average (followed by Western Europe).

Southern and Eastern Europe has the lowest number of orders, lowest total price of orders, and cheapest order on average.

The third graph lent support to the alternate hypothesis that there are significant differences in average order price between regions.

This was further investigated through the use of statistical hypothesis testing.

Hypothesis testing and resultsTo answer the question of whether the average amount spent per order varies between customers from different regions, ANOVA was used.

This assesses the degree of variation between multiple samples, where each sample is a different region in this case.

ANOVA assumes that the data is normally distributed, and that the samples have similar variances.

Distribution plots were made using seaborn, which demonstrated that the data was very heavily positively skewed, with long tails.

Log-transforming the data resulted in more normally distributed data with more similar distributions.

The distributions of the log-transformed data (shown below) were produced with the following code:# Copying the dataset and log-transforming price_per_orderdf3_log = df3.

copy()df3_log['price_per_order'] = np.

log(df3['price_per_order'])# Plotting the distributions for the log-transformed dataplt.

figure(figsize=(8,5))for region in set(df3_log.

Region): region_group = df3_log.

loc[df3_log['Region'] == region] sns.

distplot(region_group['price_per_order'], hist_kws=dict(alpha=0.

2), label=region) plt.

legend() plt.

xlabel('Price per order (log-transformed)')The data was now more normally distributed, and the variances from the mean were more similar.

It was now possible to run an ANOVA test:# Fitting a model of price_per_order on Region categories, and using statsmodels to compute an ANOVA tablelm = ols('price_per_order ~ C(Region)', df3_log).

fit()sm.

stats.

anova_lm(lm, typ=2)The ANOVA table above revealed that the p-value is lower than the alpha value of 0.

05.

Therefore I was able to reject the null hypothesis and accept the alternate hypothesis.

There are statistically significant differences in average order value between different regions, i.

e.

customers from different parts of the world spend different amounts of money on their orders, on average.

ConclusionsBusiness insights:There are statistically significant differences in the average price per order from customers from different regions.

Western European customers place the most orders, and are the single biggest contributors to Northwind’s bottom line.

However, although North American customers have placed roughly half as many orders as those from Western Europe, they spend more per order, on average.

The difference between the region with the most expensive orders on average (North America, $1,945.

93) and the region with the least expensive orders (Southern and Eastern Europe, $686.

73) is $1,259.

20, or 2.

8 times more for orders from North America.

Southern and Eastern Europe has the smallest number of orders, the lowest total price of orders, and the lowest average price per order.

North American customers have placed a similar number of orders to those from South and Central America, but their average expenditure per order is 1.

8 times higher.

Potential business actions and directions for future work:If Northwind was looking to focus on more profitable customers, a potential action would be to stop serving customers in Southern and Eastern Europe, and to focus more on customers in Western Europe and North America.

However, further analysis would be needed to confirm these findings.

For example, it might be the case that some more expensive products are only available in certain regions.

.. More details