The Chi Square Statistic (P3: programming with Python)

Nhan Tran, Mar 11

Before going to the programming part, please spend a couple of minutes reading my previous post about The Chi Square Statistic for reference.

First, you can choose between many different programming languages for your work, such as Python, R, Java, Scala, etc.

In this post, I will use Python because it is one of the most popular programming languages, is easy to learn and use, has a lot of supporting libraries and, of course, I'm more familiar with Python than with the others.

Now, take a look at the sample contingency table that we will work with today.

This is a dataset that contains a total of 1000 votes from different races (asian, black, hispanic, other, and white) and parties (democrat, independent and republican).

Methodology 01: manual calculation

First, you need to import 3 basic libraries that help you process the dataset:

    import numpy as np
    import pandas as pd
    import scipy.stats as stats

Step 01: Create sample data

In this sample, I will use the random functions from numpy with a seed of 10:

    # Fix the seed so the random sample is reproducible
    np.random.seed(10)

    # Sample races randomly at fixed probabilities
    voter_race = np.random.choice(a=["asian","black","hispanic","other","white"], p=[0.05, 0.15, 0.25, 0.05, 0.5], size=1000)

    # Sample parties randomly at fixed probabilities
    voter_party = np.random.choice(a=["democrat","independent","republican"], p=[0.4, 0.2, 0.4], size=1000)

    # Bind the 2 arrays (voter_race and voter_party) to make a DataFrame
    voters = pd.DataFrame({"race":voter_race, "party":voter_party})

Then create a crosstab from the previous DataFrame and assign names to its columns and rows:

    voter_tab = pd.crosstab(voters.race, voters.party, margins=True)
    voter_tab.columns = ["democrat", "independent", "republican", "row_totals"]
    voter_tab.index = ["asian", "black", "hispanic", "other", "white", "col_totals"]

Step 02: Create Observed table and Expected table

The observed table can be extracted from our crosstab by excluding row_totals and col_totals.
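To see what margins=True adds to a crosstab, here is a minimal sketch on a tiny made-up dataset (six hypothetical votes, not the 1000 votes used in this post). It removes the totals with drop, which is an alternative to the iloc slice used in this post, not the method the post itself follows.

```python
import pandas as pd

# Tiny hypothetical dataset (not the article's 1000 votes)
toy = pd.DataFrame({
    "race":  ["asian", "asian", "black", "black", "black", "white"],
    "party": ["democrat", "republican", "democrat",
              "democrat", "independent", "republican"],
})

# margins=True appends an "All" row and an "All" column holding the totals
toy_tab = pd.crosstab(toy.race, toy.party, margins=True)

# Dropping both "All" labels leaves only the observed counts
toy_observed = toy_tab.drop(index="All", columns="All")

print(toy_tab)
print(toy_observed)
```

Selecting by label like this keeps working if the number of races or parties changes, while a positional slice has to be adjusted by hand.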

You can see that row_totals is the 4th column (index 3, counting from 0) and col_totals is the 6th row (index 5, counting from 0).
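If you prefer not to count positions by hand, pandas can report them. The sketch below uses a small table with hypothetical counts, laid out like voter_tab, and looks up the zero-based positions of the totals before slicing them away.

```python
import pandas as pd

# Hypothetical counts, laid out like voter_tab (only two races for brevity)
tab = pd.DataFrame(
    [[1, 2, 3, 6], [4, 5, 6, 15], [5, 7, 9, 21]],
    index=["asian", "black", "col_totals"],
    columns=["democrat", "independent", "republican", "row_totals"],
)

col_pos = tab.columns.get_loc("row_totals")  # zero-based position of the totals column
row_pos = tab.index.get_loc("col_totals")    # zero-based position of the totals row

# A slice stops just before its end position, so this leaves out both totals
inner = tab.iloc[0:row_pos, 0:col_pos]

print(col_pos, row_pos)
print(inner)
```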

So [0:5, 0:3] in the code snippet below means "take the rows at positions 0 through 4 and the columns at positions 0 through 2 (the end of each slice is excluded) and assign them to a new table named observed":

    observed = voter_tab.iloc[0:5, 0:3]

Figure: observed table

The expected table can be calculated using the formula below:

    expected count = (row total x column total) / total observations

Now take a look back at our code and see what we have:

    total_rows = voter_tab["row_totals"]
    total_columns = voter_tab.loc["col_totals"]
    total_observations = 1000

Alright, now here is the code to calculate the expected table:

    expected = np.outer(voter_tab["row_totals"][0:5], voter_tab.loc["col_totals"][0:3]) / 1000

* Please note that "loc" in the code above selects by row label: voter_tab["row_totals"] picks a column by name, while voter_tab.loc["col_totals"] picks a row by name.

Then convert the expected table into a DataFrame and assign names to its columns and rows:

    expected = pd.DataFrame(expected)
    expected.columns = ["democrat", "independent", "republican"]
    expected.index = ["asian", "black", "hispanic", "other", "white"]

Figure: expected table

Step 03: Calculate the Chi-Square value and Critical value

Chi square (χ²) formula:

    χ² = Σ (observed - expected)² / expected

    chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

chi_squared_stat = 7.16

* Note: We call .sum() twice: once to get the column sums and a second time to add the column sums together, returning the sum of the entire 2D table.
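The expected table and the chi-square sum are easier to check by hand on a small table. The sketch below uses a hypothetical 2x2 table (the counts and the labels r1, r2, c1, c2 are made up for illustration) and follows the same steps: outer product of the totals divided by the number of observations, then the sum of the per-cell terms.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 table of observed counts
obs = pd.DataFrame([[10, 20], [30, 40]], index=["r1", "r2"], columns=["c1", "c2"])

row_totals = obs.sum(axis=1)   # 30 and 70
col_totals = obs.sum(axis=0)   # 40 and 60
n = obs.to_numpy().sum()       # 100 observations in total

# expected[i, j] = row_totals[i] * col_totals[j] / n
exp = pd.DataFrame(np.outer(row_totals, col_totals) / n,
                   index=obs.index, columns=obs.columns)

per_cell = (obs - exp) ** 2 / exp
chi2_stat = per_cell.sum().sum()        # column sums first, then their total
same_stat = per_cell.to_numpy().sum()   # one call on the underlying array

print(exp)
print(chi2_stat)
```

Here the expected counts are 12, 18, 28 and 42, so the statistic is 4/12 + 4/18 + 4/28 + 4/42 = 50/63, about 0.79.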

The critical value can be calculated using the stats library:

    crit = stats.chi2.ppf(q=0.95, df=8)

crit = 15.51

* Note: We use a significance level of 5% (equivalent to q = 0.95, the 95th percentile of the distribution), and the degrees of freedom are 8, which can be calculated using this formula: (total rows - 1) x (total columns - 1) = (5 - 1) x (3 - 1).

Now we can give the final conclusion: because chi-square < critical value (7.16 < 15.51), we cannot reject the hypothesis that race and party are independent.

Methodology 02: calculate using scipy.stats library

First, we will use the results from steps 1 and 2 to get the observed table before applying the code snippet below:

    # Use a new name so that the stats module is not overwritten
    result = stats.chi2_contingency(observed=observed)
    print(result)

After printing the value of result, the output should look like:

    (7.169321280162059, 0.518479392948842, 8, array([[ 23.82 ,  11.16 ,  25.02 ],
           [ 61.138,  28.644,  64.218],
           [ 99.647,  46.686, 104.667],
           [ 15.086,   7.068,  15.846],
           [197.309,  92.442, 207.249]]))

This is a tuple that includes chi_squared_stat, p_value, df and the expected table.

You can find the complete source code as follows: chi square example 01.py and chi square example 02.
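Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision. The sketch below shows this on a hypothetical 2x3 table (made-up counts, not the voter data) by unpacking the four values returned by chi2_contingency.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of observed counts
toy = np.array([[10, 20, 30],
                [20, 20, 20]])

# Unpack the statistic, p-value, degrees of freedom and expected table
stat, p_value, dof, expected = stats.chi2_contingency(toy)

crit = stats.chi2.ppf(q=0.95, df=dof)    # critical value at the 5% level
p_from_stat = stats.chi2.sf(stat, dof)   # upper-tail probability of the statistic

reject_by_crit = stat > crit
reject_by_p = p_value < 0.05

print(stat, dof, crit, p_value)
print(reject_by_crit, reject_by_p)
```

Both rules agree here: the statistic is below the critical value and the p-value is above 0.05, so independence is not rejected for this toy table.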


