Applying this to our scenario: if the probability of observing our mention counts, assuming they all come from the same distribution, is below a threshold (defined by us), then we can infer that people from different cities have different mention patterns.

Let’s define some concepts to clarify things (all the definitions are taken from Wikipedia):

Frequency distribution: a list, table or graph that displays the frequency of various outcomes in a sample.

Null hypothesis: a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.

p-value: the probability, when the null hypothesis is true, of obtaining a result equal to or more extreme than what was actually observed.

“From these distributions it is possible to calculate the probability of getting particular scores based on the frequencies with which a particular score occurs in a distribution with these common shapes.” (Discovering Statistics Using R)

Understanding the chi-square test for homogeneity

We want to know if the mention distribution is the same for each city. The overall distribution (all the messages together) should be the same for each city if we assume they come from the same population.

We cannot prove that the distributions are different using statistics, but we can reject that they are the same.

“The reason that we need the null hypothesis is because we cannot prove the experimental hypothesis using statistics, but we can reject the null hypothesis.”

We are only saying that, at a given significance level, it is likely that the experimental hypothesis is true.

“So, rather than talking about accepting or rejecting a hypothesis (which some textbooks tell you to do) we should be talking about ‘the chances of obtaining the data we’ve collected assuming that the null hypothesis is true’.” (Discovering Statistics Using R)

In essence, when we collect data to test theories we can only talk in terms of the probability of obtaining a particular set of data (Field, Andy).

Let’s calculate the expected counts for this
sample.

Expected outcome = (sum of data in that row) × (sum of data in that column) / total data

So the expected number of messages with mentions (mention=YES) for the city of Sao Paulo is:

285 × 407 / 1557 ≈ 74.499

The expected component of the chisq.test result gives the expected counts under the null hypothesis for all the cities:

    > chisq.test(mentiontable)$expected
                  mention
    city                 NO       YES
      Belgrade     170.3333  58.66667
      Boston       374.8821 129.11790
      London       279.6739  96.32606
      SanFrancisco 153.9694  53.03057
      SaoPaulo     211.9869  73.01310
      Toronto      342.1543 117.84571

Note that these printed counts add up to 2,061 messages rather than the 1,557 used in the hand calculation, which is why the Sao Paulo value here (73.01) differs slightly from 74.499.

The expected counts are all greater than 5, so we can perform the test.

Performing the chi-square test

We’ll assume the distributions are the same, so the total column is the best estimate of what this distribution should be:

    > tally(~mention, data=df)
    mention
      NO  YES
    1150  407

The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies. For each cell, the expected frequency is subtracted from the observed frequency, the difference is squared, and the result is divided by the expected frequency. The sum of these values over all cells is the chi-square test statistic. (The chi-square test)

With the value of the chi-square statistic and the degrees of freedom, (number_of_rows − 1) × (number_of_columns − 1), we can calculate the probability of getting results at least this extreme by chance. With 6 cities and 2 mention categories, the degrees of freedom are (6 − 1) × (2 − 1) = 5.

    > chisq.test(mentiontable)

        Pearson's Chi-squared test

    data:  mentiontable
    X-squared = 84.667, df = 5, p-value < 2.2e-16

The p-value is lower than the alpha value (0.05), so we will reject the null hypothesis.

When you have a categorical variable you can use the chi-square test to find out whether its distribution is plausibly the same for two or more populations (or subgroups of a population).

And the steps to use a statistical hypothesis test are:

First, assume the null hypothesis is true.
Then, try to show that the results we got would be very unlikely if it were true.
Then, if we see that this indeed probably can’t be true for the results we got,
we reject the null hypothesis and take the data as support for the experimental hypothesis (otherwise, we fail to reject the null hypothesis).

Besides that, we also saw that spaCy’s Matcher is a great way to extract information from text.
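
To make the expected-count formula and the test statistic from this article concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the code used for the analysis: the function names (`expected_cell`, `expected_counts`, `chi_square_statistic`) are made up for this example, and the `toy_counts` table is hypothetical data, not the article's messages. Only the Sao Paulo numbers (285, 407, 1557) come from the text.

```python
from typing import List, Tuple

Table = List[List[float]]


def expected_cell(row_total: float, col_total: float, grand_total: float) -> float:
    """Expected count for one cell: row total * column total / grand total."""
    return row_total * col_total / grand_total


def expected_counts(observed: Table) -> Table:
    """Expected counts for every cell, assuming all rows share one distribution."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[expected_cell(r, c, grand_total) for c in col_totals] for r in row_totals]


def chi_square_statistic(observed: Table) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    expected = expected_counts(observed)
    # Sum (observed - expected)^2 / expected over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Degrees of freedom: (rows - 1) * (columns - 1).
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# The article's hand calculation for Sao Paulo:
# row total 285, YES column total 407, 1557 messages overall.
sao_paulo_yes = expected_cell(285, 407, 1557)

# Hypothetical counts, NOT the article's data:
# three made-up cities (rows), columns are mention = NO / YES.
toy_counts: Table = [[30, 10], [20, 20], [25, 15]]
toy_statistic, toy_dof = chi_square_statistic(toy_counts)

print(round(sao_paulo_yes, 3))
print(toy_statistic, toy_dof)
```

Turning the statistic into a p-value means comparing it against the chi-square distribution with the given degrees of freedom. If SciPy is available, `scipy.stats.chi2_contingency` performs the whole computation on a contingency table, much like R's `chisq.test`.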