Don’t worry, Python is here to save us.
We can easily test this using the stats library from scipy, and we can also get the individual numbers from the result.

From the central limit theorem, we know that the distribution of sample means drawn from a population is approximately normal.
As a consequence, the ratio of the mean and standard error follows a student-t distribution.
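As a rough sketch of what such a scipy call can look like, here we run a two-sample t-test on made-up blood pressure readings and pull out the individual numbers. The arrays `control` and `experimental`, their means and spreads, and the seed are all assumptions for illustration, not the data used in this article. We also rebuild the t-statistic by hand as the mean difference divided by its standard error, to connect it with the reasoning above.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 readings per group (values invented for illustration)
rng = np.random.default_rng(0)
control = rng.normal(loc=120, scale=20, size=50)
experimental = rng.normal(loc=112, scale=20, size=50)

# Two-sample t-test (scipy assumes equal variances by default)
result = stats.ttest_ind(experimental, control)
t_stat, p_value = result.statistic, result.pvalue

# The same t-statistic by hand: mean difference / pooled standard error
n1, n2 = len(experimental), len(control)
pooled_var = ((n1 - 1) * experimental.var(ddof=1)
              + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
std_error = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (experimental.mean() - control.mean()) / std_error

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, t by hand = {t_manual:.3f}")
```

The two t values should agree, which is a handy sanity check that the library is computing what we think it is.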
We can plot the student-t distribution below.
It is centred on zero, where a value of 0 corresponds to our null hypothesis.
We can also plot a vertical line with our measured t-statistic.
    # imports this snippet relies on; t_stat is the t-statistic computed earlier
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    # generate points on the x axis between -4 and 4:
    xpoints = np.linspace(-4, 4, 500)

    # use stats.t.pdf to get values on the probability density function for the t-distribution
    # the second argument is the degrees of freedom: n1 + n2 - 2
    ypoints = stats.t.pdf(xpoints, (50 + 50 - 2), 0, 1)

    # initialize a matplotlib "figure"
    fig, ax = plt.subplots(figsize=(8, 5))

    # plot the lines using matplotlib's plot function:
    ax.plot(xpoints, ypoints, linewidth=3, color='darkred')

    # plot a vertical line for our t-statistic
    ax.axvline(t_stat, color='black', linestyle='--', lw=5)
    plt.show()

Wrapping up: we have measured a difference in blood pressure of -9.82 between the experimental and control groups.
We then calculated a t-statistic associated with this difference of -1.
Its p-value is 6.15%: this is the probability of measuring a difference at least as extreme as ours, given that there is a 0.0 true difference in blood pressure between experimental and control conditions.
Finally, our null hypothesis states there is no difference between groups.
A t-statistic for no difference between groups would be 0.
Recall that our alternative hypothesis is that the difference between groups is not 0.
This could mean the difference is greater than or less than zero — we have not specified which one.
This is known as a 2-tailed t-test and is the one we are currently conducting.
So we should add another line to the right of the mean to represent this. Our p-value corresponds to the area under the curve of the distribution where the magnitude of the t-statistic is greater than or equal to the one we measured, that is, the area in both tails. We can compute it with the stats.t.cdf function, which is the cumulative distribution function and calculates the area under the curve up to a specified t-statistic.
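Here is a minimal sketch of that two-tailed calculation. The value of `t_stat` below is a hypothetical number chosen for illustration, not the statistic measured in this article; the degrees of freedom follow the n1 + n2 - 2 rule for two groups of 50.

```python
from scipy import stats

t_stat = -1.5          # hypothetical t-statistic (assumption for this example)
dof = 50 + 50 - 2      # degrees of freedom: n1 + n2 - 2

# Area under the curve to the left of -|t|: the left tail
left_tail = stats.t.cdf(-abs(t_stat), dof)

# The t-distribution is symmetric, so the right tail has the same area
p_two_tailed = 2 * left_tail

# Cross-check with the survival function (1 - cdf) on the right tail
p_check = 2 * stats.t.sf(abs(t_stat), dof)

print(f"two-tailed p-value = {p_two_tailed:.4f}")
```

For a 1-tailed test we would keep only `left_tail` (or only the right tail), depending on the direction stated in the alternative hypothesis.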
Good versus bad

Now that we understand more about how to perform a hypothesis test, there’s only one thing left to see: once we have our p-value, what is a good or bad p-value? Here’s where the concept of confidence levels comes in, being the probability that the value of a parameter falls within a specified range of values.
Pay attention to the words ‘range of values’: they refer to the confidence interval, which, according to Wikipedia, is a type of interval estimate computed from the statistics of some observed data.
The interval has an associated confidence level that, loosely speaking, quantifies the level of confidence that the parameter lies in the interval.
One important thing about the confidence level is that we need to define it prior to examining the data.
Otherwise, we would be cheating :).
Most commonly, the 95% confidence level is used.
However, other confidence levels can be used, for example, 90% and 99%.
Each confidence level has an associated z-value. As we have seen before, for any given z-value we’ll have a certain tail probability.
Whether we take one or both tails depends on our hypothesis.
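The z-values for the usual confidence levels can be computed rather than looked up. The sketch below uses the inverse of the standard normal cumulative distribution (`stats.norm.ppf`); the dictionary names are just illustrative choices.

```python
from scipy import stats

z_two_tailed = {}
z_one_tailed = {}

for level in (0.90, 0.95, 0.99):
    alpha = 1 - level  # probability left outside the confidence level
    # two tails: alpha is split evenly between the left and right tail
    z_two_tailed[level] = stats.norm.ppf(1 - alpha / 2)
    # one tail: all of alpha sits in a single tail
    z_one_tailed[level] = stats.norm.ppf(1 - alpha)
    print(f"{level:.0%}: two-tailed z = {z_two_tailed[level]:.3f}, "
          f"one-tailed z = {z_one_tailed[level]:.3f}")
```

Note how the one-tailed z-value at 95% matches the two-tailed z-value at 90%: in both cases 5% of the probability sits in the tail being tested.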
Hypothesis testing process in a nutshell

We could wrap up the entire process in the following steps:
1. Define hypothesis
2. Set confidence level
3. Calculate point estimate
4. Calculate test statistic
5. Find the p-value
6. Interpret results

How else can we apply all this?

Testing our train group versus our test group is not the only thing we’d like to do with hypothesis testing in machine learning.
Let’s see a few more use cases.

As we said before, if somehow we know the mean of our population, we could run a proper test to know if we have a representative sample.

Also, suppose you’re working on a classification problem, but you have very few features to work with.
Of course, for your model to be able to classify your target well, your predictor variables have to be different enough to distinguish between categories.
You could then take the individual vector for each or some of the features and compare it between categories to confirm whether there’s a significant difference between them.
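A small sketch of that feature-by-feature comparison, on invented data: two categories with two features each, where only the first feature is built to differ between categories. The arrays, sizes and seed are assumptions for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# Hypothetical feature matrices, one per category (columns are features).
# Feature 0 has a different mean in each category; feature 1 does not.
class_a = np.column_stack([rng.normal(0, 1, n), rng.normal(5, 1, n)])
class_b = np.column_stack([rng.normal(2, 1, n), rng.normal(5, 1, n)])

# Run a two-sample t-test on each feature separately
p_values = []
for col in range(class_a.shape[1]):
    _, p = stats.ttest_ind(class_a[:, col], class_b[:, col])
    p_values.append(p)
    print(f"feature {col}: p-value = {p:.4g}")
```

A very small p-value for a feature suggests it separates the categories; a large one suggests it may add little on its own.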
Suppose now you’re A/B testing a website and, after running three new versions of it, you want to know if there’s a significant difference between them.
Then you’d set up a test for each pair, comparing A vs B, A vs C and B vs C.
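The pairwise comparison can be sketched like this. The metric (say, seconds spent on the page), the sample sizes and the values are all hypothetical; version C is deliberately generated with a higher mean so that at least one pair differs.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical measurements per website version (invented for illustration)
versions = {
    "A": rng.normal(60, 10, 300),
    "B": rng.normal(60, 10, 300),
    "C": rng.normal(70, 10, 300),
}

# Compare every pair of versions: A vs B, A vs C, B vs C
pairwise_p = {}
for first, second in combinations(versions, 2):
    _, p = stats.ttest_ind(versions[first], versions[second])
    pairwise_p[(first, second)] = p
    print(f"{first} vs {second}: p-value = {p:.4g}")
```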
We may also want to check whether any of our features, or even our target variable, has a symmetrical distribution, by running a test to check if mode, mean and median are the same.
When the values of mean, median and mode are not equal, the distribution is said to be asymmetrical or skewed.
We can do this by finding our t-statistic in the following way:

    t_statistic = (sample_mean - sample_median) / (sample_std / sample_size**0.5)

An alternative way of testing

Bringing back the case when we want to know if a test or validation group is representative of our dataset, we can also figure it out without going into technicalities about hypothesis testing, by transforming this challenge into a classification problem.
As easy as it sounds, using a Random Forest Classifier we can solve this in minutes.
Why a Random Forest?

First of all, let’s learn a little more about this algorithm by looking at its definition from Wikipedia: Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
This model is especially suited to this task since the algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners: it repeatedly selects a random sample with replacement of the training set and fits trees to these samples.
However, this procedure describes the original bagging algorithm for trees.
Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
This randomness that Random Forest models offer is ideal to test our train group versus our test group, since the Random Forest will draw on random subsets of the elements and features of our dataset, checking whether our groups differ across all of it.
To test this, we’ll use the area under the ROC curve (AUC).
When we have an area near to 1, that means we’re predicting the classes perfectly, or at least almost perfectly.
If the score is around 0.5, that means we’re just predicting the baseline, i.e. we’re doing no better than predicting the majority class.
Since we want our train and test groups to be as similar as possible, we would like to obtain an area near to 0.5, which would mean that we cannot predict classes and, therefore, that our test group is representative of our train data.
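To get a feel for these two extremes, here is a small sketch using `roc_auc_score` from Sklearn on invented labels: one set of scores that matches the labels exactly and one that is pure noise. The labels, scores and seed are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical binary labels (0 or 1)
labels = rng.integers(0, 2, size=10_000)

# Scores that rank every 1 above every 0: perfect prediction
perfect_scores = labels.astype(float)

# Scores unrelated to the labels: no predictive information at all
random_scores = rng.random(size=labels.size)

auc_perfect = roc_auc_score(labels, perfect_scores)
auc_random = roc_auc_score(labels, random_scores)

print(f"perfect scores AUC = {auc_perfect:.3f}")
print(f"random scores AUC  = {auc_random:.3f}")
```

The second case is the one we are hoping for when the "classes" are our train and test groups.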
So, wrapping up, the only things we have to do to use this method are:
1. Divide our data into train and test groups
2. Add a column to our data, indicating for example 0 for all the rows in our train group and 1 for all the rows in our test group
3. Concatenate both groups again into a new dataset, and separate the new column as our target variable for the Random Forest model
4. Create a Random Forest model

To picture this, let’s set an example using a popular sample dataset about bikesharing.
Suppose we’re trying to predict the season according to the given weather data. We divide our data into train and test groups, and then add a new column to each group to indicate whether it is train or test data. Next, we pop our target variable out of the dataset and create our Random Forest.
To do this, we’ll use the Sklearn tool RandomForestClassifier. Finally, all that’s left is to evaluate our predictions with the ROC curve metric.
To do this we’re going to use another Sklearn tool called roc_auc_score.
But, wait, that’s not the score we were looking for.
In fact, 0.75 means that we have some pretty accurate predictions for our data.
What’s happening here? Our Random Forest is overfitting our data.
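To see this effect concretely, here is a sketch of the whole method on synthetic data. It is not the bikesharing dataset and will not reproduce the scores from this article: the random features, sizes and seeds are assumptions. It follows the steps listed earlier, then compares the AUC measured on the same rows the forest was fitted on with a cross-validated AUC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
data = rng.normal(size=(600, 5))  # synthetic stand-in for real features

# 1. Divide the data into train and test groups
train_part, test_part = train_test_split(data, test_size=0.25, random_state=0)

# 2-3. Label the rows (0 = train, 1 = test) and concatenate both groups;
#      the label becomes the target variable for the Random Forest
X = np.vstack([train_part, test_part])
is_test = np.concatenate([
    np.zeros(len(train_part), dtype=int),
    np.ones(len(test_part), dtype=int),
])

# 4. Create the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Scoring on the rows the model was fitted on rewards memorisation
model.fit(X, is_test)
in_sample_auc = roc_auc_score(is_test, model.predict_proba(X)[:, 1])

# Cross-validation scores each row with a model that never saw it
cv_auc = cross_val_score(model, X, is_test, cv=5, scoring="roc_auc").mean()

print(f"in-sample AUC       = {in_sample_auc:.3f}")
print(f"cross-validated AUC = {cv_auc:.3f}")
```

Because the split here is random, the forest has nothing real to learn, so the in-sample score is inflated purely by overfitting while the cross-validated score should sit close to 0.5.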
That’s why we should use the cross_val_score tool from Sklearn. That’s much better! Now we know our test group is representative of our dataset, and we can continue with our project, knowing there’s no bias in our train-test division :)

Thanks for reading!