Usually, validation and testing sets are of the same size, and the training sets typically range from 50% to 90% of the primary dataset, depending on the number of samples that the dataset has.
The more samples a dataset has, the more samples we can afford to dump into our training set.
The first step is to shuffle our dataset to make sure that there isn’t some order associated with our samples.
Our chosen split is 70/15/15, so let's split our dataset that way.
We first separate the validation and test sets from the training set, because we want the validation and test sets to have similar distributions.
We can then check the prevalence in each set to make sure they’re roughly the same, so around 20%.
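
The shuffle, the 70/15/15 split, and the prevalence check can be sketched as follows. This is only a minimal illustration: the synthetic DataFrame, the column name OUTPUT_LABEL, and the helper `prevalence` are assumptions made for the example, not the project's actual data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 1,000 samples, roughly 20% positive.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "OUTPUT_LABEL": (rng.random(1000) < 0.2).astype(int),
})

# Shuffle so that any ordering in the original file is removed.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Carve out 30% for validation + test first, then halve it (15% / 15%).
df_valid_test = df.sample(frac=0.30, random_state=42)
df_test = df_valid_test.sample(frac=0.50, random_state=42)
df_valid = df_valid_test.drop(df_test.index)

# Everything that is left becomes the training set (70%).
df_train_all = df.drop(df_valid_test.index)


def prevalence(labels):
    """Fraction of samples that belong to the positive class."""
    return float(np.mean(labels))


for name, part in [("train", df_train_all), ("valid", df_valid), ("test", df_test)]:
    print(f"{name}: n={len(part)}, prevalence={prevalence(part['OUTPUT_LABEL']):.3f}")
```

If the three printed prevalences differ a lot, it is worth reshuffling with a different seed or switching to a stratified split.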
Next, we want to balance our dataset to avoid creating a model that simply classifies most samples as belonging to the majority class; in our case, that is patients not having a seizure.
This is called the accuracy paradox: when the classes are unbalanced, a reported accuracy of, say, 80% may only reflect the underlying class distribution rather than any real predictive skill.
Since our model sees that the majority of our samples are not seizures, the easiest way for it to achieve a high accuracy score is to classify every sample as not having a seizure, regardless of its features.
There are two straightforward and beginner-friendly ways to combat this problem: sub-sampling and over-sampling.
We can sub-sample the dominant class by reducing its number of samples, or we can over-sample the minority class by duplicating its samples until both classes are equal in number.
We will choose to use sub-sampling in this project.
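
As a rough sketch of sub-sampling, assuming the training data sits in a pandas DataFrame with a binary OUTPUT_LABEL column (both the data and the column name are hypothetical here), we can keep every sample of the smaller class and draw an equally sized random subset of the larger one:

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced training set with roughly 20% positives.
rng = np.random.default_rng(0)
df_train_all = pd.DataFrame({
    "feature": rng.normal(size=700),
    "OUTPUT_LABEL": (rng.random(700) < 0.2).astype(int),
})

pos = df_train_all[df_train_all["OUTPUT_LABEL"] == 1]
neg = df_train_all[df_train_all["OUTPUT_LABEL"] == 0]

# Keep all of the smaller class and an equally sized random subset of the larger class.
n_min = min(len(pos), len(neg))
df_train = pd.concat([
    pos.sample(n=n_min, random_state=42),
    neg.sample(n=n_min, random_state=42),
])

# Shuffle so that the two classes are interleaved.
df_train = df_train.sample(frac=1.0, random_state=42).reset_index(drop=True)

print("balanced prevalence:", df_train["OUTPUT_LABEL"].mean())
```

The cost of this approach is that we throw away majority-class samples, which is acceptable only when enough data remains after balancing.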
We then save the train_all , train , valid , and test sets as .
Before moving on to importing sklearn and building our first model, we need to scale our variables for some of our models to work.
Since we will be building nine different classification models, we should scale our variables with the StandardScaler.
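
A minimal sketch of the scaling step, using random arrays as hypothetical stand-ins for our feature matrices. The scaler is fit on the training data only and then applied unchanged to the validation data, so that no information from the validation set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices standing in for the real training and validation features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_valid = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn the per-feature mean and standard deviation from the training data only.
scaler = StandardScaler()
scaler.fit(X_train)

# Apply the same transformation to both sets.
X_train_tf = scaler.transform(X_train)
X_valid_tf = scaler.transform(X_valid)

print("train means:", X_train_tf.mean(axis=0).round(3))
print("train stds: ", X_train_tf.std(axis=0).round(3))
```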
We dump our scaler as a .csv file for quick access if we want to use it in other Python notebooks.

Classification Models

Let's set it up so that we can print all of our model metrics with one function, print_report.
And since we've balanced our data, let's set our threshold at 0.5.
The threshold is used to determine whether a sample gets classified as positive or negative.
This is because our model returns the probability of a sample belonging to the positive class, so it won't produce a binary classification without a threshold.
If the probability returned for a sample is higher than our threshold, the sample is classified as positive; otherwise, it is classified as negative.
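
One possible shape for such a reporting function is sketched below. The body, the parameter names, and the default threshold of 0.5 are assumptions made for illustration, not necessarily what the original notebook does.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score


def print_report(y_actual, y_pred_proba, thresh=0.5):
    """Print and return AUC, accuracy, recall, and precision (hypothetical implementation)."""
    y_actual = np.asarray(y_actual)
    y_pred_proba = np.asarray(y_pred_proba)

    # Turn predicted probabilities into hard labels using the threshold.
    y_pred = (y_pred_proba > thresh).astype(int)

    # AUC is computed from the raw probabilities, so it does not depend on the threshold.
    auc = roc_auc_score(y_actual, y_pred_proba)
    accuracy = accuracy_score(y_actual, y_pred)
    recall = recall_score(y_actual, y_pred)
    precision = precision_score(y_actual, y_pred, zero_division=0)

    print(f"AUC: {auc:.3f}")
    print(f"accuracy: {accuracy:.3f}")
    print(f"recall: {recall:.3f}")
    print(f"precision: {precision:.3f}")
    return auc, accuracy, recall, precision


# Tiny made-up example: four samples with their predicted probabilities.
auc, accuracy, recall, precision = print_report([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Returning the metrics as well as printing them makes it easy to collect the scores of every model into one table later.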

Classification Models

We will cover the following models:

- K Nearest Neighbors
- Logistic Regression
- Stochastic Gradient Descent
- Naive Bayes
- Decision Trees
- Random Forest
- Extreme Random Forest (ExtraTrees)
- Gradient Boosting
- Extreme Gradient Boosting (XGBoost)

We will use baseline default arguments for all models, then choose the model with the highest validation score to perform hyperparameter tuning.

K Nearest Neighbors (KNN)

KNN is one of the first models that people learn when it comes to scikit-learn's classification models.
The model classifies the sample based on the k samples that are closest to it.
For example, if k = 3, and all three of the nearest samples are of the positive class, then the sample would be classified as class 1.
If two out of the three nearest samples are of the positive class, then the model assigns the sample a probability of about 67% of being positive.
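
The k = 3 case can be reproduced with a tiny, made-up one-dimensional dataset. The numbers below are chosen purely for illustration so that exactly two of the three nearest neighbors of the query point are positive.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four made-up training points on a line, with their class labels.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([1, 1, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# The three nearest neighbors of 1.1 are 1.0 (positive), 2.0 (negative), and 0.0 (positive).
query = np.array([[1.1]])
proba_positive = knn.predict_proba(query)[0, 1]
predicted_class = knn.predict(query)[0]

print("P(positive):", proba_positive)
print("predicted class:", predicted_class)
```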
We get a pretty high training area under the receiver operating characteristic curve (ROC AUC), and a high validation AUC as well.
This metric is used to measure the performance of classification models.
AUC tells us how capable the model is of distinguishing between classes: the higher the AUC, the better the model is at telling them apart.
If the AUC is 0.5, then you might as well guess at the samples.
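
One way to build intuition for the AUC is to read it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The helper below, `pairwise_auc`, is a hypothetical name for a small sketch of that interpretation; in practice you would normally call scikit-learn's `roc_auc_score`.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]

    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]
print(pairwise_auc(y, [0.1, 0.4, 0.35, 0.8]))  # three of the four pairs are ranked correctly
print(pairwise_auc(y, [0.1, 0.2, 0.8, 0.9]))   # perfect separation
print(pairwise_auc(y, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores
```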

Logistic Regression

Logistic regression is a type of generalized linear model, a family that generalizes the concepts and abilities of regular linear models.
In logistic regression, the model predicts whether something is true or false, rather than predicting something continuous.
The model fits a linear decision boundary between the two classes, and the linear output is then passed through a sigmoid function to transform it from log-odds into the probability that the sample belongs to the positive class.
Because the model tries to find the best separation between the positive class and negative class, this model performs well when the data separation is noticeable.
This is one of the models that benefits from having all features scaled, and it requires the dependent variable to be dichotomous.
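
The step from log-odds to probability can be shown in a few lines. The weights, bias, and sample below are made-up numbers for illustration, not coefficients from a fitted model.

```python
import numpy as np


def sigmoid(z):
    """Map log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients and one hypothetical (already scaled) sample.
w = np.array([1.5, -0.5])
b = 0.25
x = np.array([1.0, 2.0])

# Linear part of the model: this value is the log of the odds.
log_odds = w @ x + b

# The sigmoid turns the log-odds into P(sample is positive).
p = sigmoid(log_odds)

print("log-odds:", log_odds)
print("P(positive):", round(float(p), 3))
```

A log-odds value of 0 corresponds to a probability of exactly 0.5, which is where the decision boundary lies when the threshold is 0.5.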

Stochastic Gradient Descent

Gradient descent is an algorithm that minimizes the loss functions of many different models, such as linear regression, logistic regression, and clustering models.
The SGD classifier is similar to logistic regression in that gradient descent is used to optimize a linear function.
The difference is that stochastic gradient descent allows mini-batch learning, where the model uses a small batch of samples to take a single step instead of the whole dataset.
It is especially useful where there are redundancies in the data, usually seen through clustering.
As a result, SGD is often much faster to train than logistic regression.

Naive Bayes

The naive Bayes classifier uses Bayes' theorem to perform classification.
It assumes that the features are not related to each other, so the probability of seeing the features together is just the product of the probabilities of each feature occurring.
It finds the probability of the sample being classified as positive, given all the different combinations of features.
The model is often flawed because the “naive” part of the model assumes all features are independent, and that’s not the case most of the time.

Decision Trees

A decision tree is a model that runs a sample through a series of "questions" to determine its class.
The classifying algorithm works by repeatedly separating the data into sub-regions of the same class, and the tree ends when the algorithm has divided all samples into categories that are pure or when it meets some stopping criterion of the classifier.
Decision trees are weak learners, and by that, I mean they are not particularly accurate, and they often only do a bit better than randomly guessing.
They also almost always overfit the training data.

Random Forest

Since decision trees are likely to overfit, the random forest was created to reduce that.
Many decision trees make up a random forest model.
A random forest consists of bootstrapping the dataset and using a random subset of features for each decision tree to reduce the correlation of each tree, hence reducing the probability of overfitting.
We can measure how good a random forest is by testing each tree on its "out-of-bag" data, that is, the samples that were left out of that tree's bootstrap sample.
Random forest is also almost always preferred over a decision tree since the model has a lower variance; hence, the model can generalize better.
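
In scikit-learn, the out-of-bag estimate is available by passing `oob_score=True`. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for our real features, so the printed score says nothing about the EEG data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the real project would use the scaled training features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each tree is trained on a bootstrap sample; oob_score=True scores every sample
# using only the trees that did not see it during training.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

print("number of trees:", len(rf.estimators_))
print("out-of-bag accuracy:", round(rf.oob_score_, 3))
```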

Extremely Randomized Trees

The ExtraTrees classifier is similar to Random Forest, except:

- When choosing a variable at the split, samples are drawn from the entire training set rather than from bootstrap samples
- Node splits are selected at random, instead of being optimized as in Random Forest

This makes the ExtraTrees classifier less prone to overfitting, and it can often produce a more generalized model than Random Forest.

Gradient Boosting

Gradient boosting is another model that combats the overfitting of decision trees.
However, there are some differences between GB and RF.
Gradient boosting builds shorter trees, one at a time, and each new tree reduces the error the previous tree has made.
The error is called the pseudo-residual.
Gradient boosting can be faster than a random forest, and it is useful in lots of real-world applications.
However, gradient boosting doesn’t do that well when your dataset contains noisy data.
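
To make the idea of "each new tree reduces the error of the previous ones" concrete, here is a hand-rolled sketch of boosting. For simplicity it uses a regression problem with squared-error loss, where the pseudo-residual is just the ordinary residual; classification uses a different loss, but the loop has the same structure. The data, learning rate, and tree depth are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared-error loss, the pseudo-residual is the plain residual.
    residual = y - prediction

    # Fit a short tree to the current residuals.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)

    # Take a small step in the direction that reduces the error.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("initial training MSE:", round(mse_history[0], 4))
print("final training MSE:  ", round(mse_history[-1], 4))
```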

Extreme Gradient Boosting

XGBoost is similar to gradient boosting, except:

- Trees have a varying number of terminal nodes
- Leaf weights of the trees that are calculated with less evidence are shrunk more heavily
- Newton boosting provides a more direct route to the minimum than gradient descent
- An extra randomization parameter is used to reduce the correlation between trees
- It uses a more regularized model to control overfitting, since standard GBM has no regularization, which gives it better performance than GBM

XGB implements parallel processing and is much faster than GBM.

Model Selection and Validation

The next step is to visualize the performance of all of our models in one graph; it makes it easier to pick which one we want to tune.
The metric I chose to evaluate my models is the ROC AUC.
You can choose any metric you want to optimize for, such as accuracy or lift. However, the AUC isn't affected by the threshold you choose, so it's a metric that most people use to evaluate their models.
Seven of the nine models have a very high performance, and this is most likely due to the extreme differences in EEG readings between a patient having a seizure and not having one.
The decision tree looks like it overfitted, as expected: notice the gap between the training AUC and the validation AUC.
I’m going to pick XGBoost and ExtraTrees classifier as the two models to tune.

Learning Curves

Learning curves are a way for us to visualize the bias-variance tradeoff in our models.
We make use of the learning curve code from scikit-learn but plot the AUC instead since that’s the metric we chose to evaluate our models with.
Both the training and CV curves are high, so this signals both low variance and low bias in our ExtraTrees classifier.
However, if both curves have a low score and are close to each other, that's a sign of high bias.
If your curves have a big gap, that’s a sign of high variance.
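
A minimal sketch of how the points for such a learning curve could be computed with scikit-learn's `learning_curve`, scored with ROC AUC. The synthetic data and the small ExtraTrees model are placeholders, and plotting is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; the real project would use the EEG training set.
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

train_sizes, train_scores, cv_scores = learning_curve(
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # five training-set sizes
    cv=5,                                  # five-fold cross-validation
    scoring="roc_auc",                     # plot AUC instead of accuracy
)

# Average over the folds to get one point per training-set size.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
gap = train_mean - cv_mean

for n, tr, cv, g in zip(train_sizes, train_mean, cv_mean, gap):
    print(f"n={n}: train AUC={tr:.3f}, CV AUC={cv:.3f}, gap={g:.3f}")
```

A gap that stays large as the training size grows points to high variance, while two low, close curves point to high bias.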
Here are some tips on what to do in both scenarios:

High Bias:
- Increase model complexity
- Reduce regularization
- Change model architecture
- Add new features

High Variance:
- Add more samples
- Reduce the number of features
- Add/increase regularization
- Decrease model complexity
- Combine features
- Change model architecture

Feature Importance

Just as you can tell the magnitude of a feature's impact from its coefficient in a regression model, you can do the same in classification models.
Depending on your bias-variance diagnosis, you may use this graph to decide which features to drop or which to combine into new variables.
However, for my model, there is no need to do that.
Technically speaking, EEG readings are the only feature that I have, and the more readings there are, the better the classification model will become.
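
For tree-based models, one common way to get such a ranking is the `feature_importances_` attribute. The sketch below assumes an ExtraTrees model and synthetic data with made-up feature names; a linear model would use its coefficients instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data with hypothetical feature names X0..X5.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"X{i}" for i in range(X.shape[1])]

model = ExtraTreesClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```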

Hyperparameter Tuning

The next step one should perform is to tune the knobs in our model, also known as hyperparameter tuning.
There are several ways to do this.

Grid Search

This is a traditional technique for hyperparameter tuning, meaning that it was the first to be developed outside of manually tuning each hyperparameter.
It requires all inputs of relevant hyperparameters (e.g., all the learning rates you want to test) and measures the performance of the model using cross-validation by going through all possible combinations of the hyperparameter values.
The drawback to this method is that it would take a long time to evaluate when we have lots of hyperparameters we want to tune.
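
A small sketch of grid search with scikit-learn's `GridSearchCV`. The estimator, the grid values, and the synthetic data are placeholders chosen to keep the example fast; they are not the values used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Every combination is evaluated: 2 values x 3 values = 6 candidates.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5, None],
}

grid = GridSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

print("candidates evaluated:", len(grid.cv_results_["params"]))
print("best parameters:", grid.best_params_)
print("best CV AUC:", round(grid.best_score_, 3))
```

The number of candidates multiplies with every hyperparameter added to the grid, which is exactly why this method becomes slow.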

Random Search

Random search uses random combinations of the hyperparameters to find the best-performing model.
You still need to input all values of the hyperparameters you want to tune; however, the algorithm samples the grid randomly instead of searching every combination of values.
According to this paper, random search often beats grid search in terms of time, because its random sampling can reach a well-performing configuration much sooner than an exhaustive search.
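
The same idea with scikit-learn's `RandomizedSearchCV`, again with placeholder values: the search space below has 36 combinations, but only `n_iter=5` of them are sampled and evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 3 x 4 x 3 = 36 possible combinations in total.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # evaluate only five randomly chosen combinations
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
```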

Genetic Programming

Genetic programming, or a genetic algorithm (GA), is based on Charles Darwin's theory of survival of the fittest.
GA applies small, slow, and random changes to the current hyperparameters.
It works by assigning a fitness value to each solution: the higher the fitness value, the higher the quality of the solution.
It then selects the individuals with the highest fitness values and puts them into a "mating pool," where two individuals generate two offspring (with some changes applied to the offspring), which are expected to be of higher quality than their parents.
This happens over and over until we get to the desired optimal value.
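
The loop described above can be sketched as a toy genetic algorithm. Everything here is made up for illustration: the two "hyperparameters", the fitness function (a real one would be a cross-validated score), and the mutation sizes. It is not how TPOT is implemented.

```python
import random


def fitness(individual):
    """Toy fitness that peaks at learning_rate=0.1 and max_depth=6 (hypothetical target)."""
    lr, depth = individual
    return -100.0 * (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2


def mutate(individual, rng):
    """Apply a small random change to each hyperparameter, staying within bounds."""
    lr, depth = individual
    lr = min(1.0, max(0.001, lr + rng.gauss(0, 0.02)))
    depth = min(12, max(1, depth + rng.choice([-1, 0, 1])))
    return (lr, depth)


def crossover(a, b):
    """Two parents produce two offspring by swapping one hyperparameter."""
    return (a[0], b[1]), (b[0], a[1])


def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(0.001, 1.0), rng.randint(1, 12)) for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        mating_pool = population[: pop_size // 2]  # fittest half
        children = [population[0]]                 # keep the best individual unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(mating_pool, 2)
            c1, c2 = crossover(p1, p2)
            children.extend([mutate(c1, rng), mutate(c2, rng)])
        population = children[:pop_size]
    population.sort(key=fitness, reverse=True)
    best_history.append(fitness(population[0]))
    return population[0], best_history


best, best_history = evolve()
print("best individual (learning_rate, max_depth):", best)
print("best fitness per generation, first and last:", best_history[0], best_history[-1])
```

Because the best individual is always carried over unchanged, the best fitness can never get worse from one generation to the next.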
TPOT is an open source library under active development, first developed by researchers at the University of Pennsylvania.
It takes multiple copies of the entire training dataset, performs its own variation of one-hot encoding (if needed), and then optimizes the hyperparameters using a genetic algorithm.
We will use dask with TPOT's AutoML to perform this.
We pass the XGBoost and ExtraTrees classifiers into the TPOT config to tell it that we only want the algorithm to search within these two classification models.
We also tell TPOT to export every model it makes to a destination in case we want to stop it early.

Model Evaluation

The best performing model, with an AUC of 0.997, is the optimized ExtraTrees classifier.
Below is its performance on all three datasets.
We also plot the ROC curves that correspond to the AUC values above.

Conclusion

Communicating the essential points of a project to a VP or CEO is often the hardest part of the job, so here is what I would say, concisely, to a high-level stakeholder.
In this project, we created a classification machine learning model that can predict whether patients are having a seizure or not through EEG readings.
The best performing model has a lift metric of 4.3, meaning it is 4.3 times better than just randomly guessing.
It is also 97.4% correct in predicting the positive classes in the test set.
If this model were put into production to predict whether a patient is having a seizure, you could expect that level of performance in correctly identifying those who are having a seizure.
Thank you for reading!