If you use all of your data to train the algorithm, there is no data left to test if the algorithm learned to predict/classify anything.
So you want to save some data to test if the learning actually worked.
Luckily there are tools available to randomly pick these subsets for you.
Split into a typical 75% training, 25% testing data using your separated datasets for features and labels:

    from sklearn.model_selection import train_test_split
    X_train, x_test, Y_train, y_test = train_test_split(myData, labels, test_size=0.25)

Wait, what if I don’t have label data already?

Actually, this is usually the case if you are just starting to explore an existing dataset.
Here are a few strategies you could try to create labels, and redo this section once you have them!

- Hidden somewhere in your existing data
- Redefine the problem to use existing data
- Identify what you need and add code to gather it
- Hack it together with Excel or Python
- Expert labeling: who is qualified?
- Crowdsource: wisdom of the crowd

In our example, we were lucky enough to have them all ready to go!

Step 4: Training

Most people find it funny how simple this step is.
That’s because decades of hard work have gone into standardizing and tuning these algorithms, so you can just use them.
Choose algorithm: The choice of algorithm really depends on the problem you set out to solve.
If you’re predicting real-estate value or forecasting revenues, you’re looking for Regression algorithms that will give you a clear number as the output.
If you’re trying to make a decision, that would often fall under Classification algorithms.
Classification algorithms can give you the best answer or probabilities for all possible answers, depending on what you want.
There are dozens of flavors of each type of course, and some involve neural networks.
Given the tuning challenges there, you’re better off starting elsewhere though.
Often the best place to start is a simple linear algorithm, as it literally draws a straight line on top of your dataset.
From there, you can optimize the result by exploring other methods such as Decision Trees or Support Vector Machines.
In this case, we’re going to try a few types of classifier algorithms, since this is a classification problem (predict churn or not churn as the output).
Linear Classifier:

    from sklearn.linear_model import SGDClassifier
    algo = SGDClassifier()

Support Vector Machine:

    from sklearn.svm import LinearSVC
    algo = LinearSVC()

Decision Tree:

    from sklearn import tree
    algo = tree.DecisionTreeClassifier()

XGBoost (fancy Decision Tree):

    import xgboost as xgb
    algo = xgb.XGBClassifier()

NOTE: To figure out what these are, how they work, and why, I refer you to my previous post on the topic.
Training: There are a few different ways to do learning besides the basic case of using all data at once, usually depending on how much data you have, how fast your computer is, and whether this is a one-time operation or you need to add new training data in the future.
You can look up further examples for mini-batch or online learning if needed.
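If you are curious what mini-batch learning looks like, here is a minimal sketch, assuming scikit-learn's `SGDClassifier` and its `partial_fit` method. The data, the batch size of 100, and the variable names are made up for illustration and are not from the churn dataset.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy data, made up for illustration: 1,000 rows, 5 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

algo = SGDClassifier(random_state=0)

# partial_fit must be told every possible label on its first call,
# because a single batch might not contain all of them.
all_classes = np.unique(y)

batch_size = 100  # arbitrary choice for this sketch
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # Each call updates the existing model instead of starting over.
    algo.partial_fit(X_batch, y_batch, classes=all_classes)

train_accuracy = algo.score(X, y)
print(train_accuracy)
```

The same loop works when new data arrives later: keep the fitted `algo` around and call `partial_fit` again with the new rows.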
The base case is incredibly simple.
It almost couldn’t be easier, once you’ve done all this prep work.
It should take about a second on a basic computer.

    algo.fit(X_train, Y_train)

Prediction: To actually use the model you’ve just trained, you need to predict something.
Again, if you’re using Regression it’ll be a number.
For Classification, you either get a label, or the probability for each label.
Predict a single output for each row of inputs, for example, row 2 of our dataset:

    algo.…loc])

Client number two will not leave us, then!

Predict label probabilities for a classification problem, by manually entering inputs:

    algo.…loc])

The algo is quite confident.
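Since the calls above depend on the article's own dataset, here is a small self-contained sketch of both prediction styles. It assumes a `DecisionTreeClassifier` (one of the options from Step 4) and a tiny made-up dataset; the two feature columns and the `new_customer` row are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [support calls, monthly minutes] per customer.
# Label 1 = churn, 0 = stay.
X_train = np.array([[1, 200], [2, 180], [8, 30], [9, 20], [1, 220], [7, 25]])
Y_train = np.array([0, 0, 1, 1, 0, 1])

algo = DecisionTreeClassifier(random_state=0)
algo.fit(X_train, Y_train)

# predict() expects a 2D input, so a single row is wrapped in a list.
new_customer = [[8, 35]]

label = algo.predict(new_customer)                # one predicted label per row
probabilities = algo.predict_proba(new_customer)  # one column per class

print(label)
print(algo.classes_)  # tells you which column belongs to which label
print(probabilities)
```

Note that not every classifier offers `predict_proba`. For example, `SGDClassifier` with its default loss only returns labels.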
Step 5: Evaluation

At this point, it feels like you’re done.
Technically you now have a solution.
But you need to find out if it’s any good.
The typical judge of that is called accuracy, which just means the share of samples in your test dataset that it got right.
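To make that definition concrete, here is a minimal sketch that computes accuracy by hand and compares it with scikit-learn's `accuracy_score`. The two label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up example: true labels vs. what a model predicted for 10 samples.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

# Accuracy by hand: fraction of positions where prediction equals truth.
manual_accuracy = (y_true == y_pred).mean()

# The same number from scikit-learn.
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

For scikit-learn classifiers, the `score` method reports this same mean accuracy.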
Accuracy: To begin with, this is really the gold standard of measuring if your algorithm works.
If it gets the right result often enough to solve your problem, you’re good to proceed at least.
There are a lot of exceptions to this, of course, chief among them how well your training data represents real-life data the algorithm will see in the future.
Often this means training is not a one-and-done type of deal, but something you revisit if the accuracy with real data starts dropping dramatically.
First, you should check how it does on the training data.
That is, did it learn to predict the exact same samples it already saw before?

    algo.score(X_train, Y_train)

That means that in 97% of the tests, it got the answer right.
Pretty good, but then it kind of already knew the answer.
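To see how flattering training accuracy can be, here is a deliberately extreme sketch. It assumes a fully grown decision tree and labels that are pure random noise, so there is nothing real to learn. All data here is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)  # random labels, unrelated to X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An unconstrained tree can simply memorize every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)  # looks perfect
test_accuracy = tree.score(X_test, y_test)     # roughly a coin flip

print(train_accuracy, test_accuracy)
```

A big gap between the two numbers is the classic sign that the model memorized its training data instead of learning something general.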
This is why we put aside some secret data earlier, to show the algo some new data!

Measure accuracy on the testing set:

    algo.score(x_test, y_test)

Okay, this is the right algo for sure.
It did almost as well on the test dataset.
It’s a keeper!

A bunch of other metrics you’ll have to read about to understand fully:

    from sklearn.metrics import classification_report
    y_pred = algo.predict(x_test)
    print(classification_report(y_test, y_pred))

So on the bottom, we have the same 96%.
There were a total of 834 customers in the test dataset we used, of which only 130 were churns.
So based on Precision this algo could pretty accurately predict both False and True cases for churn, with few false positives which is great.
The lower Recall number for the True case is worrying though, as it means there are more false negatives, i.e. people we thought would stay that end up leaving!

So does it do the job? Yes, definitely. Is it perfect? No, nothing is.
The ultimate answer depends on the type of problem you’re solving, and what the risk of false positives/negatives is.
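If you want to count false positives and false negatives yourself, a rough sketch could look like this. It assumes scikit-learn's `confusion_matrix` and uses ten made-up labels (1 = churn), not the article's results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for 10 customers: 1 = churned, 0 = stayed.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels [0, 1], ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of everyone we flagged as churn, how many really churned?
precision = tp / (tp + fp)
# Recall: of everyone who really churned, how many did we catch?
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Here the two false negatives are churners the model missed, which is exactly the kind of mistake that drags Recall down.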
If you’re predicting cancer or something, it’s pretty important!

Repeat until satisfied

At any step above, you may realize you’ve done something wrong and it just won’t work.
Most often, this involves the data itself.
Having good, clean data to work with will make all other steps so much easier.
Perhaps you’re worried about the number of false negatives and want to improve it.
Maybe it’s the distribution of the dataset.
Perhaps you could benchmark different algorithms.
Perhaps there is skew or bias in the test or training dataset.
Perhaps you should try more encoding.
Perhaps delete more features.
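One of those ideas, benchmarking different algorithms, can be sketched in a few lines. This is only an illustration on made-up data, using three of the classifiers from Step 4 and the same 75/25 split. The `candidates` and `results` names are my own.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Made-up data for illustration: 4 features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_train, x_test, Y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "SGDClassifier": SGDClassifier(random_state=0),
    "LinearSVC": LinearSVC(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

# Train every candidate on the same split and record its test accuracy.
results = {}
for name, model in candidates.items():
    model.fit(X_train, Y_train)
    results[name] = model.score(x_test, y_test)
    print(f"{name}: {results[name]:.3f}")

best_name = max(results, key=results.get)
print("Best on this split:", best_name)
```

Keep in mind that the winner on one random split is not guaranteed to win on another, so treat small differences with caution.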
This is the job of the data scientist!

Congratulations, you’re now well on your way to creating your first Machine Learning program.
There are of course further considerations for saving and exporting your model to run in an actual application or server.
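As a starting point for saving a model, here is a minimal sketch, assuming `joblib` (which is installed alongside scikit-learn). The tiny dataset and the file name `churn_model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset, just so there is a trained model to save.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 1, 0, 1])
algo = DecisionTreeClassifier(random_state=0).fit(X, y)

# A temporary folder keeps this sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_model.joblib")  # hypothetical name
    joblib.dump(algo, model_path)       # write the trained model to disk
    restored = joblib.load(model_path)  # read it back, e.g. on a server

# The restored model should behave exactly like the original.
same_predictions = bool((restored.predict(X) == algo.predict(X)).all())
print(same_predictions)
```

In a real application you would save to a permanent path and load the file with the same library versions you trained with.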
You can easily search online to explore these topics further, with plenty of tutorials and free online courses available.
If you’re totally lost at this point, having no idea how and why you ended up here, then you can read this for more context and then retry:

How Machine Learning is changing Software Development