Photo by Alex Knight on Unsplash
AutoML: A Tool to Improve Your Workflow
A look at H2O AutoML in binary classification
Tom Allport
Jun 11
Recently, demand for data science skills has grown faster than the supply can keep up with.
Today it’s difficult to imagine a business that wouldn’t benefit from the detailed analysis data scientists and machine learning algorithms perform.
As artificial intelligence makes its way into every corner of industry, it’s hard to meet the demand for data scientists in every possible use case.
To alleviate the pressure created by this shortage, several companies have started developing frameworks that partly automate the process typically carried out by a data scientist.
AutoML is a method which automates the process of applying machine learning techniques to data.
Typically, a data scientist would spend much of their time pre-processing, selecting features, selecting and tuning models and then evaluating the results.
AutoML can automate these tasks to provide a baseline result; for certain problems it can also deliver high-performing results and insights into where to explore further.
This article is going to look at the Python module H2O and its AutoML feature.
H2O is a Java-based software for data modelling and general computing.
According to H2O.ai: “The primary purpose of H2O is as a distributed (many machines), parallel (many CPUs), in memory (several hundred GBs Xmx) processing engine.”
AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the “best” model without any prior knowledge.
AutoML isn’t going to win you any competitions but it can provide lots of information to help you build better models and reduce the time spent exploring and testing different models.
The current version of the AutoML function trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines and a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.
Stacking (also called meta ensembling) is a model ensembling technique used to combine information from multiple predictive models to generate a new model.
Often the stacked model (also called the 2nd-level model) will outperform each of the individual models due to its smoothing nature and its ability to highlight each base model where it performs best and discredit each base model where it performs poorly.
For this reason, stacking is most effective when the base models are significantly different.
Stacking visualized. Image from http://supunsetunga.com/
Stacking methods are procedures designed to increase predictive performance by blending or combining the predictions of multiple machine learning models.
There is a variety of ensembling or stacking methods, from simple ones like voting or averaging the predictions, to building complex learning models (logistic regressions, k-nearest neighbours, boosting trees) using the predictions as features.
Stacked machine learning models very often beat state-of-the-art academic benchmarks and are widely used to win Kaggle competitions.
On the downside, they are usually computationally expensive, but if time and resources are not an issue, a minimal percentage improvement in predictive performance could, for example, help companies save a lot of money.
The AutoML feature can also drastically reduce the time it takes to run some of these stacked methods.
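To make the simplest of these ideas concrete, here is a minimal sketch of voting and averaging in plain Python. The three sets of model predictions are made-up numbers for illustration, not output from H2O, and this is not how H2O builds its Stacked Ensemble. True stacking would go one step further and fit a second-level model on these predictions.

```python
from collections import Counter


def average_probabilities(model_probs):
    """Average the predicted probabilities of several models, row by row."""
    n_models = len(model_probs)
    return [sum(row) / n_models for row in zip(*model_probs)]


def majority_vote(model_labels):
    """Return the most common predicted label for each row."""
    return [Counter(row).most_common(1)[0][0] for row in zip(*model_labels)]


# Hypothetical probabilities of the positive class from three base models
probs = [
    [0.9, 0.4, 0.2, 0.7],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.2, 0.3, 0.7],
]
# Turn each model's probabilities into hard labels with a 0.5 threshold
labels = [[int(p >= 0.5) for p in model] for model in probs]

averaged = [round(p, 2) for p in average_probabilities(probs)]
print(averaged)               # [0.8, 0.4, 0.2, 0.6]
print(majority_vote(labels))  # [1, 0, 0, 1]
```

Note how the second model disagrees with the others on two rows, and both the average and the vote smooth that disagreement away.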
Data Exploration
This article is going to look at the Mushroom Classification Dataset, which can be found on Kaggle and is provided by UCI Machine Learning.
The dataset contains 23 categorical columns (22 features plus the class label) and over 8000 observations.
The data is classified into two categories, edible and poisonous.
The classes are fairly evenly distributed, with 52% of the observations in the edible class.
There are no missing observations in the data.
This is a popular dataset with over 570 Kernels on Kaggle which we can use to see how well AutoML performs against traditional workflows.
Running H2O
First you will need to install and import the H2O Python module and the H2OAutoML class, as with any other library, and initialize a local H2O cluster. (I am using Google Colab for this article.)
Then we need to load the data. This can either be done straight into an “H2OFrame” or (as I will do for this dataset) into a pandas DataFrame, so that we can label encode the data and then convert it to an H2OFrame.
As with many things in H2O, the H2OFrame works very similarly to a pandas DataFrame, with slight differences in behaviour and syntax.
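A rough sketch of these setup steps is below. The helper names and the CSV path argument are placeholders of mine, and the label encoder is a plain-Python stand-in for whatever encoder you prefer (it numbers categories in sorted order). The H2O and pandas imports sit inside the loading function so the encoder can be tried on its own.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def load_as_h2o_frame(csv_path):
    """Read a CSV with pandas, label encode every column, return an H2OFrame."""
    # Imported here so label_encode can be used without H2O installed
    import h2o
    import pandas as pd

    h2o.init()  # starts (or connects to) a local H2O cluster
    df = pd.read_csv(csv_path)
    for col in df.columns:
        codes, _ = label_encode(df[col].tolist())
        df[col] = codes
    return h2o.H2OFrame(df)


codes, classes = label_encode(["p", "e", "e", "p"])
print(codes, classes)  # [1, 0, 0, 1] ['e', 'p']
```

One thing to keep in mind: after label encoding, the target column is numeric, so it needs converting back to a categorical (factor) column before training a classifier.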
Even though AutoML will be doing most of the work for us in the initial stages it is important that we still have a good understanding of the data we are trying to analyse so that we can build upon its work.
df.describe()
H2OFrame from df.describe()
Similar to functions in sklearn, we can create a train/test split so that the performance of the model can be checked on an unseen dataset to help detect overfitting.
It is important to note that when splitting frames, H2O does not give an exact split.
It’s designed to be efficient on big data, using a probabilistic splitting method rather than an exact split.
For example, when specifying a 0.15 split, H2O will produce a test/train split with an expected value of 0.15 rather than exactly 0.15.
On small datasets, the sizes of the resulting splits will deviate from the expected value more than on big data, where they will be very close to exact.
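The effect is easy to reproduce with a toy simulation. This is only an illustration of probabilistic splitting in general (each row is sent to the test set independently with the given probability); it is not H2O's implementation, and the function name and seed are my own choices. With an H2OFrame, the real split would be done with its split_frame method, for example hf.split_frame(ratios=[0.85], seed=1).

```python
import random


def probabilistic_split(n_rows, test_fraction, seed):
    """Send each row to the test set independently with probability test_fraction."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < test_fraction:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


# Compare the realised test fraction on a small and a large dataset
for n in (100, 200_000):
    train_idx, test_idx = probabilistic_split(n, 0.15, seed=42)
    print(n, len(test_idx) / n)
```

On the larger dataset the realised fraction lands very close to 0.15, while on 100 rows it can easily be off by a few percentage points.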
Then we need to get the column names for the dataset so we can pass them to the function.
For AutoML there are a few parameters which should be specified: x, y, training_frame and validation_frame. Of these, y and training_frame are required and the rest are optional.
You can also configure values for max_runtime_secs and max_models here.
max_runtime_secs caps the total run time (if you set neither limit, AutoML runs for up to one hour by default), and max_models is optional: if you don’t pass it, it defaults to NULL (None in Python).
The x parameter is the vector of predictors from training_frame. If you don’t want to use all the predictors from the frame you passed, you can specify the ones you want by passing them to x.
For this problem we are going to send all the columns in the dataframe to x (except the target) and set max_runtime_secs to 600 seconds, i.e. 10 minutes (some of these models take a long time).
Now it’s time to run AutoML.
Here the function has been specified to run for 10 minutes, but a maximum number of models could have been specified instead (or alongside).
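A minimal sketch of this step is below. It assumes the training data is already an H2OFrame and that the target column is named "class" (as in the Kaggle CSV); the function names and the seed are placeholders of mine. Because a label-encoded target is numeric, the sketch converts it to a factor first so that H2O treats the problem as classification.

```python
def split_predictors(columns, target):
    """Return (x, y): every column except the target, and the target itself."""
    if target not in columns:
        raise ValueError(f"target {target!r} not found in columns")
    return [c for c in columns if c != target], target


def run_automl(train, target="class", max_runtime_secs=600, seed=1):
    """Train H2O AutoML on an H2OFrame and return the fitted H2OAutoML object."""
    # Imported here so split_predictors can be used without H2O installed
    from h2o.automl import H2OAutoML

    x, y = split_predictors(train.columns, target)
    # A numeric target would be treated as regression, so make it categorical
    train[y] = train[y].asfactor()
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=seed)
    aml.train(x=x, y=y, training_frame=train)
    return aml


x, y = split_predictors(["class", "cap-shape", "odor"], "class")
print(x, y)  # ['cap-shape', 'odor'] class
```

Passing max_models to H2OAutoML as well would stop the run at whichever limit is hit first.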
If you wish to tune how AutoML runs, there are also plenty of optional parameters you can pass:
validation_frame: Used for early stopping of individual models in the AutoML run. It is a frame that you pass for validating the models; if you don’t pass one, part of the training data is used instead.
leaderboard_frame: If passed, the models will be scored and ranked on this frame instead of using cross-validation metrics. Again, if you don’t pass one, the values come from the training data.
nfolds: The number of folds for k-fold cross-validation, 5 by default. Setting it to 0 disables cross-validation, which also disables Stacked Ensembles and so tends to decrease overall model performance.
fold_column: Specifies a column holding the cross-validation fold index for each observation.
weights_column: Specifies a column of observation (row) weights. Giving a row a weight of 0 is equivalent to excluding it from the dataset.
ignored_columns: The converse of x, i.e. the columns to leave out of the predictors.
stopping_metric: Specifies the metric used for early stopping of the grid searches and models. The default is logloss for classification and deviance for regression.
sort_metric: The metric used to sort the leaderboard at the end. This defaults to AUC for binary classification, mean_per_class_error for multinomial classification, and deviance for regression.
Once the models have run, you can view which models have performed best, and consider these for further investigation.
lb = aml.leaderboard
lb.head()
Leaderboard of best models from H2O AutoML
To check that the model hasn’t been overfitted, we now run it on the test data:
preds = aml.predict(test)
Results
AutoML has given an accuracy and F1 score of 1.0 on the test data, suggesting the model hasn’t been overfit.
Results for best model from AutoML on test data
Clearly this is an exceptional case for AutoML, as we cannot improve on 100% accuracy on our test set without testing on more data.
Looking at many of the Kernels submitted to Kaggle for this dataset, it seems that lots of people (and even the Kaggle Kernel Bot) were also able to produce the same result using traditional machine learning methods.
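For reference, accuracy and F1 can be computed from the true and predicted labels with a few lines of plain Python. This is a sketch with made-up labels rather than the article's actual predictions; with H2O, I would expect the predicted labels to come from the "predict" column of the frame returned by aml.predict(test), converted to Python values first.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# A perfect set of predictions, like the result reported above
print(accuracy_and_f1([1, 0, 1, 0], [1, 0, 1, 0]))  # (1.0, 1.0)

# One missed positive lowers both scores
acc, f1 = accuracy_and_f1([1, 0, 1, 0], [1, 0, 0, 0])
print(acc, round(f1, 3))  # 0.75 0.667
```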
Future Work
The next step would be to save the trained model.
There are two ways to save the leader model: binary format and MOJO format.
If you’re taking your model to production, then it is suggested to use MOJO format since it’s optimised for production use.
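A minimal sketch of both options is below, assuming aml is the trained H2OAutoML object from earlier and that the leader model supports MOJO export (not every model type does in every H2O version). The function name and the output directory are placeholders of mine.

```python
def save_leader(aml, out_dir):
    """Save the AutoML leader in binary and MOJO formats; return both file paths."""
    import h2o  # imported here so the module loads without H2O installed

    # Binary format: reload later with h2o.load_model(path), same H2O version
    binary_path = h2o.save_model(model=aml.leader, path=out_dir, force=True)
    # MOJO format: a zip file intended for scoring outside the H2O cluster
    mojo_path = aml.leader.download_mojo(path=out_dir)
    return binary_path, mojo_path
```

Calling save_leader(aml, "./models") after training should leave both files in the chosen folder.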
Now that you have found the best model for the data, further exploration can be done into steps that will increase the performance of the model.
Maybe the best model on the training data overfits and another of the top models is preferred.
Perhaps the data could be better prepared for some of the models, or only the most important features could be selected.
Many of the best models in H2O AutoML use Ensemble methods and maybe the models the ensemble uses can be tuned further.
Although AutoML alone won’t get you top spot in machine learning competitions, it is definitely worth considering as an addition alongside your blended and stacked models.
AutoML can handle a variety of problem types, including binary classification (as shown here), multi-class classification and regression.
Conclusions
AutoML is a great tool to help (not replace) the work that data scientists do.
I look forward to seeing the advances that can be made in AutoML frameworks and how they can benefit all of us as data scientists, as well as the organisations we serve.
A single automated mixer certainly cannot outperform a human creative mind when it comes to feature engineering but AutoML is a tool worth exploring in your next data project.
Image from TechRadar
Key Vocab
AutoML: A framework for automating some of the tasks typically performed by a data scientist.
H2OFrame: H2O version of a Pandas DataFrame.
Stacking: A model which takes several different models and creates a prediction based on the results of these “sub-models”.
Further Reading
Tutorial on H2O including AutoML and other features: http://docs.
html
To learn more about stacked models: http://blog.