Model’ section, which gets into the nitty gritty for developing a single model.
You can come back to it later and use it as a reference guide, alongside the H2O documentation links provided.
You can follow along using the code snippets in the post or the interactive Python Notebook linked below.
Getting Started with H2O Python

Setting Up Dependencies

As much as I wish I could say implementation is just as easy as throwing out another pip install command, it’s a fair bit more involved than that.
To begin, head over to the H2O stable link here, and download the zip file containing the most recent version.
Follow the commands below to finish installing the package.
cd ~/Downloads
unzip h2o-3.
4698
java -jar h2o.jar

As of writing this article (June 2019), H2O only supports Java SE Runtime Environment versions 8–11.
You can check your version using the java -version command.
If you have JDK 12, you’ll have to uninstall it and downgrade to JDK 11 to maintain compatibility with H2O.
To do so on macOS, run /usr/libexec/java_home -V in the terminal to list the installed JDKs.
Copy the pathname it returns for JDK 12, and use the following command to uninstall it: sudo rm -rf pathname.
Head over to the Oracle JDK 11 download site, create an account, and follow the instructions there to install.
Now that you have the prerequisite packages installed, open up your Python script and execute the code below.
The nthreads parameter simply controls the number of cores on which to perform the operations, and -1 allocates the maximum number of cores available to H2O.
import h2o
h2o.init(nthreads=-1, max_mem_size=8)
h2o.connect()

Data Preprocessing

If you’re doing data science in Python, use Pandas DataFrames.
If you’re coding algorithms from scratch and use NumPy arrays, fine.
But if you think for a moment that H2OFrames are the stuff worth learning, I’ll save you a lot of time and trouble: they aren’t.
To be fair, one of the reasons H2O runs much faster than scikit-learn models is the more efficient data structures it provides compared with Pandas.
For feature engineering, though, we’d recommend sticking with conventional DataFrames, then converting to an H2OFrame once you’re ready for importing.
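As a minimal sketch of that workflow: do the feature engineering in Pandas, then hand the result to H2O. The toy columns and the helper names pandas_to_h2o and demo are made up for illustration, and the conversion only works once h2o.init() has started a cluster, so the imports are kept inside the functions.

```python
def pandas_to_h2o(pandas_df):
    """Convert a pandas DataFrame to an H2OFrame (needs a running H2O cluster)."""
    import h2o  # imported lazily so the sketch can be read without H2O installed
    return h2o.H2OFrame(pandas_df)


def demo():
    """Hypothetical end-to-end example; call it only after h2o.init()."""
    import pandas as pd

    # Toy data with made-up column names, standing in for a real dataset.
    pdf = pd.DataFrame({"limit_bal": [20000, 120000], "age": [24, 26]})
    # Feature engineering stays in pandas.
    pdf["limit_per_year_of_age"] = pdf["limit_bal"] / pdf["age"]
    hf = pandas_to_h2o(pdf)
    # as_data_frame() converts the H2OFrame back to pandas when needed.
    return hf.as_data_frame()
```

Calling demo() round-trips the data through H2O, which is a quick way to check that column types survive the conversion as expected.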
Because this is an H2O tutorial after all, we’ll do our processing with H2OFrames.
H2O also has a great reference guide for data manipulation.
To begin, we’ll import the UCI credit card dataset using the h2o.
df = h2o.
csv’)

A good deal of the functions for H2OFrames are analogous to Pandas DataFrame functions.
For example, we can examine the columns using the .columns attribute, which returns a list of feature names (not a NumPy array).
We can drop unnecessary features using the drop() function (no axis specification needed).
df = df.
columns

In the code below, we’ll create a list of feature column names x and the target variable name y.
This formatting allows us to pass fewer data structures to the training and predicting functions (sklearn asks for X_train and y_train), significantly improving runtime.
y = 'default payment next month'
x = list(df.columns)
x.remove(y)

While we’re on the topic, it’s important to note that H2O will automatically assume a regression or classification model depending on the data type of the y variable.
A quick check using the df[y].types attribute shows us that our values are integers (1 for defaulting on loan, 0 otherwise).
We can then convert this column to a factor type using the asfactor() function.

df[y] = df[y].asfactor()

To create training and testing sets, we’ll use the H2O split_frame() function instead of the sklearn train_test_split() function.
To do so, we need to pass a list of fractions for the training and validation set sizes (testing set size is implicitly calculated).
The function returns a list, with the first element referring to the training set, the second corresponding to the validation set, and the third element being the test set.
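To make the implicit test fraction concrete, here is a small plain-Python sketch (not H2O code). The helper name implied_split_fractions and the 0.70 training ratio are assumptions made for illustration only.

```python
def implied_split_fractions(ratios):
    """Return the fraction of rows for each split, including the implicit last one."""
    total = sum(ratios)
    if any(r <= 0 for r in ratios) or total >= 1:
        raise ValueError("ratios must be positive and sum to less than 1")
    # Whatever is left over after the listed ratios goes to the final split.
    return list(ratios) + [1 - total]


# Hypothetical ratios: 70% training, 15% validation, remainder for testing.
fractions = implied_split_fractions([0.70, 0.15])
print([round(f, 2) for f in fractions])  # [0.7, 0.15, 0.15]
```

Keep in mind that split_frame() assigns rows probabilistically, so the resulting set sizes are close to these fractions rather than exact.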
splits = df.split_frame(ratios=[…, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

We can also check the size of each set using the .nrow attribute.
Note that each of these sets contains both the X variable features and the y target, which is different from our process if we were to have done it in sklearn.

print(train.nrow)
print(valid.nrow)
print(test.nrow)

Building a ‘Hello, World!’ Model

Model Construction

We’ll build our classifier using a random forest estimator, and again you’ll notice the similarities with scikit-learn.
Using the H2ORandomForestEstimator class, we instantiate the model we’ll use for classification (or regression, if we had an integer or float response variable).

from h2o.estimators.random_forest import H2ORandomForestEstimator
rf = H2ORandomForestEstimator(seed=1)

seed: This parameter, similar to random_state in other modules, simply controls the random numbers used when creating the model.
This is important for reproducibility during model validation with external datasets.
There are some other parameters specific to each model, such as ntrees or min_split_improvement, that can be specified.
To find out what these are for the model algorithm of your choosing, check out the H2O documentation.
Fitting & Predicting Outcomes

To fit our model to the data, we pass three parameters: the training_frame, the y column, and the x columns.
If the x parameter is left empty, however, H2O will use all columns except the y column when fitting.
Listed after the snippet are some other parameters you may wish to use that are not included in it; most of them are set when constructing the estimator rather than in the train() call.
Again, a full list can be found in the documentation here.

rf.train(x=x, y=y, training_frame=train)

nfolds: Number of folds to be used for cross validation.
For more on this, check out the H2O explanation of cross validation.
balance_classes: When we have a class imbalance in our target feature, we may want to resample our data, either by over-sampling the minority class or under-sampling the majority class when creating the new distribution.
If we set this parameter to True, we can also specify a class_sampling_factors parameter and pass a list of ratios to determine the resulting distribution.
ignored_columns: If we have columns that we do not wish to include when fitting but that would be helpful for comparing predicted values, such as an observation ID, we can specify them using this parameter.
categorical_encoding: Whereas in sklearn we handled categorical variables with separate functions in the preprocessing stage, H2O can handle each column with techniques such as one hot encoding or label encoding.
A quick note: To mark columns that need to be handled as categorical, use the .asfactor() function beforehand to change the datatype in the training and testing sets.
A full list of options for categorical_encoding can be found here.
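A hedged sketch of how a few of these parameters might be combined when constructing the estimator. The values (5 folds, one-hot encoding) are illustrative choices rather than settings from this tutorial, and build_random_forest is a hypothetical helper whose import is kept inside the function so the snippet can be read without H2O installed.

```python
# Illustrative values only; none of these settings come from the article.
rf_params = {
    "seed": 1,
    "nfolds": 5,                                 # 5-fold cross validation
    "balance_classes": True,                     # resample to even out the classes
    "categorical_encoding": "one_hot_explicit",  # one hot encode factor columns
}


def build_random_forest(params):
    """Instantiate a random forest estimator from a dict of keyword arguments."""
    from h2o.estimators.random_forest import H2ORandomForestEstimator
    return H2ORandomForestEstimator(**params)
```

With a cluster running, rf = build_random_forest(rf_params) followed by rf.train(x=x, y=y, training_frame=train) would fit the model as before, now with cross validation and class balancing switched on.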
Predictions

Returning predicted probabilities for each class is straightforward, and can be done using the predict() function.

y_hat = rf.predict(test_data=test)

Upon further examination of y_hat, we see that predict() has returned three columns for each observation: p0, the probability of the observation belonging to class 0; p1, the probability of the observation belonging to class 1; and predict, the predicted classification label.
Note that the decision boundary (threshold) behind the predict label is not necessarily 0.5: for binary classifiers, H2O by default assigns a predict label of 1 when the p1 value exceeds the threshold that maximizes the F1 score.
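To illustrate how a label follows from p1 and a cutoff, here is a plain-Python sketch. The probabilities and both thresholds are made up, and labels_from_p1 is a hypothetical helper, not part of the H2O API.

```python
def labels_from_p1(p1_values, threshold):
    """Assign label 1 when the class-1 probability exceeds the threshold, else 0."""
    return [1 if p1 > threshold else 0 for p1 in p1_values]


# Hypothetical class-1 probabilities, standing in for the p1 column of y_hat.
p1 = [0.10, 0.35, 0.50, 0.80]
print(labels_from_p1(p1, threshold=0.5))  # [0, 0, 0, 1]
print(labels_from_p1(p1, threshold=0.3))  # [0, 1, 1, 1]
```

The same probabilities produce different labels under different cutoffs, which is why it matters to know which threshold sits behind the predict column.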
Performance Evaluation

Truly understanding the basics of model evaluation is critical for decision makers in ultimately determining whether or not a model is suitable for deployment and user interaction.
I’ve already written an article on metrics for binary classification here if you’re unfamiliar with the subject or could use a refresher.
To retrieve a report on how well our model did, we can use the model_performance() function, and print the result to the console.
rf_performance = rf.model_performance(test)
print(rf_performance)

There are several parts to the output.
First, we see a header that specifies what type of metrics are reported (ModelMetricsBinomial), as well as the model type (drf).

ModelMetricsBinomial: drf
** Reported on train data.
45587629761245807
Mean Per-Class Error: 0.
5152722473155975

We also receive a confusion matrix with actual labels along the vertical axis and predicted labels along the horizontal.
Note: The values of the confusion matrix are reported at the threshold (the probability cutoff) that maximizes the F1 score, as opposed to 0.5.
Our output here identifies that as 0.

Confusion Matrix Calculated at Threshold that Maximizes F1 Score

What’s cool about H2O’s output is that it will automatically calculate every metric at each threshold value, then report the maximum metric value as well as the threshold at which it was attained.
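The idea of scanning thresholds for the best F1 score can be sketched in plain Python. This is an illustration of the concept under made-up data, not H2O’s internal implementation; the helper names and the toy labels and probabilities are assumptions.

```python
def f1_at_threshold(y_true, p1_values, threshold):
    """F1 score when label 1 is assigned to every p1 value above the threshold."""
    predicted = [1 if p > threshold else 0 for p in p1_values]
    tp = sum(1 for y, yp in zip(y_true, predicted) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, predicted) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, predicted) if y == 1 and yp == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, p1_values):
    """Try 0.0 and every observed probability as a cutoff; return (threshold, F1)."""
    candidates = sorted(set(p1_values) | {0.0})
    scored = [(f1_at_threshold(y_true, p1_values, t), t) for t in candidates]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1


# Hypothetical true labels and class-1 probabilities.
y_true = [0, 0, 1, 0, 1, 1]
p1 = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
best_t, best_f1 = best_f1_threshold(y_true, p1)
print(best_t, round(best_f1, 3))  # 0.3 0.857
```

Here the best cutoff is 0.3 rather than 0.5, because lowering the threshold recovers the positive case scored at 0.35 at the cost of a single false alarm.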
Again, if the vocabulary used here means nothing to you, I would very much recommend reading this article explaining evaluation metrics.
This component can be critical for understanding whether or not the model actually accomplishes what we need it to.
Imagine we’re developing a model that detects disease.
Sensitivity (also called recall) refers to the proportion of actual disease cases that we correctly predicted as such.
We would much rather maximize sensitivity than precision, simply because the cost of missing a truly diseased patient and forgoing early intervention would be far worse than that of a false alarm.
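To keep the terms straight, here is a small sketch that computes the relevant rates from confusion matrix counts. The patient counts are invented for illustration and rates_from_confusion is a hypothetical helper.

```python
def rates_from_confusion(tp, fp, tn, fn):
    """Compute sensitivity, specificity and precision from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were caught
        "specificity": tn / (tn + fp),  # share of actual negatives correctly cleared
        "precision": tp / (tp + fp),    # share of positive predictions that were right
    }


# Hypothetical screening results: 100 diseased and 900 healthy patients.
rates = rates_from_confusion(tp=90, fp=180, tn=720, fn=10)
print({name: round(value, 3) for name, value in rates.items()})
# {'sensitivity': 0.9, 'specificity': 0.8, 'precision': 0.333}
```

In this made-up scenario the model catches 90% of diseased patients even though only a third of its alarms are correct, which is the trade-off the disease example argues for.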
As an aside, we can return specific metrics by calling a metric function on the H2OModelMetrics object.
To see a full list of metrics that can be called, hit tab after typing the H2OModelMetrics object name and select an option from the dropdown list (and add () afterwards to return the metric of interest).
rf_performance.confusion_matrix()

Reference Links:
Guide to Metrics: https://towardsdatascience.com/hackcvilleds-4636c6c1ba53

Quick Cheatsheets for Other Model Types:

Deep Learning (DL): An artificial neural network that consists of layered neurons with each successive model layer improving on the previous one through aggregation.
For more background on how this works, check out our article on deep learning.
epochs: The number of times to stream the dataset.
l1: Adds L1 regularization for improved generalization and stability by setting many weight values to 0
l2: Adds L2 regularization for improved generalization and stability by lowering many weight values
adaptive_rate: The adaptive learning rate, which is enabled by default; disable it to tune the learning rate manually
loss: Specify the loss function with options: Automatic (default), Absolute, Quadratic, Huber, and CrossEntropy

Distributed Random Forest (DRF): The DRF creates a forest of regression or classification trees with a dataset.
ntrees: The number of trees
max_depth: Maximum tree depth
sample_rate: The row sampling rate from 0 to 1 (default 0.632)
col_sample_rate_per_tree: The column sample rate per tree from 0 to 1

Generalized Linear Model (GLM): A linear regression with flexible generalization for handling nonlinearity.
solver: The solver algorithm to use: auto, l_bfgs, irlsm, coordinate_descent, coordinate_descent_naive, gradient_descent_lh, or gradient_descent_sqerr.
lambda_: Regularization strength
alpha: Regularization distribution between L1 and L2

Gradient Boosting Machine (GBM): Uses an ensemble (collection of models) of weak decision trees through successive refinement
ntrees: The number of trees for the ensemble
learn_rate: The learning rate with range from 0 to 1
sample_rate: The row sampling rate from 0 to 1 (default 1)
col_sample_rate: The column sampling rate with range from 0 to 1.

Naive Bayes Classifier: A classification algorithm with strong assumptions of feature independence based on applying Bayes’ Theorem
max_hit_ratio_k: The maximum prediction number for hit ratio computation
min_prob: The minimum probability to be used for the observations
eps_prob: The cutoff below which a probability is replaced with min_prob

Reference Links:
Model Estimator Documentation: http://docs.…html#supervised
Parameter Documentation: http://h2o-release.…html

AutoML: Optimization Made Easy

As you’ve probably seen by now, choosing the best predictive model can be complicated and time consuming.
Not only would you need to identify the best model, but also the parameters that maximize the performance of that model.
Traditionally, we’d do this with Grid Search for hyperparameter tuning, but H2O automates the entire process across multiple models as well.
Automatic Machine Learning (AutoML) automates the process of selecting the best model by training a wide selection of models, enabling those without background expertise in the area to produce high performing models just as well as traditional methods.
Think of scikit-learn pipelines, just bigger and better.
Currently, AutoML in H2O version 3.16 trains the following models: a Random Forest, an Extremely-Randomized Forest, a random grid of Deep Neural Nets, a random grid of Gradient Boosting Machines (GBMs), and a fixed grid of Generalized Linear Models (GLMs); it then trains a Stacked Ensemble of these models.
Building the AutoML Estimator

We begin by importing the H2OAutoML class and fitting it to the training set, passing the x and y columns in the same way as we did with the previous models.

from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models=5, max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=train)

max_models: This controls the maximum number of models to build, not including stacked ensemble models.
max_runtime_secs: This is the maximum time in seconds that AutoML will run before training the final stacked ensemble models.
Performance Evaluation

We can see the best models and their performance on various metrics by calling the leaderboard attribute and storing the output in an H2OFrame lb, which we can then view with the head() function.

lb = aml.leaderboard
lb.head(rows=lb.nrows)

The output will contain a list of the best models ranked by AUC score, and display several goodness-of-fit metrics appropriate for classification or regression.

Leaderboard Output with Maximum Running Time of 300s

We can use the best model from the leaderboard to predict labels on the testing set, as well as review other metrics as shown before.
The best model is stored as aml.leader, and has all the functions and attributes a normal classifier would.
y_hat = aml.
confusion_matrix()

AutoML Documentation: http://docs.…html#

Moving into Production

One of the larger benefits of H2O is that all fully trained models can be stored as objects that are easily deployable to other Java environments for real-time scoring.
To do so, H2O can convert the models to MOJOs (Model Objects, Optimized).
Unlike POJOs (Plain Old Java Objects), MOJOs do not have a size restriction, and they are faster and take up less disk space.
An h2o-genmodel.jar file is produced as the output, and is a library for supporting scoring that contains the required readers and interpreters.
This file is also required when deploying the MOJO models to production.
The code below downloads the MOJO and the h2o-genmodel.jar file to the user-specified path.

download_mojo(path='pathname', get_genmodel_jar=True)

For more on how to access the model in production, see the H2O documentation below.
Production Documentation: http://docs.…html#

Parting Thoughts

By now, you’re familiar with the basics of the H2O API in Python and, with a little more help from StackOverflow and the documentation, should be able to get H2O up and running, build and train models, evaluate their performance, and perhaps even look into deployment.
During this process, I’ve found that more resources were available from the H2O documentation than from other sources.
Don’t discount StackOverflow, but don’t expect to find all your questions answered there either.
Let us know if you have any questions, and we’d be happy to provide additional resources as you get started with AutoML and H2O.