Democratising Machine Learning with H2O
Overview of H2O: the open source, distributed in-memory machine learning platform
Parul Pandey, Apr 20
It is important to make AI accessible to everyone for the sake of social and economic stability.
Kaggle days is a two-day event where data science enthusiasts can talk to each other face to face, exchange knowledge, and compete together.
Kaggle days San Francisco just concluded and as is customary, Kaggle also organised a hackathon for the participants.
I had been following Kaggle days on Twitter, and the following tweet from Erin LeDell (Chief Machine Learning Scientist at H2O.ai) caught my eye.
Source: Twitter
I have been experimenting with H2O for quite some time and found it really seamless and intuitive for solving ML problems.
Seeing it perform so well on the leaderboard, I thought it was time to write an article that makes it easy for others to transition into the world of H2O.
H2O.ai: The company behind H2O
H2O.ai is based in Mountain View, California and offers a suite of machine learning platforms.
H2O’s core strength is its high-performing ML components, which are tightly integrated.
H2O.ai is a Visionary in the Gartner Magic Quadrant for Data Science Platforms in the report released in January 2019.
Source: Gartner (January 2019)
Let’s take a brief look at the offerings of H2O.
H2O.ai Products and Solutions
H2O
H2O is an open source, distributed in-memory machine learning platform with linear scalability.
H2O supports the most widely used statistical & machine learning algorithms and also has an AutoML functionality.
H2O’s core code is written in Java and its REST API allows access to all the capabilities of H2O from an external program or script.
H2O Sparkling Water
Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark.
Sparkling Water is ideal for H2O users who need to manage large clusters for their data processing needs and want to transfer data from Spark to H2O (or vice versa).
H2O4GPU
H2O4GPU is an open source, GPU-accelerated machine learning package with APIs in Python and R that allows anyone to take advantage of GPUs to build advanced machine learning models.
H2O Driverless AI
Driverless AI’s UI
H2O Driverless AI is H2O.ai’s flagship product for automatic machine learning.
It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and model deployment.
With Driverless AI, data scientists of all proficiency levels can train and deploy modelling pipelines with just a few clicks from the GUI.
Driverless AI is a commercially licensed product with a 21-day free trial version.
What is H2O
The latest version, called H2O-3, is the third incarnation of H2O.
H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark.
H2O can easily and quickly derive insights from the data through faster and better predictive modelling.
High-Level Architecture
H2O makes it possible to import data from multiple sources and has a fast, scalable and distributed compute engine written in Java.
Here is a high-level overview of the platform.
A high-level architecture of H2O
Supported Algorithms
H2O supports many commonly used machine learning algorithms.
Algorithms supported by H2O
Installation
H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI.
In this article, I shall be working with only the Python implementation.
Also, you may want to look at the documentation for complete details.
Pre-requisites
Python
Java 7 or later, which you can get at the Java download page.
To build H2O or run H2O tests, the 64-bit JDK is required.
To run the H2O binary using either the command line, R or Python packages, only 64-bit JRE is required.
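Since H2O needs a Java runtime, it can save some head-scratching to confirm that one is visible before installing anything. The snippet below is only a minimal sketch using the Python standard library; the helper names are made up for illustration and it does not check whether the runtime is 64-bit.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


def java_version_text() -> str:
    """Return the raw text printed by `java -version`, or "" if Java is missing."""
    if not java_available():
        return ""
    # `java -version` traditionally writes to stderr, so capture both streams.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    print(java_version_text() or "No Java runtime found on PATH")
```

If the script reports that no runtime was found, install Java first and re-run it before moving on.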
Dependencies:
pip install requests
pip install tabulate
pip install "colorama>=0. 8"
pip install future
pip install
pip install -f http://h2o-release. html h2o
conda
conda install -c h2oai h2o=3. 2
Note: When installing H2O from pip in OS X El Capitan, users must include the --user flag.
For example:
pip install -f http://h2o-release. html h2o --user
For R installation please refer to the official documentation here.
Testing installation
Every new Python session begins by initializing a connection between the Python client and the H2O cluster.
A cluster is a group of H2O nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job.
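To make the idea of "every node works on a portion of the job" concrete, here is a toy illustration in plain Python. It is not how H2O is implemented internally; it only mimics the pattern of splitting rows into chunks, computing a partial result per chunk, and combining the partial results. All function names are invented for this sketch.

```python
from typing import List, Sequence, Tuple


def split_into_chunks(rows: Sequence[float], n_nodes: int) -> List[Sequence[float]]:
    """Deal rows out into n_nodes contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def partial_sum(chunk: Sequence[float]) -> Tuple[float, int]:
    """Work done by one pretend node: the sum and row count of its chunk."""
    return sum(chunk), len(chunk)


def distributed_mean(rows: Sequence[float], n_nodes: int) -> float:
    """Combine the per-node partial results into a global mean."""
    partials = [partial_sum(chunk) for chunk in split_into_chunks(rows, n_nodes)]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(distributed_mean([1, 2, 3, 4, 5, 6, 7], n_nodes=3))
```

The answer is the same as computing the mean in one go; the point is that no single node ever needs to see all the rows.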
To check if everything is in place, open a Jupyter Notebook and type in the following:
import h2o
h2o.init()
This is a local H2O cluster.
On executing the cell, some information will be printed on the screen in a tabular format displaying amongst other things, the number of nodes, total memory, Python version etc.
In case you need to report a bug, make sure you include all this information.
Also, h2o.init() first checks whether an H2O instance is already running and connects to it; only if none is found does it start a new one.
init() (in Python)
By default, the H2O instance uses all the cores and about 25% of the system’s memory.
However, in case you wish to allocate it a fixed chunk of memory, you can specify it in the init function.
Let’s say we want to give the H2O instance 4GB of memory and it should only use 2 cores.
init(nthreads=2, max_mem_size=4)
Now our H2O instance is using only 2 cores and around 4GB of memory.
However, we will go with the default method.
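For reference, the resource-capped start-up described above could be wrapped in a small helper like the one below. This is a sketch, not the article's original code: the helper names are hypothetical, and it passes the memory budget as a string such as "4G", which h2o.init() accepts alongside a plain integer number of gigabytes.

```python
def mem_size_arg(gigabytes: int) -> str:
    """Format a memory budget as a string such as "4G"."""
    if gigabytes <= 0:
        raise ValueError("memory budget must be positive")
    return f"{gigabytes}G"


def start_local_h2o(nthreads: int = -1, gigabytes: int = 4):
    """Start (or connect to) a local H2O instance with a capped footprint.

    nthreads=-1 asks H2O to use all available cores.
    """
    # Imported lazily so this file still loads on a machine without h2o.
    import h2o

    h2o.init(nthreads=nthreads, max_mem_size=mem_size_arg(gigabytes))
    return h2o


if __name__ == "__main__":
    start_local_h2o(nthreads=2, gigabytes=4)
```

Keeping the formatting in its own function makes it easy to validate the budget before a JVM is launched.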
Importing Data with H2O in Python
After the installation is successful, it’s time to get our hands dirty by working on a real-world dataset.
We will be working on a Regression problem using the famous wine dataset.
The task here is to predict the quality of white wine on a scale of 0–10 given a set of features as inputs.
Here is a link to the Github Repository in case you want to follow along or you can view it on my binder by clicking the image below.
Data
The data belongs to the white variants of the Portuguese “Vinho Verde” wine.
edu/ml/datasets/Wine+Quality
CSV File: (https://archive. csv)
Data Import
Importing data from a local CSV file.
The command is very similar to pandas.read_csv, and the data is stored in memory as an H2OFrame.
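As a hedged illustration of such an import, the sketch below wraps h2o.import_file() in a helper. The file name is a placeholder for wherever you saved the CSV, and the helper names are invented here. The UCI copy of the wine data is commonly distributed with semicolons rather than commas, so the sketch peeks at the header line and passes the separator explicitly; H2O's parser usually guesses this on its own, so treat that step as belt and braces.

```python
def guess_separator(header_line: str, candidates=(";", ",", "\t")) -> str:
    """Pick the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


def load_wine(path: str):
    """Load a local CSV into an H2OFrame (h2o.init() must have been called)."""
    # Imported lazily so this file still loads on a machine without h2o.
    import h2o

    with open(path, encoding="utf-8") as fh:
        sep = guess_separator(fh.readline())
    return h2o.import_file(path=path, sep=sep)


if __name__ == "__main__":
    import h2o

    h2o.init()
    # Hypothetical local path; adjust to your own copy of the file.
    wine = load_wine("winequality-white.csv")
    print(wine.shape)
```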
wine_data = h2o.
head(5) # The default head() command displays the first 10 rows.
Displaying the first 5 rows of the dataset
EDA
Let us explore the dataset to get some insights.
describe()
Exploring some of the columns of the dataset
All the features here are numeric; there are no categorical variables.
Now let us also look at the correlation of the individual features.
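As background, each cell of a correlation matrix like the one plotted in this section is a single number between -1 and 1. Assuming the usual Pearson coefficient is what is being shown, it can be computed by hand as in this small standard-library sketch (the function name is my own).

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Co-movement of the two columns around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

Values near 1 or -1 mean two features move together almost linearly, which is worth knowing before fitting a linear model on them.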
pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,10))
corr = wine_data.
index = wine_data.
heatmap(corr, annot = True, cmap='RdYlGn', vmin=-1, vmax=1)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()
Modeling with H2O
We shall build a regression model to predict the quality of the wine.
There are a lot of algorithms available in the H2O module both for Classification as well as Regression problems.
Splitting data into Test and Training sets
Since we have only one dataset, let’s split it into training and testing parts so that we can evaluate the model’s performance.
We shall use the split_frame() function.
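One detail worth knowing: H2O documents split_frame() as an approximate, probabilistic split rather than an exact one. That is why the shapes reported in this article, 3932 and 966 rows, add up to 4898 but come out at roughly 80.3% and 19.7% instead of exactly 80/20. The plain-Python sketch below imitates that behaviour by assigning each row independently; it is an illustration of the idea, not H2O's actual algorithm, and it will not reproduce H2O's row counts.

```python
import random
from typing import List, Tuple


def probabilistic_split(n_rows: int, ratio: float, seed: int) -> Tuple[List[int], List[int]]:
    """Send each row index to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < ratio else test).append(i)
    return train, test


if __name__ == "__main__":
    train_idx, test_idx = probabilistic_split(4898, ratio=0.8, seed=1234)
    print(len(train_idx), len(test_idx))
```

Because every row is decided independently, the split needs no global shuffle, which is convenient when the rows are spread over several machines.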
wine_split = wine_data.split_frame(ratios = [0.8], seed = 1234)
wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1] # rest 20% for testing
print(wine_train. shape)
(3932, 12) (966, 12)
Defining Predictor Variables
predictors = list(wine_data.
remove('quality') # Since we need to predict quality
predictors
Generalized Linear Model
We shall build a Generalized Linear Model (GLM) with default settings.
Generalized Linear Models (GLM) estimate regression models for outcomes following distributions from the exponential family.
In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions.
You can read more about GLM in the documentation.
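For intuition, a GLM with a Gaussian family and an identity link, fitted without any regularization, is ordinary least squares. The one-feature sketch below shows that closed-form fit in plain Python. It is a simplification for illustration only: H2O's estimator handles many features and may apply regularization by default, so its coefficients need not match a plain least-squares fit.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply a fitted (intercept, slope) pair to a new value."""
    intercept, slope = model
    return intercept + slope * x


if __name__ == "__main__":
    model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(model, predict(model, 4))
```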
# Import the function for GLM
from h2o. glm import H2OGeneralizedLinearEstimator
# Set up GLM for regression
glm = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')
# Use .train() to build the model
glm.train(x = predictors, y = 'quality', training_frame = wine_train)
print(glm)
GLM model’s parameters on the Training set
Now, let’s check the model’s performance on the test dataset
glm.model_performance(wine_test)
Making Predictions
Using the GLM model to make predictions in the test dataset.
predictions = glm.
head(5)
Similarly, you could use other supervised algorithms like Distributed Random Forest, Gradient Boosting Machines, and even Deep Learning.
You could also tune the hyperparameters.
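Hyperparameter tuning usually starts with listing the combinations to try. The sketch below enumerates a Cartesian grid in plain Python; the parameter names are borrowed from H2O's gradient boosting estimator, but the values are arbitrary examples. In H2O itself, the H2OGridSearch class in h2o.grid.grid_search takes a dictionary of this shape through its hyper_params argument, so this is only meant to show what such a search walks through.

```python
from itertools import product
from typing import Dict, List, Sequence


def cartesian_grid(space: Dict[str, Sequence]) -> List[dict]:
    """Expand {name: [values...]} into one dict per combination."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in product(*(space[name] for name in names))
    ]


# Example search space; the values are illustrative, not recommendations.
search_space = {
    "ntrees": [50, 100],
    "max_depth": [3, 5, 7],
    "learn_rate": [0.05, 0.1],
}

if __name__ == "__main__":
    for params in cartesian_grid(search_space):
        print(params)
```

The number of combinations is the product of the list lengths, so it pays to keep each list short or to sample the grid randomly.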
H2OAutoML: Automatic Machine Learning
Automated machine learning (AutoML) is the process of automating the end-to-end process of applying machine learning to real-world problems.
AutoML makes machine learning truly accessible, even to people with no major expertise in this field.
H2O’s AutoML automates the training and tuning of the models.
H2O AutoML: Available Algos
In this section, we shall be using the AutoML capabilities of H2O to work on the same regression problem of predicting wine quality.
Importing the AutoML Module
from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models = 20, max_runtime_secs=100, seed = 1)
Here AutoML will train at most 20 base models or stop after 100 seconds, whichever comes first.
The default runtime is 1 Hour.
train(x=predictors, y='quality', training_frame=wine_train, validation_frame=wine_test)
Leaderboard
Now let us look at the AutoML leaderboard.
leaderboard)
AutoML Leaderboard
The leaderboard displays the top 10 models built by AutoML with their parameters.
The best model, placed at the top, is a Stacked Ensemble.
The leader model is stored as aml.leader
Contribution of Individual Models
Let us look at the contribution of the individual models to this meta-learner.
metalearner = h2o.
std_coef_plot()
XRT (Extremely Randomized Trees) has the maximum contribution, followed by Distributed Random Forests.
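To see what "contribution" means here, it helps to picture how a stacked ensemble produces its output. Assuming a linear metalearner, the final prediction is a weighted sum of the base models' predictions, and the coefficient plot shows those weights. The sketch below is a toy version in plain Python; the model names, weights and predictions are invented for illustration and are not taken from the run in this article.

```python
from typing import Dict, List


def stack_predict(
    base_preds: Dict[str, List[float]],
    weights: Dict[str, float],
    intercept: float = 0.0,
) -> List[float]:
    """Combine per-model predictions row by row using metalearner weights."""
    n_rows = len(next(iter(base_preds.values())))
    return [
        intercept + sum(weights[name] * preds[i] for name, preds in base_preds.items())
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    # Hypothetical base-model predictions for two wines.
    base = {"xrt": [6.0, 5.0], "drf": [5.0, 5.0]}
    # Hypothetical metalearner weights: the larger weight contributes more.
    w = {"xrt": 0.75, "drf": 0.25}
    print(stack_predict(base, w))
```

A base model with a weight near zero adds little to the ensemble, which is what a short bar in the coefficient plot indicates.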
Predictions
preds = aml.predict(wine_test)
The code above is the quickest way to get started; however, to learn more about H2O AutoML it is worth taking a look at the in-depth AutoML tutorial (available in R and Python).
shutdown()
Using Flow: H2O’s Web UI
In the final leg of this article, let us have a quick overview of H2O’s open source web UI called Flow.
Flow is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks.
Launching Flow
Once H2O is up and running, all you need to do is point your browser to http://localhost:54321 and you’ll see its user interface, called Flow.
Launching H2O Flow
Flow Interface
Here is a quick glance at the Flow interface.
You can read more about using and working with it here.
H2O’s Flow interface
Flow is designed to help data scientists rapidly and easily create models, import files, split data frames and do all the things that would normally require quite a bit of typing in other environments.
Working
Let’s work through the same wine example, but this time with Flow.
The following video walks through model building and prediction using Flow and is largely self-explanatory.
Demonstration of H2O Flow
Conclusion
H2O is a powerful tool and, given its capabilities, it can really transform the data science process for good.
The capabilities and advantages of AI should be made available to everybody and not a select few.
This is the real essence of democratisation, and democratising data science is essential for resolving the real problems threatening our planet.