Machine Learning for Particle Data When You are Not a Physicist

How an H2O deep learning model can be used to do supervised classification with Python

Susan Li, Feb 27

This article introduces deep learning with H2O, the open source machine learning package by H2O.ai, and shows how an H2O deep learning model can be used to solve a supervised classification problem: using data from the ATLAS experiment to identify the Higgs boson.

We have no knowledge of particle physics, but we still want to apply advanced machine learning methods to see whether we can accurately classify particle collision events as either Higgs boson signal (s) or background noise (b).

Please note that applying machine learning methods to a domain you are not familiar with does not mean you can become a competent data scientist without domain expertise. They are different things.

With that in mind, let’s get started.

Installing H2O

Installing H2O in Python is rather straightforward. I followed the instructions in H2O’s official docs: Downloading & Installing H2O. It works like a charm!

Initialize H2O & Load the Data

The data set can be found here.

We load the H2O Python module and start up a 1-node H2O cloud on the local machine, allowing it to use all CPU cores and up to 2GB of memory. We clean up in case the cluster was already running, and import the estimator that builds a deep neural network model using CPUs on an H2OFrame.

import h2o
h2o.init(max_mem_size = 2)
h2o.remove_all()
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

Once the initialization is done, we can upload our data set to the H2O cluster. The data is imported into H2O Frames, which operate similarly to pandas data frames. In our case the cluster is running on our laptop.

higgs = h2o.[…]csv')

We can take a quick look at the data set.

higgs.describe()

describe() gives out a lot of information: the number of rows and columns in the data set, and summary statistics about each column such as data type, minimum value, mean value, maximum value, standard deviation, number of zeros and number of missing values, as well as the top 10 rows of the data set.
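To make these summary statistics concrete, here is a minimal sketch in plain Python that computes them for a single numeric column. The helper name summarize_column and the toy values are made up for illustration, and the use of the sample standard deviation is an assumption; this is not how H2O computes its summary.

```python
import statistics

def summarize_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "mean": statistics.mean(present),
        "max": max(present),
        "sd": statistics.stdev(present),  # assumption: sample standard deviation
        "zeros": sum(1 for v in present if v == 0),
        "missing": len(values) - len(present),
    }

# Hypothetical toy column, not taken from the Higgs data set.
toy_column = [0.0, 1.5, None, 2.5, 0.0, 4.0]
summary = summarize_column(toy_column)
print(summary)
```

Missing values are excluded before the statistics are computed, which is why the missing count is reported separately.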

Figure 1

Data Pre-processing

We split the data in the following way:

- 60% for training
- 20% for validation (hyperparameter tuning)
- 20% for final testing, which will be withheld until the end

The predictors are all the columns except “EventId” and “Label”. The response column is the last column, “Label”.
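The split and the column selection described above can be sketched in plain Python. This is only an illustration under assumptions: the function name split_rows, the row count and the column names are made up, and each row is assigned to a split by an independent random draw. As far as we know, H2O's own splitting is also probabilistic, so the resulting sizes are close to 60/20/20 rather than exact.

```python
import random

def split_rows(n_rows, ratios, seed):
    """Assign each row index to a split by an independent random draw.

    With ratios [0.6, 0.2] the leftover 0.2 goes to a third split, so the
    split sizes are close to 60/20/20 but not exact.
    """
    rng = random.Random(seed)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        draw = rng.random()
        cumulative = 0.0
        target = len(ratios)  # default: the leftover split
        for i, ratio in enumerate(ratios):
            cumulative += ratio
            if draw < cumulative:
                target = i
                break
        splits[target].append(row)
    return splits

train_idx, valid_idx, test_idx = split_rows(10_000, [0.6, 0.2], seed=2019)
print(len(train_idx), len(valid_idx), len(test_idx))

# Hypothetical column list with the layout described in the text:
# an ID column first, the label last, numeric predictors in between.
columns = ["EventId", "feature_a", "feature_b", "feature_c", "Label"]
predictors = columns[1:-1]  # everything except the first and last column
response = columns[-1]      # the last column
print(predictors, response)
```

Slicing with [1:-1] works here only because the ID column is first and the label is last; with a different column order the predictors would have to be selected by name.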

train, valid, test = higgs.split_frame([0.6, 0.2], seed = 2019)
higgs_X = higgs.col_names[1: -1]
higgs_y = higgs.col_names[-1]

Deep Learning Models

Model 1

We run our first deep learning model on the Higgs boson data set.

We need to predict the “Label” column, a categorical feature with 2 levels, so the deep learning model will be tasked with binary classification. The model uses all the predictors of the data set except “EventId”, and all of them are numerical. The first model will run for only one epoch, to get a feel for the model construction.

higgs_model_v1 = H2ODeepLearningEstimator(model_id = 'higgs_v1', epochs = 1, variable_importances = True)
higgs_model_v1.train(higgs_X, higgs_y, training_frame = train, validation_frame = valid)
print(higgs_model_v1)

We print out the model to investigate more:

Figure 2

There is quite a lot of information, but we have seen all of it before:

- Error metrics on the training set, such as log loss, mean per-class error, AUC, Gini, MSE and RMSE
- Confusion matrix for the max F1 threshold
- Threshold values for different metrics
- Gains / Lift table

The validation set metrics were also printed out:

Figure 3

The results on the training set and validation set were pretty close.

So using our simplest deep learning model, we get an AUC of about 0.994 on the validation set and 0.995 on the training set. The log loss is 0.09 on the validation set and 0.085 on the training set.
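For readers who have not met these two metrics before, here is a minimal sketch of how log loss and AUC can be computed by hand. The labels and probabilities below are made up and use 1 for signal and 0 for background instead of the data set's s / b coding. The function names are ours, and the pairwise AUC loop is written for clarity, not speed; it is not how H2O computes the metric.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels (1 = signal, 0 = background)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def auc(labels, probs):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    positives = [p for y, p in zip(labels, probs) if y == 1]
    negatives = [p for y, p in zip(labels, probs) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted signal probabilities, for illustration only.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]
toy_log_loss = log_loss(toy_labels, toy_probs)
toy_auc = auc(toy_labels, toy_probs)
print(round(toy_log_loss, 4), round(toy_auc, 4))
```

Lower log loss is better, while AUC ranges from 0.5 for random ranking to 1.0 for a perfect ranking, which is why the two numbers move in opposite directions as a model improves.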

Variable Importances

When building classification models in H2O, we can see the variable importance table, in descending order of importance, in Python like so:

var_df = pd.DataFrame(higgs_model_v1.varimp(), columns = ['Variable', 'Relative Importance', 'Scaled Importance', 'Percentage'])
var_df.head(10)

Figure 4

Scoring History

To look at the scoring history, we can use the scoring_history method to retrieve the data as a pandas data frame and then plot the classification error.

higgs_v1_df = higgs_model_v1.scoring_history()
plt.plot(higgs_v1_df['training_classification_error'], label="training_classification_error")
plt.plot(higgs_v1_df['validation_classification_error'], label="validation_classification_error")
plt.title("Higgs Deep Learner")
plt.legend();

Figure 5

pred = higgs_model_v1.predict(test).as_data_frame(use_pandas=True)
test_actual = test.as_data_frame(use_pandas=True)['Label']
(test_actual == pred['predict']).mean()

Figure 6

The accuracy we achieved with this simple deep learning model is already 0.[…]
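The accuracy here is simply the fraction of test rows where the predicted label equals the actual label, which is what taking the mean of an element-wise comparison gives. A minimal sketch in plain Python, with made-up labels and a helper name of our own:

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual label."""
    matches = [a == p for a, p in zip(actual, predicted)]
    return sum(matches) / len(matches)

# Made-up labels using the same 's' (signal) / 'b' (background) coding.
toy_actual = ["s", "b", "b", "s", "b"]
toy_predicted = ["s", "b", "s", "s", "b"]
toy_accuracy = accuracy(toy_actual, toy_predicted)
print(toy_accuracy)
```

Accuracy depends on the threshold used to turn probabilities into labels, so it can differ between models even when their AUC is similar.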


Model 2

To improve the results, we now run another, smaller network and let it stop automatically once the misclassification rate converges (specifically, if the moving average of length 2 does not improve by at least 1% for 2 consecutive scoring events). We also sample the validation set down to 10,000 rows for faster scoring.
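One way to read this stopping rule is sketched below in plain Python. The function name should_stop and the toy error histories are made up for illustration, and the logic is our simplified interpretation of a moving-average rule for a metric where lower is better. It is not H2O's actual implementation.

```python
def should_stop(history, stopping_rounds, stopping_tolerance):
    """Simplified early-stopping check for a metric where lower is better."""
    k = stopping_rounds
    if len(history) < 2 * k:
        return False  # not enough scoring events to compare yet
    recent = history[-2 * k:]
    # k + 1 moving averages of window length k over the last 2k events
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference = averages[0]          # moving average before the last k events
    best_recent = min(averages[1:])  # best moving average among the last k
    # Stop when the best recent average has not improved on the reference
    # by at least the relative tolerance.
    return best_recent > reference * (1 - stopping_tolerance)

# Made-up misclassification rates, one value per scoring event.
still_improving = [0.30, 0.20, 0.15, 0.10]
flattened_out = [0.30, 0.20, 0.15, 0.149, 0.1489, 0.1488]

print(should_stop(still_improving, stopping_rounds=2, stopping_tolerance=0.01))
print(should_stop(flattened_out, stopping_rounds=2, stopping_tolerance=0.01))
```

Averaging over a window makes the check less sensitive to a single noisy scoring event, which matters when the validation set is sampled down for speed.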

higgs_model_v2 = H2ODeepLearningEstimator(model_id = 'higgs_v2', hidden = [32, 32, 32], epochs = 1000000, score_validation_samples = 10000, stopping_rounds = 2, stopping_metric = 'misclassification', stopping_tolerance = 0.01)
higgs_model_v2.train(higgs_X, higgs_y, training_frame = train, validation_frame = valid)

Scoring History

To look at the scoring history, we plot the classification error for our second model.

higgs_v2_df = higgs_model_v2.scoring_history()
plt.plot(higgs_v2_df['training_classification_error'], label="training_classification_error")
plt.plot(higgs_v2_df['validation_classification_error'], label="validation_classification_error")
plt.title("Higgs Deep Learner (Early Stop)")
plt.legend();

Figure 7

Way better! And the accuracy improved too.

pred = higgs_model_v2.predict(test).as_data_frame(use_pandas=True)
test_actual = test.as_data_frame(use_pandas=True)['Label']
(test_actual == pred['predict']).mean()

Figure 8

We are going to get the variable importances plot from the second model.

higgs_model_v2.varimp_plot();

Figure 9

AutoML: Automatic Machine Learning

Last but not least, let’s try H2O’s AutoML.

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models. We then print out the AutoML Leaderboard to see which models are the top performers.

from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models = 10, max_runtime_secs=100, seed = 1)
aml.train(higgs_X, higgs_y, training_frame = train, validation_frame = valid)
aml.leaderboard

Figure 10

AutoML built 5 models, including GLM (Generalized Linear Model), DRF (Distributed Random Forest), XRT (Extremely Randomized Trees) and two stacked ensemble models (the 2nd and 3rd), and the best model is XRT.

It turns out my proud deep learning models are not even on the Leaderboard.

The Jupyter notebook can be found on GitHub.

Enjoy the rest of the week.

Reference: H2O Deep Learning Doc.
