Build XGBoost / LightGBM models on large datasets — what are the possible solutions?

Please comment below.

Before I dive into these tools, there are a few things worth knowing.

XGBoost vs LightGBM

XGBoost is a very fast and accurate ML algorithm, but it’s now challenged by LightGBM, which runs even faster (on some datasets it’s 10X faster according to their benchmark), with comparable model accuracy and more hyperparameters for users to tune.

The key difference in speed comes from how the trees are grown: XGBoost splits the tree nodes one level at a time, while LightGBM splits one node at a time.
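To make the difference in growth order concrete, here is a toy sketch in plain Python. The split gains are made-up numbers and the node numbering is an assumption for illustration; real libraries compute gains from the data as the tree grows, and level-wise growth is normally limited by depth rather than by a split budget.

```python
import heapq

# Hypothetical split gains for candidate nodes, keyed by heap-style index
# (root = 1, children of node i are 2*i and 2*i + 1). Made-up numbers.
GAINS = {1: 10.0, 2: 1.0, 3: 8.0, 4: 0.5, 5: 0.4, 6: 6.0, 7: 3.0}


def level_wise_order(gains, max_splits):
    """Split every node on a level before moving to the next level."""
    order = []
    level = [1]
    while level and len(order) < max_splits:
        next_level = []
        for node in level:
            if node in gains and len(order) < max_splits:
                order.append(node)
                next_level += [2 * node, 2 * node + 1]
        level = next_level
    return order


def leaf_wise_order(gains, max_splits):
    """Always split the current leaf with the largest gain."""
    order = []
    heap = [(-gains[1], 1)]  # max-heap emulated with negated gains
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node, 2 * node + 1):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order


print(level_wise_order(GAINS, 4))
print(leaf_wise_order(GAINS, 4))
```

With the same budget of four splits, the leaf-wise order spends its splits on the high-gain branch, while the level-wise order also splits low-gain nodes simply because they sit on the current level.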

XGBoost developers later improved their algorithm to catch up with LightGBM, allowing users to also run XGBoost in split-by-leaf mode (grow_policy = ‘lossguide’).

Now XGBoost is much faster with this improvement, but LightGBM is still about 1.3X to 1.5X the speed of XGB based on my tests on a few datasets.

(You’re welcome to share your test outcomes!)

Readers can go with either option based on their own preference.
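As a rough sketch of how the split-by-leaf mode mentioned above might be switched on, the helper below builds a parameter dict for XGBoost's native training API. The helper name and the specific values (learning rate, number of leaves, depth) are illustrative assumptions, not tuned recommendations.

```python
def xgb_params(leaf_wise: bool) -> dict:
    """Return a parameter dict for XGBoost's native API (illustrative values)."""
    params = {
        "objective": "binary:logistic",
        "eta": 0.1,             # learning rate
        "tree_method": "hist",  # histogram-based tree construction
    }
    if leaf_wise:
        # lossguide: split the leaf with the highest loss reduction first
        params.update({"grow_policy": "lossguide", "max_leaves": 50, "max_depth": 0})
    else:
        # depthwise: XGBoost's default, level-by-level growth
        params.update({"grow_policy": "depthwise", "max_depth": 6})
    return params


print(xgb_params(leaf_wise=True))
# Assumed usage, given an existing DMatrix named dtrain:
# booster = xgboost.train(xgb_params(leaf_wise=True), dtrain, num_boost_round=100)
```

In lossguide mode the tree size is controlled by max_leaves, so max_depth is set to 0 (no depth limit) here.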

One more thing to add here: XGBoost supports “monotonic constraints”, a feature that LightGBM lacked as of this writing.

It will sacrifice some model accuracy and increase training time, but may improve model interpretability.
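For illustration, XGBoost takes these constraints through its monotone_constraints parameter, one entry per feature: 1 for increasing, -1 for decreasing, 0 for unconstrained. The helper below is a hypothetical convenience function, and the feature names and directions are assumptions.

```python
def monotone_constraints(features, directions):
    """Build the tuple-style string XGBoost accepts, e.g. "(1,0,-1)"."""
    for name, sign in directions.items():
        if name not in features:
            raise ValueError(f"unknown feature: {name}")
        if sign not in (-1, 0, 1):
            raise ValueError(f"constraint for {name} must be -1, 0 or 1")
    # Features without an explicit direction stay unconstrained (0)
    return "(" + ",".join(str(directions.get(f, 0)) for f in features) + ")"


features = ["A", "B", "C"]
# Assumption: predictions should rise with A and fall with C
constraint = monotone_constraints(features, {"A": 1, "C": -1})
params = {"tree_method": "hist", "monotone_constraints": constraint}
print(params)
```

The order of the entries has to match the order of the feature columns in the training data.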

(Reference: https://xgboost.



html and https://github.com/dotnet/machinelearning/issues/1651)

Find the “sweet spot” in gradient boosting trees

For the random forest algorithm, the more trees are built, the lower the variance of the model.

But beyond some point, you can’t really improve the model further by adding more trees.

XGBoost and LightGBM do not work this way.

The model accuracy keeps improving as the number of trees increases, but after a certain point the performance begins to drop, a sign of overfitting, and it gets worse as more trees are built.

In order to find the ‘sweet spot’, you can use cross validation or a simple training-validation split together with early stopping to find where training should stop; or you can build a few models with different numbers of trees (say 50, 100, 200) and pick the best one among them.
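The early-stopping rule itself is simple. Here is a minimal sketch in plain Python, assuming a validation metric where higher is better and made-up AUC values; real libraries apply the same idea while the model is training.

```python
def best_iteration(val_scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric.

    Training stops once `patience` rounds pass without a new best score.
    """
    best_idx = 0
    for i, score in enumerate(val_scores):
        if score > val_scores[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_scores) - 1


# Made-up validation AUC after each boosting round
auc = [0.70, 0.74, 0.77, 0.79, 0.80, 0.795, 0.798, 0.79, 0.78]
best, stopped = best_iteration(auc, patience=3)
print(f"best round: {best + 1}, training stopped after round {stopped + 1}")
```

The final model should use the number of trees at the best round, not at the round where training stopped.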

If you don’t care about extreme performance, you can set a higher learning rate and build only 10–50 trees (say).

It may under-fit a bit, but you still get a pretty accurate model, and this way you save the time spent finding the optimal number of trees.

Another benefit of this approach is that the model is simpler (fewer trees built).


1. XGBoost4j on Scala-Spark

If the reader plans to go with this option, https://xgboost.



html is a good starting point.

I’d like to point out a few issues here (as of the time this article was posted):

XGBoost4j doesn’t support PySpark.

XGBoost4j doesn’t support split-by-leaf mode, making it way slower.


com/dmlc/xgboost/issues/3724

Because it’s on Spark, all missing values have to be imputed (vector assembler doesn’t allow missing values).

This may reduce model accuracy.




html

Early stopping may still contain bugs.

If you follow their latest releases at https://github.com/dmlc/xgboost/releases, you’ll find they were still fixing these bugs recently.
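On the imputation issue above: one common way to limit the accuracy loss is to keep a flag recording which values were originally missing. This is an assumption on my part rather than something taken from the XGBoost4j docs. A minimal plain-Python sketch of the idea on a single column, using median imputation as one arbitrary choice:

```python
from statistics import median


def impute_with_indicator(column):
    """Replace None with the column median and return a missing-flag column."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in column]
    flags = [1 if v is None else 0 for v in column]
    return imputed, flags


values = [3.0, None, 5.0, 1.0, None]
imputed, was_missing = impute_with_indicator(values)
print(imputed)
print(was_missing)
```

On Spark the same two columns would be built with DataFrame operations before the vector assembler step.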


2. LightGBM on Spark (Scala / Python / R)

The major issues based on my personal experience:

Lack of documentation and good examples.



md

All missing values have to be imputed (similar to XGBoost4j).

I also had issues with the “early stopping” parameter in the Spark cross validator.

(To test whether it’s working properly, pick a smaller dataset, set a very large number of rounds with early stopping = 10, and see how long it takes to train the model.

After it’s trained, compare the model accuracy with that of a model built using Python.

If it overfits badly, it’s likely that early stopping is not working at all.)

Some example code (not including the vector assembler):

    from mmlspark import LightGBMClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Large number of rounds; early stopping should cut training short
    lgb_estimator = LightGBMClassifier(learningRate=0.1, numIterations=1000,
                                       earlyStoppingRound=10, labelCol="label")

    # Grid over the number of leaves per tree
    paramGrid = ParamGridBuilder().addGrid(lgb_estimator.numLeaves, [30, 50]).build()

    # Named evaluator rather than eval to avoid shadowing the Python builtin
    evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

    crossval = CrossValidator(estimator=lgb_estimator, estimatorParamMaps=paramGrid,
                              evaluator=evaluator, numFolds=3)

    # train_df is a Spark DataFrame with an assembled "features" column
    cvModel = crossval.fit(train_df[["features", "label"]])

3. XGBoost on H2O.ai

This is my personal favorite solution.

The model can be built using H2O.ai, integrated in a PySparkling Water (H2O.ai + PySpark) pipeline:

https://www.


net/0xdata/productionizing-h2o-models-using-sparkling-water-by-jakub-hava

It’s easy to build a model with an optimized number of rounds using cross validation:

    from h2o.estimators import H2OXGBoostEstimator

    # binary classification
    features = ['A', 'B', 'C']
    train['label'] = train['label'].asfactor()  # train is an H2O frame

    cv_xgb = H2OXGBoostEstimator(
        ntrees = 1000,
        learn_rate = 0.1,
        max_leaves = 50,
        stopping_rounds = 10,
        stopping_metric = "AUC",
        score_tree_interval = 1,
        tree_method = "hist",
        grow_policy = "lossguide",
        nfolds = 5,
        seed = 0)

    cv_xgb.train(x = features, y = 'label', training_frame = train)

The XGBoost model can be saved and used in Python with cv_xgb.save_mojo().

Use h2o.save_model() if you’d like to save the model in H2O format instead.

My only complaint is that the saved model (the one saved with save_mojo) can’t be used with the SHAP package to generate SHAP feature importance (but the XGBoost feature importance, .get_fscore(), works fine).

It seems there are some issues with the original XGBoost package.




4. XGBoost on SageMaker

This is a pretty new solution by AWS.

Its two main features are automatic hyperparameter tuning with Bayesian optimization and the ability to deploy the model as an endpoint.

A few examples can be found on their Github: https://github.


Here are some of my concerns with it:

The parameter tuning tools are less friendly to users (data scientists) compared to other solutions:

https://github.


ipynb and https://github.


ipynb

Whether Bayesian optimization is the best option to tune XGB parameters is still unknown.

If you check out the papers, gradient boosting trees are not mentioned / tested.





html

The parameters are tuned with a single validation set, not with cross validation.

I haven’t figured out how to use the model trained with its built-in XGBoost algorithm in Python.

But other than these issues, we can still leverage its endpoint feature.

You can train your XGB model anywhere, put it in the XGBoost image from Amazon ECR (Elastic Container Registry), and then deploy it as an endpoint.
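On the single-validation-set concern above: if you tune parameters yourself before deploying, cross validation is easy to set up by hand. Below is a minimal sketch of a k-fold splitter in plain Python. The function names are hypothetical, the rows are assumed to be shuffled already, and score_fn stands in for whatever trains a model on the training rows and scores it on the validation rows.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


def cross_val_score(score_fn, n_rows, k):
    """Average a user-supplied score_fn(train_idx, valid_idx) over k folds."""
    scores = [score_fn(train, valid) for train, valid in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```

Averaging the metric over folds gives a less noisy estimate than a single split, at the cost of training k models per parameter setting.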

* * * * *

XGBoost / LightGBM are rather new ML tools, and they both have the potential to become stronger.

The developers have already done a fantastic job creating these tools to make people’s lives easier.

I point out some of my observations and share my experience here, in the hope that they can become even better and easier-to-use tools.

* * * * *

My other posts in Towards Data Science:

A step-by-step guide for creating advanced Python data visualizations with Seaborn / Matplotlib

10 Python Pandas tricks that make your work more efficient

An interesting and intuitive view of AUC

Plotting decision boundaries in 3D — Logistic regression and XGBoost

XGBoost deployment made easy
