Boosting your way to the top with XGBoost

A digestible paper review of XGBoost

Steven Liu · Apr 3

XGBoost was used in the winning solution of the Higgs Boson Machine Learning Challenge

Chances are, you’ve heard of XGBoost (eXtreme Gradient Boosting).
It is one of the most popular machine learning (ML) algorithms in recent years, starred and forked thousands of times on GitHub.
XGBoost is used in production by real-world companies, and has become a standard choice for participants in Kaggle competitions.
In this article, I will provide a layman’s review of the original paper by Tianqi Chen and Carlos Guestrin (read the paper here).
Introduction

Today, we generate tremendous amounts of data.
This data is used for many applications such as classifying medical images, predicting the lowest airfare, and creating recommendations for media and product consumption.
There are two important criteria for these applications:
- Be able to capture the patterns within the data
- Be capable of scaling to handle immense quantities of data

XGBoost is a scalable ML system for tree boosting that meets the above criteria, and yields state-of-the-art results on a wide range of problems.
What distinguishes XGBoost from other ML model variants is its scalability to massive datasets, and ability to run greater than 10X faster than existing solutions.
This scalability and speed are the result of several tweaks:
- a weighted quantile sketch procedure to handle instance weights
- a novel algorithm for handling sparse data
- a cache-aware block structure for out-of-core computation, to enable efficient processing of hundreds of millions of examples

Don’t worry if you don’t understand what this means right now; the impact of these changes will become more apparent as you read on!

Tree boosting for people in a hurry

Let’s review what we need to know about gradient tree boosting.
If you aren’t familiar with decision trees, check out this section from The Mechanics of Machine Learning, by Terence Parr and Jeremy Howard.
Regularized learning objective

As with most models, minimizing loss — the difference between the prediction ŷᵢ and the target yᵢ — is our primary objective.
We want our model to be as accurate as possible.
It’s also important to add a regularization term (Ω) to prevent overfitting — a situation where the model ends up memorizing the dataset, and as a result, doesn’t perform accurately on new datasets.
We can express the objective succinctly as the training loss plus a complexity penalty on each tree: obj = Σᵢ l(ŷᵢ, yᵢ) + Σₖ Ω(fₖ), where Ω(f) = γT + ½λ‖w‖² for a tree with T leaves and leaf weights w.

Formula for the regularized learning objective (source)

We also use two additional regularization techniques to prevent overfitting.
The first is shrinkage, which adds a scaling factor to new weights after each boosting step.
This prevents a few individual trees from dominating the ensemble, leaving room for future trees to improve the model.
The second technique is feature subsampling, where a fraction of features are sampled at each step, to increase the variety of trees built.
Feature subsampling also increases computation speed, allowing XGBoost to scale efficiently.
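To make shrinkage and feature subsampling concrete, here is a toy boosting loop in plain Python. This is a sketch only, not XGBoost’s implementation: `fit_tree` is a hypothetical stand-in that predicts the mean residual (a single-leaf "tree"), and the sampled `features` list merely marks which columns a real tree builder would be allowed to split on.

```python
import random

eta = 0.3        # shrinkage: scaling factor applied to each new tree's weights
colsample = 0.5  # feature subsampling: fraction of features sampled per tree

def fit_tree(X, residuals, features):
    # Hypothetical stand-in for a real tree builder: it would only be
    # allowed to split on `features`, but here it just predicts the
    # mean residual, which is enough to show the boosting mechanics.
    mean = sum(residuals) / len(residuals)
    return lambda x: mean

X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 2.0]]
y = [1.0, 2.0, 3.0, 4.0]
preds = [0.0] * len(y)

random.seed(0)
for _ in range(50):
    # sample a fraction of the feature columns for this boosting round
    features = random.sample(range(len(X[0])), int(len(X[0]) * colsample))
    residuals = [t - p for t, p in zip(y, preds)]
    tree = fit_tree(X, residuals, features)
    # shrinkage: the new tree contributes only eta times its prediction
    preds = [p + eta * tree(x) for p, x in zip(preds, X)]

print(preds)  # each prediction approaches the mean of y (2.5)
```

Because every new tree is scaled by eta, no single round can move the predictions very far, which is exactly how shrinkage keeps individual trees from dominating.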
Gradient tree boosting

When we train an ML model, it is common to optimize it by taking the gradient.
However, this is difficult to perform on an ensemble of trees.
Instead, we use an additive strategy, and add one new tree at a time, such that each tree we add optimizes our objective.
This gives us a scoring function that we can use to evaluate the quality of a tree structure.
An illustrated example of scoring the quality of a tree structure (source)

In a nutshell, the first- and second-order gradient statistics gᵢ and hᵢ are sorted into their appropriate leaves, summed, and then scored with the scoring function to determine how good the tree is.
Now that we have a method for assessing the quality of the tree structure, we need to learn the tree structure.
As we mentioned before, it is intractable to do this in one step over all possible trees, so we optimize one level of the tree at a time.
Beginning with a split at a leaf, we can calculate the gain, which is given by:

Formula for calculating the loss reduction at a split (source)

In plain old English, this formula just tells us to:
- Calculate the score on the new left leaf
- Calculate the score on the new right leaf
- Subtract the score on the original leaf
- Subtract the regularization term γ

The important thing to note here is that if the gain is negative (that is, if the improvement from splitting is smaller than γ), then we know not to add that branch, and we can prune it from our tree.
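The scoring function and gain formula above can be sketched in a few lines of plain Python. This assumes a squared-error loss, so gᵢ is the residual and hᵢ = 1 for every example; `lam` and `gamma` are illustrative regularization constants, not recommended settings.

```python
lam = 1.0    # L2 regularization on leaf weights (lambda)
gamma = 0.5  # complexity cost of adding a leaf (gamma)

def leaf_score(g_sum, h_sum):
    """Structure score of a single leaf: G^2 / (H + lambda)."""
    return g_sum ** 2 / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right):
    """Gain of splitting one leaf into left/right children:
    half of (left score + right score - original score), minus gamma."""
    g_parent = g_left + g_right
    h_parent = h_left + h_right
    return 0.5 * (leaf_score(g_left, h_left)
                  + leaf_score(g_right, h_right)
                  - leaf_score(g_parent, h_parent)) - gamma

# Example: residuals (g) for four examples with h = 1 each, split 2/2.
g = [-2.0, -1.0, 1.5, 2.5]
gain = split_gain(sum(g[:2]), 2.0, sum(g[2:]), 2.0)
print(gain > 0)  # a positive gain means the split is worth keeping
```

If the computed gain were negative, the improvement would not cover the γ cost of the extra leaf, and the branch would be pruned, which is exactly the condition described above.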
Split-finding algorithms

To find the best split, it is common to use an exact greedy algorithm, which calculates all possible splits on all the features.
This is a powerful and thorough method, but as you can probably guess, it is computationally expensive and impossible when the data does not fit into memory.
In order to meet the second criterion for scalability, we use an approximate algorithm instead.
Weighted quantile sketch

While exact greedy algorithms consider all splitting points, an approximate algorithm only considers candidate split points.
This is done by using quantiles to determine where to split, so that the candidates are distributed evenly across the data.
In other words, if we have 100 possible split points, we can get a pretty good approximation with just the 10-quantile (decile) split points, for an approximation factor of roughly ε = 0.1.
Now, each candidate point is weighted by hᵢ.
The reason is that the second-order gradient hᵢ reflects how much each point contributes to the loss, so treating every point as having equal weight would be misleading.
Hence, we use weighted quantiles, so that the first 10-quantile is the first point that is larger than 10% of the total weight, instead of 10% of the points.
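Here is a minimal sketch of picking weighted quantile candidates. This is a simplification: the paper’s actual sketch is a mergeable, approximate data structure designed for streaming and distributed data, whereas this toy version just sorts everything.

```python
def weighted_quantile_candidates(values, weights, eps):
    """Pick candidate split points so that consecutive candidates are
    roughly eps apart in cumulative weight (here, weight = h_i)."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    threshold = eps * total
    candidates, cum, last_cum = [], 0.0, 0.0
    for v, w in pairs:
        cum += w
        if cum - last_cum >= threshold:
            candidates.append(v)
            last_cum = cum
    return candidates

# Uniform weights: candidates land at the deciles of the values.
print(weighted_quantile_candidates(list(range(100)), [1.0] * 100, 0.1))

# A heavy first point (large h) earns its own candidate region.
print(weighted_quantile_candidates(list(range(10)), [5.0] + [1.0] * 9, 0.25))
```

With uniform weights this degenerates to ordinary quantiles; the second call shows how a point with a large hᵢ pulls the candidate boundaries toward itself, which is the whole point of weighting.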
Sparsity awareness

In real-world problems, we often encounter sparse datasets as a result of missing data or feature engineering.
XGBoost handles this sparsity by learning, from the data, a default direction in which to send the missing values at each split.
The default direction is chosen as the one that results in the greatest reduction in training loss.
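The idea can be sketched by scoring both default directions for the missing values and keeping the better one. This toy version assumes we already have the summed gradient statistics for the known-valued examples on each side of the split and for the missing examples; `lam` is an illustrative constant.

```python
lam = 1.0  # L2 regularization constant (illustrative)

def score(g, h):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g * g / (h + lam)

def best_default_direction(g_left, h_left, g_right, h_right,
                           g_missing, h_missing):
    """Route the missing examples' statistics left, then right, and
    return whichever direction yields the higher total score
    (i.e., the larger reduction in training loss)."""
    send_left = (score(g_left + g_missing, h_left + h_missing)
                 + score(g_right, h_right))
    send_right = (score(g_left, h_left)
                  + score(g_right + g_missing, h_right + h_missing))
    return "left" if send_left >= send_right else "right"

# Missing examples with negative gradients resemble the left child:
print(best_default_direction(-3.0, 2.0, 4.0, 2.0, -2.0, 1.0))  # prints 'left'
```

Because only the examples with present values need to be enumerated while scanning, the cost of split finding scales with the number of non-missing entries, which is where the speedup on sparse data comes from.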
The impact is clearly demonstrated when comparing the sparsity-aware algorithm, and a naive algorithm on a very sparse dataset.
XGBoost is able to run 50X faster!

Run time comparison of sparsity-aware and basic algorithm (source)

XGBoost design

Sorting all the data during tree learning is a very time-consuming process.
To reduce the cost of sorting, XGBoost stores data in block structures.
Data in each block is stored in a compressed column format, and each column is sorted according to its corresponding feature value.
As a result, XGBoost can use multiple blocks when using approximate algorithms, enabling a parallelized learning process.
This block structure greatly reduces the complexity of split finding.
Linear scan over presorted columns to find best split (source)

Using the sorted block structure, the weighted quantile step simply becomes a left-to-right scan of the sorted columns to find the best split.
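The scan itself can be sketched in plain Python: with a column presorted by feature value, one pass accumulates the left-side gradient sums, and the right-side sums fall out by subtraction from the totals. `lam` and `gamma` are illustrative constants, and distinct feature values are assumed.

```python
lam, gamma = 1.0, 0.0  # illustrative regularization constants

def score(g, h):
    """Structure score of one leaf: G^2 / (H + lambda)."""
    return g * g / (h + lam)

def best_split(column):
    """column: list of (feature_value, g, h) tuples, presorted by
    feature_value, as in a block's compressed column. One linear scan
    finds the split with the highest gain."""
    g_total = sum(g for _, g, _ in column)
    h_total = sum(h for _, _, h in column)
    g_left = h_left = 0.0
    best_gain, best_value = 0.0, None
    for value, g, h in column[:-1]:      # no split after the last example
        g_left += g
        h_left += h
        gain = 0.5 * (score(g_left, h_left)
                      + score(g_total - g_left, h_total - h_left)
                      - score(g_total, h_total)) - gamma
        if gain > best_gain:
            best_gain, best_value = gain, value
    return best_value, best_gain

column = [(0.1, -2.0, 1.0), (0.4, -1.0, 1.0), (0.6, 1.5, 1.0), (0.9, 2.5, 1.0)]
print(best_split(column))  # the best split is after feature value 0.4
```

Since each column is scanned independently, the scans over different blocks and features can run in parallel, which is what the block structure buys.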
Cache-aware access

While this new block structure is great, we still need to access the values stored in it, and this access is non-contiguous.
Problems occur when the values don’t fit in the CPU cache, which can cause the algorithm to stall.
To overcome this issue, XGBoost chooses an appropriate block size.
The block size is defined as the maximum number of examples contained in a block, which makes sense, since it reflects the cache storage cost of the gradient statistics.
As a result, XGBoost is able to achieve a good balance between efficient parallelization and preventing stalling.
Out-of-core computation

Remember that we are looking for scalable ML systems, which means we want to take advantage of all the resources a machine has to offer.
XGBoost accomplishes this by using disk space to handle any data that does not fit into main memory.
As we mentioned earlier, data is divided into multiple blocks, and each block is stored on disk.
To improve out-of-core computation, XGBoost uses:
- Block compression: remember that the blocks are stored in a compressed column format.
When they are fetched into main memory, they are decompressed on the fly, trading some extra decompression computation for a reduction in disk-reading cost.
- Block sharding: when you have multiple disks, XGBoost stores the blocks alternately on each disk.
The blocks are then fetched into main memory from each disk in turn, which increases disk-reading throughput.
Summary

If you haven’t already, you should consider including XGBoost in your ML toolbox.
It checks all the boxes required for modern ML challenges, namely the ability to rapidly scale to large datasets and to use computational resources efficiently.
Head on over here and give it a try!

References

Chen, T., & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Thanks for reading! Stay tuned for more as I continue on my path to becoming a data scientist! ✌️ Also, feel free to connect with me on LinkedIn here.