# Introduction to gradient boosting on decision trees with CatBoost

Because boosting focuses on misclassified examples, it risks overfitting the resulting composite model to such data.

• A greedy algorithm for constructing a linear combination of models.
• Each subsequent algorithm is constructed to correct the errors of the existing ensemble.

Different approximations are used for the threshold loss function. As an example, take the standard regression task with MSE as the loss function. Usually we take the simplest possible base algorithm, for instance a short decision tree. The second algorithm b2(x) must be fitted so as to minimize the error of the composition of b1(x) and b2(x), and the same holds for each subsequent bN(x).

## Gradient boosting

Gradient boosting is known to be one of the leading ensemble algorithms.

The gradient boosting algorithm uses the gradient descent method to optimize the loss function.

It is an iterative algorithm, and the steps are the following:

1. Initialise the first simple algorithm b0.
2. On each iteration, compute a shift vector s = (s1, ..., sl), the antigradient of the loss at the current predictions on the training sample.
3. Fit the next algorithm bN so that bN(xi) ≈ si.
4. Finally, add the algorithm bN to the ensemble.

There are several gradient boosting libraries available: XGBoost, H2O, LightGBM.
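The iterative procedure above can be sketched in plain Python (a toy illustration with decision stumps as base learners and MSE loss, where the antigradient is simply the residual y − F(x); this is a sketch of the general idea, not any library's actual code):

```python
# Toy gradient boosting for regression with MSE loss.
# Base learners are decision stumps; with MSE, the shift vector
# (negative gradient) is simply the residual y - F(x).

def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv

def predict(x, f0, stumps, lr=0.5):
    return f0 + lr * sum(s(x) for s in stumps)

def gradient_boost(xs, ys, n_rounds=200, lr=0.5):
    f0 = sum(ys) / len(ys)  # step 1: initial constant model b0
    stumps = []
    for _ in range(n_rounds):
        # step 2: shift vector s_i = y_i - F(x_i) (negative MSE gradient)
        residuals = [y - predict(x, f0, stumps, lr) for x, y in zip(xs, ys)]
        # steps 3-4: fit bN to the shift vector and add it to the ensemble
        stumps.append(fit_stump(xs, residuals))
    return f0, stumps

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
f0, stumps = gradient_boost(xs, ys)
print(max(abs(predict(x, f0, stumps) - y) for x, y in zip(xs, ys)))
```

After enough rounds the ensemble fits the training targets closely; real libraries differ in how the base trees are grown, which is exactly the topic of the next section.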

The main differences between them are in tree structure, feature engineering, and handling of sparse data.

## CatBoost

CatBoost can be used for solving problems such as regression, classification, multi-class classification, and ranking.

The modes differ in the objective function that we are trying to minimize during gradient descent.

Moreover, CatBoost has pre-built metrics to measure the accuracy of the model.

On the official CatBoost website you can find a comparison of CatBoost with major benchmarks. Figures in this table represent Logloss values (lower is better) for classification mode.

The percentage is the metric difference measured against tuned CatBoost results.

## CatBoost advantages

CatBoost introduces the following algorithmic advances:

An innovative algorithm for processing categorical features.

No need to preprocess features on your own — it’s performed out of the box.

For data with categorical features, the accuracy is better compared to other algorithms.

The implementation of ordered boosting, a permutation-driven alternative to the classic boosting algorithm.

On small datasets, gradient boosting overfits quickly, and CatBoost includes a special modification for such cases: on datasets where other algorithms struggle with overfitting, you won't observe the same problem with CatBoost.

Fast and easy-to-use GPU training.

You can simply install it via pip.

Other useful features: missing value support and great visualization.

## Categorical features

A categorical feature is one with a discrete set of values, called categories, that are not comparable to each other.

The main advantage of CatBoost is its smart preprocessing of categorical data.

You don’t have to preprocess data on your own.

Some of the most popular practices to encode categorical data are:

• One-hot encoding
• Label encoding
• Hashing encoding
• Target encoding

One-hot encoding is a popular approach for categorical features with a small number of distinct values.

CatBoost applies one-hot encoding to all features whose number of different values is less than or equal to the one_hot_max_size parameter.

For features with high cardinality (e.g., a “user ID” feature), such a technique leads to an infeasibly large number of new features.
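To make the cardinality problem concrete, here is a minimal one-hot encoder in plain Python (an illustrative sketch, not CatBoost's implementation). The number of new columns equals the number of distinct categories, which is exactly what makes high-cardinality features infeasible:

```python
def one_hot(values):
    """One-hot encode a column of categorical values."""
    categories = sorted(set(values))  # one new binary column per category
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, cols = one_hot(["red", "green", "red", "blue"])
print(cols)  # ['blue', 'green', 'red']
print(rows)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With a “user ID” column containing millions of distinct IDs, the same function would emit millions of columns per row.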

Another popular method is to group categories by target statistics (TS), which estimate the expected target value in each category.

The problem with such a greedy approach is target leakage: the new feature is computed using the target of the example itself.

This leads to a conditional shift: the distribution of the feature differs between training and test examples.

The common methods for solving this problem are holdout TS and leave-one-out TS, but they still do not fully prevent target leakage.

CatBoost uses a more effective strategy.

It relies on the ordering principle and is called Target-Based with prior (TBS).

It is inspired by online learning algorithms which get training examples sequentially in time.

The values of TS for each example rely only on the observed history.

To adapt this idea to the standard offline setting, we introduce an artificial “time”, i.e., a random permutation σ of the training examples.

In CatBoost, the data is randomly shuffled and the mean is calculated for every object using only its historical data.

Data can be reshuffled multiple times.
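The ordered target statistic can be sketched as follows (a simplified illustration: `prior` and the weight `a` smooth the estimate for rarely seen categories, and the function name is my own; see the CatBoost paper for the exact formula):

```python
import random

def ordered_target_stats(categories, targets, prior, a=1.0, seed=0):
    """Encode each example by the smoothed mean target of its category,
    computed only over examples that precede it in a random permutation."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)  # the artificial "time"
    sums, counts = {}, {}              # running history per category
    encoded = [0.0] * n
    for i in perm:                     # walk the data in permutation order
        c = categories[i]
        # history-only estimate with prior smoothing -> no target leakage
        encoded[i] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_stats(cats, ys, prior=sum(ys) / len(ys))
print([round(v, 2) for v in enc])
```

Note that the first example of each category in the permutation is encoded with just the prior: its own target is never used to compute its own encoding.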

Another important detail of CatBoost is the use of combinations of categorical features as additional categorical features, which capture high-order dependencies, like the joint information of user ID and ad topic in the task of ad click prediction.

The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them.

CatBoost constructs combinations in a greedy way.

Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset.

Combinations are converted to TS on the fly.
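Conceptually, a combination is just a row-wise concatenation of categorical values, which then gets encoded with TS like any other categorical feature (a toy sketch with hypothetical names, not CatBoost's internal representation):

```python
def combine(*feature_columns, sep="|"):
    """Concatenate several categorical columns row-wise into one
    combined categorical feature."""
    return [sep.join(values) for values in zip(*feature_columns)]

user_id = ["u1", "u2", "u1"]
ad_topic = ["cars", "cars", "sports"]
combined = combine(user_id, ad_topic)
print(combined)  # ['u1|cars', 'u2|cars', 'u1|sports']
```

Each distinct pair like `u1|cars` becomes its own category, which is why the number of combinations explodes and CatBoost only builds them greedily, split by split.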

## Fighting gradient biases

CatBoost implements an algorithm that fights the usual gradient boosting biases.

Existing implementations face a statistical issue called prediction shift.

The distribution of F(x_k) | x_k for a training example x_k shifts away from the distribution of F(x) | x for a test example x.

This problem is similar to the one that occurs in preprocessing of categorical variables, that was described above.

The CatBoost team derived ordered boosting, a modification of the standard gradient boosting algorithm that avoids target leakage.

CatBoost has two boosting modes, Ordered and Plain.

The latter mode is the standard GBDT algorithm with inbuilt ordered TS.

You can find a detailed description of the algorithm in the paper “Fighting biases with dynamic boosting”.

CatBoost uses oblivious decision trees, where the same splitting criterion is used across an entire level of the tree.

Such trees are balanced, less prone to overfitting, and allow speeding up prediction significantly at testing time.

Here is the implementation of oblivious tree evaluation in CatBoost:

```cpp
int index = 0;
for (int depth = 0; depth < tree.ysize(); ++depth) {
    index |= binFeatures[tree[depth]] << depth;
}
result += Model.LeafValues[treeId][resultId][index];
```

As you can see, there are no “if” operators in this code.

You don’t need branches to evaluate an oblivious decision tree.

An oblivious decision tree can be described as a list of conditions, one condition per layer.

With oblivious trees you just need to evaluate all of the tree's conditions, compose a binary vector from them, convert this binary vector to a number, and access the leaf array at the index equal to that number.
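The same branch-free index computation can be expressed in a few lines of Python (a toy sketch with assumed `conditions` and `leaf_values` structures, not CatBoost's actual data layout):

```python
def eval_oblivious_tree(x, conditions, leaf_values):
    """Evaluate an oblivious tree without branching: one (feature, threshold)
    condition per level; the condition bits form the leaf index."""
    index = 0
    for depth, (feat, thr) in enumerate(conditions):
        bit = 1 if x[feat] > thr else 0  # same condition for the whole level
        index |= bit << depth
    return leaf_values[index]

conditions = [(0, 0.5), (1, 2.0)]  # depth-2 tree: two (feature, threshold) pairs
leaves = [10.0, 11.0, 12.0, 13.0]  # 2**depth leaf values
print(eval_oblivious_tree([0.7, 1.0], conditions, leaves))  # bits 1,0 -> index 1 -> 11.0
print(eval_oblivious_tree([0.2, 3.0], conditions, leaves))  # bits 0,1 -> index 2 -> 12.0
```

Because the loop body is identical for every example, this pattern vectorizes and pipelines well, which is where the prediction speedup comes from.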

For comparison, here is the corresponding code in LightGBM (XGBoost has a similar implementation):

```cpp
std::vector<int> left_child_;
std::vector<int> right_child_;

inline int NumericalDecision(double fval, int node) const {
  ...
  if (GetDecisionType(decision_type_[node], kDefaultLeftMask)) {
    return left_child_[node];
  } else {
    return right_child_[node];
  }
  ...
}

inline int Tree::GetLeaf(const double* feature_values) const {
  ...
  while (node >= 0) {
    node = NumericalDecision(feature_values[split_feature_[node]], node);
  }
  ...
}
```

In the Ordered boosting mode, during the learning process, we maintain the supporting models Mr,j, where Mr,j(i) is the current prediction for the i-th example based on the first j examples in the permutation σr.

At each iteration t of the algorithm, we sample a random permutation σr from {σ1, ..., σs} and construct a tree Tt on the basis of it.

First, for categorical features, all TS are computed according to this permutation.

Second, the permutation affects the tree learning procedure.

Based on Mr,j(i), the corresponding gradients are computed.

While constructing a tree, we approximate the gradient G in terms of cosine similarity, where for each example i we take the gradient based only on the examples preceding it in σr.

When the tree structure Tt (i.e., the sequence of splitting attributes) is built, we use it to boost all the models Mr′,j.

Detailed information can be found in the original paper or in the NIPS’18 slides.

## GPU training

CatBoost can be efficiently trained on several GPUs in one machine.

Experimental results on different hardware show that CatBoost achieves good scalability.

On 16 GPUs with InfiniBand, CatBoost runs approximately 3.75 times faster than on 4 GPUs.

The scalability should be even better with larger datasets.

If there is enough data, we can train models on slow 1GbE networks, as two machines with two cards per machine are not significantly slower than 4 GPUs on one PCIe root complex.

You can read more in this NVIDIA article.

Among the described advantages, we also need to mention the following:

• Overfitting detector. Usually in gradient boosting we adjust the learning rate to reach stable accuracy, but the smaller the learning rate, the more iterations are needed; the detector stops training when the model starts to overfit.
• Missing values. Just leave them as NaN.
• Custom loss functions. In CatBoost you are able to write your own loss function.

## Feature importance

CatBoost provides tools for the Python package that allow plotting charts with different training statistics.

This information can be accessed both during and after the training procedure. You can monitor training in an iPython Notebook using the visualization tool CatBoost Viewer.

A CatBoost model can be integrated with TensorFlow; combining the two is a common case, for example using a neural network for feature extraction and gradient boosting on top. Also, a CatBoost model can now be used in production with the help of CoreML.

## Examples

I created an example of applying CatBoost to a regression problem, using data from the Allstate Claims Severity competition as a basis. Feel free to use my Colab notebook in your further research! You can also find plenty of other examples in the official CatBoost GitHub repository.

## Contribution

If you want to make CatBoost better: check out the “help wanted” issues to see what can be improved, or open an issue if you want something.