Gradient Boosting Libraries — A Comparison
Gradient Boosted Machines — A residuals’ worst enemy
Gradient boosting is one of the most widely used machine learning algorithms in data science. Residual-based boosting leads the pack in predictive accuracy and adaptability, making it a usual suspect in winning Kaggle solutions and a popular choice in industry for ranking, classification and regression problems. Since the first implementations in scikit-learn (2012) and XGBoost (2014), a number of new Python libraries have emerged: h2o (2015), LightGBM (2016) and CatBoost (2017). This leaves data enthusiasts today spoilt for choice between the different libraries.
In this post, I outline how the implementations differ and compare them along general criteria and feature support. If you are considering gradient-boosted decision trees (GBDT), I hope this helps you understand the capabilities and trade-offs of each library and determine the best one for your use case.
Residual-based boosting in a nutshell
Residual-based boosting works by combining the predictions of many models (an ensemble of weak learners), each focusing on the weaknesses of its predecessor by fitting that model’s residuals. The weighted combination of predictions thereby sequentially learns the more difficult patterns in the data and builds an increasingly accurate model that predicts “the hell out of your data”.
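The residual-fitting loop above can be sketched in a few lines of Python, using shallow scikit-learn regression trees as weak learners (a minimal illustration of the idea, not a production implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add a shrunken correction
    trees.append(tree)

# each round fits the previous round's residuals, so training error shrinks
print(np.mean((y - prediction) ** 2))
```

Each weak learner on its own is a poor model; the shrunken sum of all of them is what "predicts the hell out of your data".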
1) Scope of the libraries
One of the obvious differences across the implementations is the scope of the libraries. XGBoost, LightGBM and CatBoost are boosting-specific implementations, whereas scikit-learn and h2o are universal modelling libraries that cover a much larger set of models.
2) Supported boosting types
Another key difference is the boosting types they support. In general, boosting types differ in the way they control over-fitting (e.g. regularisation) and how they introduce stochasticity into training (e.g. sub-sampling).
The Gradient Boosted Trees implementation in scikit-learn is the most generic implementation of GBDTs and is based on Friedman (1, 2), who laid the groundwork for a new generation of boosting algorithms. XGBoost is a more optimised version of the latter: it parallelises the split search within each tree (the boosting iterations themselves remain sequential, as in scikit-learn). It also has more fine-grained options for regularisation and over-fitting control, e.g. decision trees with dropout (DART), which aims to prevent over-fitting by dropping a random subset of the existing trees when computing the residuals that the next tree will fit.
Figure 1: Comparing supported boosting types across libraries
LightGBM and CatBoost are also optimised versions of generic GBDTs. In addition to generic GBDT and DART, LightGBM offers support for gradient-based one-side sampling (GOSS), which gives higher sampling weights to data points with larger gradients (slope of the tangent of the loss function) before each iteration. This speeds up the training process by excluding a large portion of small gradient points.
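The sampling step in GOSS can be illustrated in a few lines of NumPy: keep the top-a fraction of points by gradient magnitude, randomly sample a b fraction of the rest, and up-weight the sampled small-gradient points by (1 − a)/b so the gradient statistics stay approximately unbiased. This is an illustrative sketch of the idea, not LightGBM's actual code:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Return indices and weights for one GOSS sampling round."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest gradients first
    top = order[: int(a * n)]                # always keep big-gradient points
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[len(top):] = (1 - a) / b         # compensate the down-sampling
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))  # 0.2*1000 + 0.1*1000 = 300 points instead of 1000
```

The next tree is then trained only on the 300 selected points, weighted accordingly, which is where the speed-up comes from.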
CatBoost has two boosting schemes – ordered and plain gradient boosted trees. The latter is the generic GBDT implementation. The default is ordered, which aims to avoid biased gradients by excluding the current observations for the model that calculates the current gradients. (See their latest paper for details).
h2o, on the other hand, currently has two ways of running gradient boosting. H2OGradientBoostingEstimator follows the generic GBDT algorithm specified in The Elements of Statistical Learning. In addition, an implementation of XGBoost was recently added that supports DART, GLM and generic GBDT (see details on their blog and documentation).
3) Splitting strategies
Searching for the best split is the key computational burden in tree learning. Since there is no analytical solution for which feature to split on next, the algorithm has to search through all available features and choose the best one at each stage. A lot of recent innovation has come from this area, resulting in some key differences in how each implementation grows a tree. Whereas scikit-learn and CatBoost build wide, symmetric trees, LightGBM and XGBoost tend to build deeper, asymmetric trees. (h2o supports both.)
3.1) Growing policies: wide vs. best
If you were to draw a tree with a root node and six leaves, how would you build it?
Expand in predetermined order
Figure 2: Depth-wise/ Level-wise expansion
If you expand it level by level, you grow the tree in a fixed order, i.e. expanding nodes from left to right, one level at a time. This is called “depth-wise” (or “level-wise”) expansion, since the tree is completed one level at a time. Traditional decision tree algorithms such as C4.5 or CART (Classification and Regression Trees) follow a depth-wise strategy. An advantage is that the resulting balanced tree acts as a form of regularisation, so it works well on smaller datasets where it helps prevent over-fitting. CatBoost uses a special type of depth-wise expansion called oblivious trees: all nodes on a level share the same binarised splitting criterion, which allows a vectorised representation of the tree that can be evaluated faster.
Expand wherever it’s best
Figure 3: Best-first/ Leaf-wise expansion
Another expansion method is to expand whichever leaf is best (marked green in Figure 3). This is called “best-first” learning, since the tree is expanded at the node with the largest reduction in the tree-building criterion (e.g. Gini impurity). Best-first decision trees were introduced by Friedman et al. (2000), and any traditional decision tree algorithm can be adapted to this growing policy.
Why is best-first faster?
Since best-first expands nodes based on their contribution to the global loss (Figure 3) rather than their local gain on a particular level (Figure 2), nodes with less potential are left unexpanded. By stopping growth strategically, we are more likely to reach the important nodes sooner, and thereby build a tree with better information gain and the same number of leaves, but at lower computational cost.
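The difference can be sketched as a priority queue over candidate leaves: best-first always splits the leaf with the largest loss reduction next, regardless of its depth. This is a toy illustration with made-up gain values, not a library implementation:

```python
import heapq

# candidate leaves as (negative gain, leaf id); heapq pops the smallest
# tuple, so negating the gain yields the largest-gain leaf first
candidates = [(-50.0, "root.left"), (-5.0, "root.right")]
heapq.heapify(candidates)

expanded = []
max_leaves = 4
while candidates and len(expanded) < max_leaves:
    neg_gain, leaf = heapq.heappop(candidates)  # best leaf anywhere in the tree
    expanded.append(leaf)
    # splitting a leaf yields two children with (made-up) smaller gains
    heapq.heappush(candidates, (neg_gain / 2, leaf + ".left"))
    heapq.heappush(candidates, (neg_gain / 4, leaf + ".right"))

print(expanded)
```

Note that the low-gain right branch of the root is never expanded: the budget of four splits is spent entirely where the loss reduction is largest, which is exactly what depth-wise growth cannot do.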
Importance for bigger data
This really starts paying off on datasets with more than ~10k rows, where best-first tends to converge much faster. On smaller datasets, best-first tends to overfit, which can be contained by limiting the max_depth parameter.
How to use it
When working with the Python libraries, grow_policy is the parameter to set in practice (LightGBM grows leaf-wise by default, with num_leaves as the main control). Best-first learning is now supported in LightGBM, XGBoost and h2o. It isn’t supported in CatBoost and scikit-learn.
3.2) Splitting strategies: exact vs. approximate
Decision tree algorithms work by iteratively separating the training data into smaller and smaller subsets. Finding the decision rule that optimises the training objective means searching for the best column to split on and the best point at which to split it. For high-cardinality features (e.g. continuous ones), iterating through all possible split points (“exact” evaluation) incurs significant computational cost.
Approximate methods, such as histogram-based binning, avoid searching all possible splits by discretising continuous values into discrete bins. This significantly reduces the computational complexity of exact splitting, since searching over bins is both faster and more memory-efficient.
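A minimal NumPy sketch of the idea: bucket a continuous feature into, say, 16 bins and evaluate variance-reduction splits only at the bin boundaries rather than at every distinct value (illustrative only; real libraries use cleverer bin construction, e.g. weighted quantiles):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = (x > 0.3).astype(float) + rng.normal(scale=0.1, size=10_000)

# discretise the feature into 16 equal-width bins
edges = np.linspace(x.min(), x.max(), 17)
bins = np.digitize(x, edges[1:-1])          # bin index 0..15 per point

best_gain, best_edge = -np.inf, None
parent_sse = np.var(y) * len(y)
for b in range(1, 16):                      # 15 candidate splits vs ~10k exact ones
    left, right = y[bins < b], y[bins >= b]
    if len(left) == 0 or len(right) == 0:
        continue
    sse = np.var(left) * len(left) + np.var(right) * len(right)
    gain = parent_sse - sse                 # variance reduction of this split
    if gain > best_gain:
        best_gain, best_edge = gain, edges[b]

print(best_edge)  # near the true threshold of 0.3, up to one bin width off
```

The accuracy cost is bounded by the bin width, while the number of candidate splits drops from the number of distinct values to the number of bins.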
How to use it
LightGBM was the first to include histogram-based splitting as default. XGBoost, h2o and CatBoost support it optionally. Scikit-learn does not support approximate splitting.
4) Speed and accuracy benchmarks
When libraries are compared, speed and accuracy are often the first focus. Three interesting benchmarks have been published recently, which I would like to mention here. Unfortunately, not all libraries are covered equally, and I cannot guarantee that the boosting schemes compared were equivalent.
— User benchmarks
In an application for price prediction, Connor McNamara compared GBM-specific libraries (LightGBM, CatBoost, and XGBoost).
“In terms of model accuracy, XGBoost was the clear winner in both GridSearchCV and RandomizedSearchCV, with the lowest RMSE. For early stopping, LightGBM was the winner, with a slightly lower RMSE than XGBoost.
Overall, CatBoost was the clear underperformer, with training times comparable to XGBoost while having the worst predictions in terms of RMSE.”
Figure 4: Reproduced from Connor McNamara
Some other interesting benchmarks have been done on the airline dataset (100m rows). Szilard benchmarked XGBoost against LightGBM and h2o on a 10k-row sample of the data.
For CPU implementations, he found h2o had the best AUC (followed by XGBoost), while LightGBM was twice as fast at similar AUC. For GPU implementations, h2o crashed, and XGBoost was both the fastest and the most accurate in terms of AUC.
Interestingly, Microsoft also benchmarked LightGBM against XGBoost on the airline dataset last year, where LightGBM came out fastest and also marginally better in terms of AUC.
— XGBoost.ai benchmark
In a recent blog post by xgboost.ai, CPU and GPU implementations of the XGBoost histogram version, LightGBM and CatBoost were compared on different datasets, including the above-mentioned airline dataset.
Figure 5: Reproduced from XGBoost.ai blog post
The results showed XGBoost and LightGBM to be very similar in terms of accuracy, for both CPU and GPU implementations (unfortunately, neither F1 score nor AUC was provided). In terms of speed, LightGBM had the fastest CPU implementation, while XGBoost had the fastest GPU implementation.
5) General criteria
In practice, speed and accuracy may not be the deciding factors when choosing one tool over another. For example, the cost of false positives in industry applications is often high, and accuracy will matter more than speed. Similarly, simpler, more interpretable models that are less accurate may be preferable to complex models that are hard to interpret.
One might also look at how consistent a library is from version to version. In a previous post on model consistency, for example, I found that there can be considerable variation in results when running the same model under a different version of the same library. This was particularly the case for CatBoost and LightGBM, which had the best model performance but were also the most sensitive to changes such as introducing NaN values and smaller datasets.
Last but not least, companies often choose whatever fits into their toolchain most quickly, is documented best, or is most actively maintained. These and other general criteria are summarised in the table below.
Figure 6: General comparison across libraries
The second table below compares the libraries from a more feature-specific angle.
Figure 7: Comparing feature support across libraries
Open source gradient boosting libraries are evolving at a rapid pace. In this post, I compared the most widely used GBM libraries in terms of key implementation differences, feature support and performance benchmarks. Going beyond speed and accuracy, I also included a comparison of general criteria that should help you weigh these factors in your own context.
Ultimately, once you’re clear about your objective and constraints, nothing beats running your own benchmark. Since CatBoost, LightGBM and XGBoost all have scikit-learn wrappers, you can do so very quickly (see an example of this here).
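A minimal benchmarking harness could look like the following, here using scikit-learn's own GradientBoostingRegressor as a stand-in. Because CatBoost, LightGBM and XGBoost expose scikit-learn-compatible wrappers (CatBoostRegressor, LGBMRegressor, XGBRegressor), they can be dropped into the same dictionary when those packages are installed:

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)

models = {
    "sklearn_gbm": GradientBoostingRegressor(n_estimators=100, random_state=0),
    # with the packages installed, add e.g.:
    # "lightgbm": LGBMRegressor(n_estimators=100),
    # "xgboost":  XGBRegressor(n_estimators=100),
    # "catboost": CatBoostRegressor(iterations=100, verbose=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    results[name] = (time.perf_counter() - start, -scores.mean())

for name, (seconds, rmse) in results.items():
    print(f"{name}: {seconds:.1f}s, RMSE {rmse:.2f}")
```

For a fair comparison you would also tune each library's hyper-parameters (e.g. with the GridSearchCV and RandomizedSearchCV approaches used in the benchmarks above) rather than relying on defaults.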
My only goal is to gradient boost over myself of yesterday. And to repeat this every day with an unconquerable spirit. (Mike Kim)
Check out this awesome parameter comparison between XGBoost & LightGBM by Lauraepp. You will learn a lot!