Gradient Boosting Libraries — A Comparison

Gradient Boosted Machines — A residuals’ worst enemy

Gradient boosting is one of the most widely used machine learning algorithms in data science. Residual-based boosting leads the pack in predictive accuracy and adaptability, making it a usual suspect among Kaggle-winning solutions and widely used in industry for ranking, classification and regression tasks. Since the first implementations in scikit-learn (2012) and XGBoost (2014), a number of new Python libraries have emerged: h2o (2015), LightGBM (2016) and CatBoost (2017). This leaves data enthusiasts today spoilt for choice between the different libraries.

In this post, I outline how the implementations differ and compare them along general criteria and feature support. If you are considering gradient-boosted decision trees (GBDT), I hope this helps you understand the capabilities and trade-offs of each library and determine the best one for your use case.

Residual-based boosting in a nutshell

Residual-based boosting works by combining predictions from many models (an ensemble of weak learners), each focusing on the weaknesses of the previous model by fitting its residuals. The weighted combination of predictions thereby sequentially learns the more difficult patterns in the data and builds an increasingly accurate model that predicts “the hell out of your data”.
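To make the mechanics concrete, here is a minimal sketch of this loop for squared-error regression, using shallow scikit-learn trees as the weak learners. The function names and hyper-parameter values are illustrative only, not any library's internals.

```python
# A minimal sketch of residual-based boosting for squared-error regression,
# using shallow scikit-learn trees as the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Each new tree is fit to the residuals of the current ensemble prediction."""
    base = y.mean()                          # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)   # shrunken, sequential update
        trees.append(tree)
    return base, trees

def predict_boosted_trees(base, trees, X, learning_rate=0.1):
    return base + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```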

1) Scope of the libraries

One of the obvious differences across the implementations is the scope of the libraries. XGBoost, LightGBM and CatBoost are boosting-specific implementations, whereas scikit-learn and h2o are universal modelling libraries that cover a much larger set of models.

2) Supported boosting types

Another key difference is the boosting types they support. In general, boosting types differ in how they control over-fitting (e.g. regularisation) and how they implement stochastic gradient boosting (e.g. row sub-sampling).

The gradient boosted trees implementation in scikit-learn is the most generic implementation of GBDTs and is based on Friedman (1, 2), who laid the groundwork for a new generation of boosting algorithms. XGBoost is a more optimised version of the latter which, among other things, parallelises the construction of each tree (the boosting rounds themselves remain sequential, as in scikit-learn). It also has more fine-grained options for regularisation and over-fitting control, e.g. decision trees with dropout (DART), which aims to prevent over-fitting by using only a random subset of the existing trees to calculate the residuals that the next tree will fit.
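For illustration, here is a sketch of how one might switch XGBoost to the DART booster via its Python API; the parameter values are illustrative, not recommendations.

```python
# Sketch: switching XGBoost to the DART booster. Parameter values are illustrative.
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "dart",              # dropout-regularised trees instead of plain gbtree
    "rate_drop": 0.1,               # fraction of existing trees dropped each round
    "skip_drop": 0.5,               # probability of skipping dropout in a given round
    "max_depth": 6,
    "eta": 0.1,
    "objective": "reg:squarederror",
}
model = xgb.train(params, dtrain, num_boost_round=200)
```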

Figure 1: Comparing supported boosting types across libraries

LightGBM and CatBoost are also optimised versions of generic GBDTs. In addition to generic GBDT and DART, LightGBM offers support for gradient-based one-side sampling (GOSS), which gives higher sampling weights to data points with larger gradients (slope of the tangent of the loss function) before each iteration. This speeds up the training process by excluding a large portion of small gradient points.
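A sketch of enabling GOSS through LightGBM's scikit-learn wrapper; the sampling rates shown are illustrative, not tuned values.

```python
# Sketch: enabling GOSS via LightGBM's scikit-learn wrapper. Rates are illustrative.
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)

model = LGBMRegressor(
    boosting_type="goss",   # gradient-based one-side sampling instead of plain gbdt
    top_rate=0.2,           # keep the 20% of rows with the largest gradients
    other_rate=0.1,         # randomly sample 10% of the remaining small-gradient rows
    n_estimators=200,
    learning_rate=0.1,
)
model.fit(X, y)
```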

CatBoost has two boosting schemes – ordered and plain gradient boosted trees. The latter is the generic GBDT implementation. The default is ordered, which aims to avoid biased gradient estimates by excluding the current observation from the model used to calculate its gradient. (See their latest paper for details.)
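A minimal sketch of selecting between the two schemes via the boosting_type parameter; iteration counts are illustrative.

```python
# Sketch: choosing between CatBoost's two boosting schemes. Values are illustrative.
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)

ordered = CatBoostRegressor(boosting_type="Ordered", iterations=200, verbose=False)
plain = CatBoostRegressor(boosting_type="Plain", iterations=200, verbose=False)

ordered.fit(X, y)   # permutation-driven scheme that avoids biased gradient estimates
plain.fit(X, y)     # the classic GBDT scheme, typically faster on larger data
```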

h2o, on the other hand, currently has two ways of running gradient boosting. H2OGradientBoostingEstimator follows the generic GBDT algorithm specified in The Elements of Statistical Learning. In addition, an implementation of XGBoost was recently added that supports DART, GLM and generic GBDT boosters (see details on their blog and documentation).
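A minimal sketch of both routes in the h2o Python API; the file path and column handling are placeholders for your own data.

```python
# Sketch: the two gradient boosting routes in the h2o Python API.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()
train = h2o.import_file("train.csv")                # hypothetical training file
features, target = train.columns[:-1], train.columns[-1]

gbm = H2OGradientBoostingEstimator(ntrees=200, max_depth=6, learn_rate=0.1)
gbm.train(x=features, y=target, training_frame=train)

xgb_dart = H2OXGBoostEstimator(ntrees=200, booster="dart")   # DART via the XGBoost backend
xgb_dart.train(x=features, y=target, training_frame=train)
```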

3) Splitting strategies

Searching for the best split is the key computational burden in tree learning. Since there is no analytical solution for which feature to split on next, the algorithm has to search through the available features and choose the best split at each stage. A lot of recent innovation has happened in this area and has resulted in some key differences in how each implementation grows a tree. Whereas scikit-learn and CatBoost build large symmetric trees, LightGBM and XGBoost tend to build deeper, asymmetric trees. (h2o supports both.)

3.1) Growing policies: wide vs. best

If you were to draw a tree with a root node and six leaves, how would you build it?

Expand in a predetermined order

Figure 2: Depth-wise / level-wise expansion

If you expand it level by level, you grow the tree in a fixed order, i.e. expanding nodes from left to right, one level at a time. This is called “depth-wise” (or level-wise) expansion, since we expand all nodes on the same level before moving deeper. Traditional decision tree algorithms such as C4.5 or CART (Classification and Regression Trees) follow this strategy. An advantage is that the resulting balanced tree acts as a form of regularisation and therefore works well on smaller datasets, where it prevents over-fitting. CatBoost uses a special type of depth-wise expansion called oblivious trees: all splits on a level share the same binarised splitting criterion, which allows a vectorised representation of the tree that can be evaluated faster.

Expand wherever it’s best

Figure 3: Best-first / leaf-wise expansion

Another expansion method is to expand whichever leaf is best (marked green in Figure 3). This is called “best-first” learning, since the tree is expanded at the node that yields the largest reduction in the splitting criterion (e.g. Gini impurity). Best-first decision trees were introduced by Friedman et al. (2000), and any traditional decision tree algorithm can be adapted to this growing policy.

Why is best-first faster?

Since best-first expands nodes based on their contribution to the global loss (Figure 3) rather than just locally on a particular level (Figure 2), nodes with little potential are left unexpanded. By stopping growth strategically, we are more likely to reach the important nodes faster and thereby build a tree with better information gain and the same number of leaves, but at lower computational cost.

Importance for bigger data

This really starts paying off for >10k rows, where best-first tends to converge much faster. On smaller datasets, best-first tends to overfit, which can be contained by limiting the max_depth parameter.

How to use it

When working with the Python libraries, the growing policy is the parameter to tune in practice (called grow_policy in XGBoost). Best-first learning is now supported in LightGBM, XGBoost, and h2o. It is not supported in CatBoost and scikit-learn.
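As a sketch, this is how best-first growth might be switched on in XGBoost's scikit-learn wrapper; the parameter values are illustrative. LightGBM grows leaf-wise by default, so num_leaves is the main knob there.

```python
# Sketch: best-first (leaf-wise) growth in XGBoost's scikit-learn wrapper.
from xgboost import XGBRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50_000, n_features=20, random_state=0)

model = XGBRegressor(
    tree_method="hist",        # leaf-wise growth requires the histogram tree method
    grow_policy="lossguide",   # expand the leaf with the largest loss reduction first
    max_leaves=64,             # cap the number of leaves rather than the depth
    max_depth=0,               # 0 removes the depth limit so growth is purely leaf-wise
    n_estimators=200,
)
model.fit(X, y)

# LightGBM grows leaf-wise by default; num_leaves (and max_depth) are its main controls.
```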

3.2) Splitting strategies: exact vs. approximate

Decision tree algorithms work by iteratively separating training data into smaller and smaller subsets. Optimising the training objective therefore means searching for the best column to split on, and for the best split point within that column. For high-cardinality features (e.g. continuous ones), iterating through all possible splitting points (“exact” evaluation) leads to significant computational cost.

Approximate methods, such as histogram-based binning, avoid searching all possible splits by discretising continuous values into a fixed number of bins. This significantly reduces the computational cost of exact splitting, since searching over bins is both faster and more memory-efficient.
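A toy numpy sketch (not any library's actual code) of why binning shrinks the search space:

```python
# Conceptual sketch: binning shrinks the number of candidate split points
# from "every unique value" to "every bin boundary".
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=100_000)                  # a continuous, high-cardinality column

exact_candidates = np.unique(feature)               # exact search: ~100k thresholds

bin_edges = np.quantile(feature, np.linspace(0, 1, 256))   # 255 quantile bins
binned = np.digitize(feature, bin_edges[1:-1])      # integer bin index per row

print(f"{len(exact_candidates)} exact candidates vs {len(bin_edges) - 1} binned candidates")
```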

How to use it

LightGBM was the first to make histogram-based splitting the default. XGBoost, h2o and CatBoost support it optionally. Scikit-learn does not support approximate splitting.
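For example, histogram splitting might be configured as follows; the bin counts are illustrative.

```python
# Sketch: configuring histogram-based splitting. Bin counts are illustrative.
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# LightGBM bins every feature into at most `max_bin` buckets before training.
lgbm = LGBMRegressor(max_bin=255, n_estimators=200)

# XGBoost enables the same idea with its histogram tree method.
xgbm = XGBRegressor(tree_method="hist", max_bin=256, n_estimators=200)
```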

4) Scalability

When libraries are compared, speed and accuracy are often the first focus. Three interesting benchmarks have been published recently, which I would like to mention here. Unfortunately, not all libraries are covered equally, and I cannot guarantee that the boosting schemes covered are comparable.

— User benchmarks

In an application for price prediction, Connor McNamara compared GBM-specific libraries (LightGBM, CatBoost, and XGBoost).

— Findings

“In terms of model accuracy, XGBoost was the clear winner in both GridSearchCV and RandomizedSearchCV, with the lowest RMSE. For early stopping, LightGBM was the winner, with a slightly lower RMSE than XGBoost.

Overall, CatBoost was the clear underperformer, with training times comparable to XGBoost while having the worst predictions in terms of RMSE.”

Figure 4: Reproduced from Connor McNamara

Some other interesting benchmarks have been run on the airline dataset (100m rows). Szilard benchmarked XGBoost against LightGBM and h2o on a 10k row sample of the data.

For CPU implementations, he found h2o had the best performance (followed by XGBoost) in terms of AUC, while LightGBM was fastest (2x) at a similar AUC performance. For GPU implementations, h2o crashed and XGBoost was both the fastest and most accurate in terms of AUC.

Interestingly, the airlines dataset was also used by Microsoft last year to benchmark LightGBM against XGBoost, where LightGBM came out fastest and also marginally better in terms of AUC.

— XGBoost.ai benchmark

In a recent blog post by xgboost.ai, CPU and GPU implementations of the XGBoost histogram version, LightGBM and CatBoost were compared on different datasets, including the above-mentioned airline dataset.

— Findings

Figure 5: Reproduced from the XGBoost.ai blog post

The results showed XGBoost and LightGBM to be very similar in terms of accuracy, for both CPU and GPU implementations (unfortunately, F1 score or AUC was not provided). In terms of speed, LightGBM was fastest on CPU, while XGBoost had the fastest GPU implementation.

5) General criteria

In practice, speed and accuracy may not be the determining factors when choosing a certain tool over another. For example, the cost of false positives in industry applications is often high and accuracy will be more important than speed. Similarly, more interpretable, simple models which are less accurate may be preferable to complex models that are hard to interpret.

One might also look at how consistent the library is from version to version. In a previous post on model consistency, for example, I found that there can be considerable variation in results when running the same model but under a different version of the same library. This is particularly the case for CatBoost and LightGBM, which had the best model performance but were also most sensitive to changes such as introducing NaN values and smaller datasets.

Last but not least, companies choose whatever fits into their toolchain quickest, is documented best, or is actively maintained. These and other general criteria are summarised in the table below.

Figure 6: General comparison across libraries

The second table below compares the libraries from a more feature-specific angle.

Figure 7: Comparing feature support across libraries

Endnotes

Open source gradient boosting libraries are evolving at a rapid pace. In this post, I compared the most used libraries for GBM in terms of some key implementation differences, feature support and performance benchmarks. Going beyond speed and accuracy, I also included a comparison of general criteria that should help you think about some of these in your context.

Ultimately, once you are clear on your objective and constraints, nothing beats doing your own benchmark. Since CatBoost, LightGBM and XGBoost all have scikit-learn wrappers, you can do so very quickly (see an example of this here).
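As a sketch of such a quick benchmark, assuming all three libraries are installed (dataset and parameter choices are purely illustrative):

```python
# Sketch of a quick cross-validated comparison via the scikit-learn wrappers.
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=10_000, n_features=50, random_state=0)

models = {
    "xgboost": XGBRegressor(n_estimators=200, learning_rate=0.1),
    "lightgbm": LGBMRegressor(n_estimators=200, learning_rate=0.1),
    "catboost": CatBoostRegressor(iterations=200, learning_rate=0.1, verbose=False),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean MSE = {-scores.mean():.3f}")
```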

My only goal is to gradient boost over myself of yesterday. And to repeat this every day with an unconquerable spirit. (Mike Kim)

Further reading

Check out this awesome parameter comparison between XGBoost & LightGBM by Lauraepp. You will learn a lot!
