Why is XGBoost better than GBM?
The best algorithm for tabular data!
Today we dig into what might be the most widely used machine learning model in the data science world: XGBoost! There was a point in my career when I was no longer a machine learning modeler but an XGBoost modeler: using other algorithms had simply become unnecessary given how well XGBoost performed! We are going to look at:
How to parallelize a Gradient Boosted Machine
How to build better trees
GitHub repos, similar models, and YouTube videos
Supercharge your data science and machine learning career (Sponsored)
Get a step ahead of your competitors and move up the Kaggle leaderboards with The Kaggle Book and The Kaggle Workbook. Authored by two Kaggle Grandmasters, these books offer a practical approach to navigating the often-overwhelming world of Kaggle. Go deep into data science and machine learning fundamentals with The Kaggle Book. The Kaggle Workbook analyzes four Kaggle competitions on predicting driver behavior, modeling time-series sales data, building computer vision models to identify diseased leaves, and solving a Google natural language processing problem. Plus, purchase the print or Kindle book and receive a free eBook in PDF format.
Why did XGBoost become so popular? Two aspects made XGBoost's success back in 2014. The first is the regularized learning objective, which allows for better pruning of the trees. The second is the ability to distribute the gradient boosting learning process across multiple threads or machines, allowing it to handle larger scales of data. Boosting algorithms have long been known to perform very well on most large data sets, but the iterative nature of boosting makes them painfully slow!
How to parallelize GBM
How do you parallelize a boosting algorithm, then? In the case of Random Forest, it is easy: you distribute the data across threads, build independent trees there, and average the resulting tree predictions. In the case of an iterative process like boosting, you need to parallelize the tree building itself. It all comes down to how you find an optimal split in a tree: for each feature, sort the data and linearly scan the feature to find the best split. If you have N samples and M features, this is O(NM log N) time complexity at each node. In pseudocode:
best_split = None
for feature in features:
    for sample in samples sorted by feature:
        if split(feature, sample) is better than best_split:
            best_split = split(feature, sample)
So you can parallelize split search by scanning each feature independently and reducing the resulting splits to the optimal one.
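As a sketch of this idea (my own toy data and helper names, not XGBoost's actual implementation), the per-feature scans can be handed to a thread pool, and the per-feature winners reduced to a single best split with a plain `min`:

```python
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(X, y, feature):
    """Scan one feature: sort samples by its value, try every
    threshold, and return the best split by sum of squared errors
    of the two children around their means (a simple regression
    criterion for illustration)."""
    order = sorted(range(len(y)), key=lambda i: X[i][feature])
    best = (float("inf"), None)  # (score, threshold)
    for k in range(1, len(order)):
        left = [y[i] for i in order[:k]]
        right = [y[i] for i in order[k:]]
        sse = sum((v - sum(left) / len(left)) ** 2 for v in left) \
            + sum((v - sum(right) / len(right)) ** 2 for v in right)
        if sse < best[0]:
            threshold = (X[order[k - 1]][feature] + X[order[k]][feature]) / 2
            best = (sse, threshold)
    return (best[0], feature, best[1])

def find_best_split(X, y, n_features):
    # Each feature is scanned by an independent worker; reducing
    # to the single optimal split is just a min over the results.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda f: best_split_for_feature(X, y, f),
                           range(n_features))
    return min(results)  # lowest impurity wins

X = [[1.0, 10.0], [2.0, 9.0], [3.0, 1.0], [4.0, 2.0]]
y = [0.0, 0.0, 1.0, 1.0]
score, feature, threshold = find_best_split(X, y, n_features=2)
```

Note that real implementations parallelize at a lower level than Python threads allow, but the structure — independent scans followed by a reduction — is the same.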
XGBoost is not the first attempt to parallelize GBM, but they used a series of tricks that made it very efficient:
First, all the columns are pre-sorted while keeping a pointer to the original index of the entry. This removes the need to sort the feature at every search.
They used a Compressed Column Format for a more efficient distribution of the data.
They used a cache-aware prefetching algorithm to minimize non-contiguous memory access that results from the pre-sorting step.
Not directly about parallelization, but they also came up with an approximate split-search algorithm that speeds up tree building even further.
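A toy illustration of the pre-sorting trick (my own simplification, only in the spirit of the compressed column layout described above): each column is sorted once up front, keeping a pointer back to the original row index, so every later split search can scan the column in order without re-sorting:

```python
# A tiny feature matrix: 3 rows, 2 columns.
X = [[3.0, 10.0],
     [1.0, 30.0],
     [2.0, 20.0]]

n_rows, n_cols = len(X), len(X[0])

# Pre-sort each column ONCE, storing (value, original_row) pairs.
# The row pointer lets us recover labels/gradients for each sample
# during the split scan without touching the unsorted matrix order.
sorted_columns = [
    sorted((X[row][col], row) for row in range(n_rows))
    for col in range(n_cols)
]

# Scanning column 0 in sorted order now needs no sort at all:
values, rows = zip(*sorted_columns[0])
```

The trade-off is that following the row pointers produces non-contiguous memory access, which is exactly what the cache-aware prefetching mentioned above is designed to mitigate.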
As of today, you can train XGBoost across cores on the same machine, but also on AWS YARN, Kubernetes, Spark, and GPUs, and you can use Dask or Ray to do it (see the XGBoost documentation).
One thing to look out for is that there is a limit to how much XGBoost can be parallelized. With too many threads, the data communication between threads becomes a bottleneck and the training speed plateaus. Here is an example explaining that effect: “How to Best Tune Multithreading Support for XGBoost in Python”.
Also, you should read the XGBoost article: “XGBoost: A Scalable Tree Boosting System”!
How to build better trees
Why does XGBoost perform better than a typical Gradient Boosted Machine (GBM)? They simply don't optimize the same objective function! GBM tries to minimize the loss function as measured on the training data, whereas XGBoost also takes into account the complexity of the trees, making it much more robust to overfitting.
When GBM considers a split in a tree, it asks itself: which split will give me the best predictive performance as measured on the training data? Maximizing performance on the training data simply does not generalize well to unseen data, and this leads to overfitting. GBM has other hyperparameters you can use to regularize your trees, but the optimization process itself is not well tuned for generalization. You can look at equation 6 of the original GBM paper for this: “Stochastic Gradient Boosting”.
On the other hand, XGBoost includes a regularization term in the objective function that penalizes complex trees. It looks not only at the pure decrease of the loss (through its gradient, the first derivative of the loss) but also at the increase in complexity (through its curvature, the second derivative of the loss). Typically, decreases in the loss reduce underfitting, while increases in complexity lead to more overfitting. XGBoost exposes a hyperparameter "lambda" that penalizes large leaf weights, effectively controlling how strongly the curvature term dampens the gradient term. You can look at equation 6 in the XGBoost paper: “XGBoost: A Scalable Tree Boosting System”. On a side note, it is quite a coincidence that equation 6 in both the GBM paper and the XGBoost paper describes the respective objective function, isn't it?!
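To make this concrete, here is a small sketch of the quantities XGBoost derives from that objective: the optimal leaf weight and the split gain (the symbols G, H, lambda, and gamma follow the paper's notation; the code itself is my own simplification, not the library's implementation):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda), where G and H are
    the sums of first and second derivatives of the loss over the
    samples in the leaf. A larger lambda shrinks the weight toward 0."""
    return -G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Loss reduction from splitting a node into left/right children.
    gamma is the fixed complexity cost of adding one more leaf, so a
    split is only worth taking if the gain stays positive."""
    def score(G, H):
        # Structure score of a leaf: G^2 / (H + lambda).
        return G * G / (H + lam)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

# With lambda = 0 and gamma = 0, the gain reduces to a plain
# gradient-based criterion; raising either shrinks the gain and
# prunes more aggressively.
g_unregularized = split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0, gamma=0.0)
g_regularized = split_gain(-4.0, 2.0, 4.0, 2.0, lam=2.0, gamma=1.0)
```

You can see the regularization at work: the same candidate split yields a smaller gain once lambda and gamma are turned on, so borderline splits get pruned away.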
This is obviously very reminiscent of the bias-variance tradeoff: the gradient is reminiscent of the bias, and the curvature of the variance. Even their analytical expressions look similar. While they tend to capture similar information, they are most likely not exactly equal: bias and variance are statistical quantities, while gradient and curvature are functional ones. Despite those differences, this is one of the first times those concepts have been mathematically formalized in a tree-based algorithm as part of the objective function, such that the trees learned by XGBoost are optimal with respect to the bias-variance tradeoff.
Despite the beauty of this objective function, the choice of regularization term is quite ad hoc, and I am pretty sure we could construct even better ones!
More information about XGBoost
Interesting Github repositories
Awesome Gradient Boosting Research Papers: A curated list of gradient and adaptive boosting papers with implementation.
AutoXGB: auto train XGBoost directly from CSV files, auto tune XGBoost using Optuna, auto serve the best XGBoost model using FastAPI.
xgboostExplainer: An R package that makes XGBoost models fully interpretable.
XGBoostLSS: An extension of XGBoost to probabilistic forecasting.
XGBoost.jl: XGBoost Julia Package.
Data-Science-Competitions: the goal of this repo is to provide solutions to data science competitions (Kaggle, Data Hack, Machine Hack, Driven Data, etc.).
LightGBM: LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:
Faster training speed and higher efficiency.
Lower memory usage.
Support of parallel, distributed, and GPU learning.
Capable of handling large-scale data.
CatBoost: CatBoost is an algorithm for gradient boosting on decision trees:
Reduce time spent on parameter tuning, because CatBoost provides great results with default parameters
Improve your training results with CatBoost, which allows you to use non-numeric factors instead of having to pre-process your data or spend time and effort turning it into numbers.
Train your model on a fast implementation of gradient-boosting algorithm for GPU. Use a multi-card configuration for large datasets.
Reduce overfitting when constructing your models with a novel gradient-boosting scheme.
Apply your trained model quickly and efficiently even to latency-critical tasks using CatBoost's model applier.
H2O.ai’s GBM: H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
XGBoost series by StatQuest with Josh Starmer:
XGBoost by Andrew Ng:
XGBoost: How it works, with an example by Sundog Education with Frank Kane: