Today we dig into what might be the most used Machine Learning model in the data science world: XGBoost! There was a point in my career when I was no longer a machine learning modeler but rather an XGBoost modeler: using other algorithms simply became unnecessary considering how well XGBoost performed! We are going to look at:
How to parallelize a Gradient Boosted Machine
How to build better trees
Github repos, similar models and Youtube videos
Why did XGBoost become so popular? Two aspects made the success of XGBoost back in 2014. The first is the regularized learning objective, which allows for better pruning of the trees. The second is the ability to distribute the Gradient Boosting learning process across multiple threads or machines, allowing it to handle data at much larger scales. Boosting algorithms have long been known to perform very well on most large data sets, but the iterative nature of boosting makes them painfully slow!
How to parallelize GBM
How do you parallelize a boosting algorithm then? In the case of Random Forest, it is easy: you just distribute the data across threads, build independent trees there, and average the resulting tree predictions. In the case of an iterative process like boosting, you need to parallelize the tree building itself. It all comes down to how you find an optimal split in a tree: for each feature, sort the data and linearly scan the feature to find the best split. If you have N samples and M features, it is O(NM log(N)) time complexity at each node. In pseudocode:
best_split, best_gain = None, float("-inf")
for feature in features:  # each feature scan is independent
    for sample in sorted_samples(feature):  # linear scan over sorted values
        gain = split_gain(feature, sample)  # score the candidate split
        if gain > best_gain:
            best_split, best_gain = (feature, sample), gain
So you can parallelize split search by scanning each feature independently and reducing the resulting splits to the optimal one.
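To make that concrete, here is a minimal sketch of that map/reduce split search in Python. The helper names (best_split_for_feature, find_best_split) and the squared-error gain are my own toy choices, not XGBoost's actual internals, and Python threads only illustrate the structure (XGBoost does this with C++ threads):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(X, y, j):
    # Sort one column, then linearly scan it, tracking the best threshold.
    order = np.argsort(X[:, j])
    x, t = X[order, j], y[order]
    n, total, left = len(t), t.sum(), 0.0
    best_gain, best_thr = 0.0, None
    for i in range(1, n):
        left += t[i - 1]
        if x[i] == x[i - 1]:
            continue  # no valid threshold between identical values
        # Toy gain for squared error: variance reduction of the split.
        gain = left**2 / i + (total - left)**2 / (n - i) - total**2 / n
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i - 1] + x[i]) / 2
    return best_gain, best_thr, j

def find_best_split(X, y, n_threads=4):
    # Map: scan each feature independently; reduce: keep the largest gain.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(lambda j: best_split_for_feature(X, y, j), range(X.shape[1]))
    return max(results, key=lambda r: r[0])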
XGBoost was not the first attempt to parallelize GBM, but its authors used a series of tricks that made it very efficient:
First, all the columns are pre-sorted once while keeping a pointer to each entry's original index, which removes the need to re-sort the feature values at every split search (see the sketch after this list).
They used a compressed column (CSC) format for more efficient storage and distribution of the data.
They used a cache-aware prefetching algorithm to minimize non-contiguous memory access that results from the pre-sorting step.
Not directly about parallelization, but they also came up with an approximate split-finding algorithm that speeds up tree building even further.
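As a toy illustration of the first two tricks, here is what pre-sorting with index pointers looks like. The function name build_sorted_blocks is mine, and real XGBoost stores these blocks in CSC format inside its C++ core:

import numpy as np

def build_sorted_blocks(X):
    # Sort each column once, keeping pointers back to the original rows,
    # so later split searches can scan values in order without re-sorting.
    return {j: np.argsort(X[:, j], kind="stable") for j in range(X.shape[1])}

X = np.array([[3.0, 10.0],
              [1.0, 30.0],
              [2.0, 20.0]])
blocks = build_sorted_blocks(X)
print(blocks[0])  # [1 2 0]: rows ordered by feature 0 (1.0, 2.0, 3.0)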
As of today, you can train XGBoost across cores on the same machine, but also on AWS YARN, Kubernetes, Spark, and GPUs, and you can use Dask or Ray to orchestrate it (see the XGBoost documentation).
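For the common single-machine case, enabling multithreading is one parameter away. A minimal sketch, assuming xgboost and scikit-learn are installed (n_jobs sets the thread count, and tree_method="hist" selects the fast histogram-based variant of the approximate split search):

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, n_jobs=8, tree_method="hist")
model.fit(X, y)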
One thing to look out for is that there is a limit to how much XGBoost can be parallelized. With too many threads, the data communication between threads becomes a bottleneck and the training speed plateaus. Here is an example explaining that effect: “How to Best Tune Multithreading Support for XGBoost in Python”.
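You can observe that plateau yourself with a simple timing loop over the thread count, in the spirit of that article. This is just a sketch with synthetic placeholder data; your exact numbers will depend on your hardware:

import time
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200_000, 100))
y = (rng.random(200_000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

for nthread in [1, 2, 4, 8, 16]:
    start = time.time()
    xgb.train({"nthread": nthread, "tree_method": "hist"}, dtrain, num_boost_round=50)
    # Expect near-linear speedup at first, then a plateau past the physical core count.
    print(f"{nthread} threads: {time.time() - start:.1f}s")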
Also, you should read the XGBoost paper: "XGBoost: A Scalable Tree Boosting System"!
How to build better trees
Why does XGBoost perform better than typical Gradient Boosted Machines (GBM)? They simply don't optimize the same objective function! GBM minimizes the loss function measured on the training data, whereas XGBoost also takes into account the complexity of the trees, making it much more robust to overfitting.
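Concretely, the regularized objective from the paper is

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

where T is the number of leaves in a tree and w its vector of leaf weights: the gamma term penalizes growing more leaves, and the lambda term shrinks the leaf weights.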