AIreflections#5 - bagging and boosting

Let us understand the concepts of bagging and boosting in machine learning.

Overview

Bagging (Bootstrap Aggregating) and Boosting are both ensemble learning methods that combine multiple weak learners to create a strong learner. The key difference lies in how the learners are trained and how their predictions are combined:

  • Bagging trains weak learners independently in parallel on random subsets of data and combines their predictions by averaging or voting. This reduces variance and helps prevent overfitting.

  • Boosting trains weak learners sequentially, with each learner trying to correct the errors of the previous one. Subsequent models are weighted based on the performance of previous models. This reduces bias and can convert weak learners to strong ones.
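
To make the contrast concrete, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the synthetic dataset and settings are illustrative only). BaggingClassifier trains its base trees independently on bootstrap samples and votes, while AdaBoostClassifier trains them sequentially, reweighting misclassified examples at each round.

```python
# Bagging vs. boosting with scikit-learn (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: independent deep trees on bootstrap samples, combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: sequential shallow trees (decision stumps by default),
# each round upweighting the examples the previous rounds got wrong.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```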

Key Differences

  1. Goal:
    • Bagging aims to reduce variance and prevent overfitting
    • Boosting aims to reduce bias and improve accuracy
  2. Training:
    • In bagging, each model is trained independently in parallel on a random subset of data
    • In boosting, models are trained sequentially, with each model trying to correct the errors of the previous model
  3. Weighting:
    • Bagging typically gives equal weight to all models (illustrated in the sketch after this list)
    • Boosting assigns higher weight to better performing models and increases the weight of misclassified examples
  4. Examples:
    • Random Forest is a popular bagging algorithm that uses decision trees as the base learner
    • AdaBoost and Gradient Boosting are well-known boosting algorithms
  5. Overfitting:
    • Bagging helps reduce overfitting by decreasing variance
    • Boosting can sometimes overfit if there are too many rounds or the base models are too complex [1] [2] [3] [4]
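
The weighting difference in particular is easy to see in code. Below is a hedged, from-scratch sketch of bagging a regression tree: every model is fit independently on a bootstrap sample and the predictions are combined with equal weight. (Library implementations such as scikit-learn's RandomForestRegressor add refinements like per-split feature subsampling.)

```python
# Hand-rolled bagging for regression: bootstrap samples, equal-weight averaging.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_estimators=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    all_preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[idx], y_train[idx])       # each model trained independently
        all_preds.append(tree.predict(X_test))
    return np.mean(all_preds, axis=0)              # every model gets equal weight
```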

When to Use

  • Bagging is preferred when the model has high variance and low bias. It is effective for complex models like deep decision trees that are prone to overfitting.

  • Boosting is preferred when the model has high bias and low variance. It can effectively improve the accuracy of weak learners like shallow decision trees.

In general, boosting tends to achieve better accuracy than bagging, but it is also more sensitive to noisy data and outliers. The best approach depends on the specific dataset and problem [5] [6].

Both bagging and boosting have proven to be powerful techniques for improving machine learning model performance and are widely used in practice, often producing state-of-the-art results on challenging problems [7] [8].

Gradient boosting and extreme gradient boosting

Gradient boosting and extreme gradient boosting (XGBoost) are powerful machine learning techniques for building predictive models. Here is a concise, intuitive explanation of how they work:

Gradient Boosting

Gradient boosting builds an ensemble of weak learner models, typically decision trees, in a stage-wise fashion to create a strong learner. The key ideas are:

  1. Each new tree is fit to the pseudo-residuals (negative gradient) of the loss function with respect to the ensemble predictions made by the previous trees. This allows subsequent models to correct the mistakes of the previous ones.

  2. The predictions of each tree are scaled by a learning rate (shrinkage factor) before being added to the ensemble. This controls the contribution of each tree and helps prevent overfitting.

  3. The final model prediction is the initial prediction plus the sum of the scaled predictions of all the trees in the ensemble [15] [16] [17].

Intuitively, gradient boosting can be seen as performing gradient descent in function space. At each iteration, it fits a new tree that moves the ensemble predictions in the direction that minimizes the loss the most, as indicated by the negative gradient [18] [25].
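
The three steps above fit in a few lines of code. Here is a hedged, from-scratch sketch of gradient boosting for squared error loss (where the pseudo-residuals are just ordinary residuals); it is meant to illustrate the mechanics, not to replace a library implementation.

```python
# Minimal gradient boosting for squared-error loss (illustration only).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                          # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)   # shrink each tree's contribution
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```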

Extreme Gradient Boosting (XGBoost)

XGBoost is an optimized and regularized implementation of gradient boosting that has become very popular due to its speed and performance. It extends gradient boosting with several enhancements:

  1. Regularization: XGBoost adds L1 and L2 regularization terms to the loss function which helps prevent overfitting.

  2. Tree pruning: XGBoost limits tree growth with a max_depth parameter and a min_child_weight parameter, and prunes splits whose loss reduction (gain) falls below the gamma threshold, keeping models from becoming overly complex.

  3. Handling missing values: XGBoost has built-in routines for handling missing values.

  4. Weighted quantile sketch: XGBoost employs a distributed weighted quantile sketch algorithm to find candidate split points efficiently.

  5. Sparsity-aware split finding: XGBoost can handle sparse data efficiently [30] [31] [32].

These optimizations allow XGBoost to often outperform regular gradient boosting in terms of both speed and accuracy. It has become a go-to algorithm for many machine learning competitions and real-world applications [22] [29].
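
As a usage sketch (assuming the xgboost Python package is installed; the data and parameter values below are arbitrary and untuned, for illustration only), the knobs described above map onto the scikit-learn-style interface roughly as follows:

```python
# Illustrative XGBoost configuration; values are not tuned.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((500, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(500)
X[::50, 3] = np.nan                  # missing values are handled natively

model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,               # shrinkage, as in plain gradient boosting
    max_depth=4,                     # limits tree depth
    min_child_weight=1.0,            # minimum sum of hessians allowed in a child
    gamma=0.1,                       # minimum loss reduction (gain) required to split
    reg_alpha=0.0,                   # L1 regularization on leaf weights
    reg_lambda=1.0,                  # L2 regularization on leaf weights
)
model.fit(X, y)
print(model.predict(X[:5]))
```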

In summary, gradient boosting and XGBoost work by sequentially adding decision trees to an ensemble, with each tree trained to correct the errors of the previous trees. XGBoost extends this with additional optimizations to improve speed, generalization, and handling of real-world data challenges. The result is a powerful and efficient algorithm for building state-of-the-art predictive models [26] [33] [34].

Quantitative view of gradient boosting and XGBoost

Gradient Boosting - Pseudo-residuals and Loss Function

In gradient boosting, the pseudo-residuals \(r_{im}\) for the \(i\)-th training instance at the \(m\)-th iteration are defined as the negative gradient of the loss function \(L\) with respect to the predicted value \(\hat{y}_i^{(m-1)}\) from the previous iteration:

\[r_{im} = -\left[\frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i}\right]_{\hat{y}_i=\hat{y}_i^{(m-1)}}\]

For squared error loss \(L(y_i, \hat{y}_i) = \tfrac{1}{2}(y_i - \hat{y}_i)^2\), this simplifies to:

\[r_{im} = y_i - \hat{y}_i^{(m-1)}\]

which are just the ordinary residuals.

The objective at each iteration is to find a weak learner \(h_m(x)\) that minimizes the loss:

\[\mathcal{L}^{(m)} = \sum_{i=1}^n L\left(y_i, \hat{y}_i^{(m-1)} + h_m(x_i)\right)\]
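
As a small sketch, the pseudo-residuals for two common choices of \(L\) look like this in code (for the log-loss case the ensemble output is assumed to be on the logit scale):

```python
# Pseudo-residuals (negative gradients) for two common losses (sketch).
import numpy as np

def pseudo_residuals_squared_error(y, y_pred):
    # L = 0.5 * (y - y_pred)^2  =>  -dL/dy_pred = y - y_pred
    return y - y_pred

def pseudo_residuals_log_loss(y, raw_score):
    # Binary log-loss with raw_score on the logit scale:
    # L = -[y*log(p) + (1 - y)*log(1 - p)],  p = sigmoid(raw_score)
    # =>  -dL/d(raw_score) = y - p
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return y - p
```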

Gradient Boosting - Learning Rate and Ensemble Prediction

After the \(m\)-th weak learner is found, the ensemble prediction is updated as:

\[\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + \nu \cdot h_m(x_i)\]

where \(\nu\) is the learning rate (shrinkage factor) that controls the contribution of each new weak learner. Typical values are \(\nu \in [0.01, 0.1]\).
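
As a purely illustrative, made-up example: with \(\nu = 0.1\), a previous prediction \(\hat{y}_i^{(m-1)} = 2.5\), and a new tree output \(h_m(x_i) = 0.8\), the updated prediction is \(\hat{y}_i^{(m)} = 2.5 + 0.1 \cdot 0.8 = 2.58\).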

XGBoost - Regularization and Loss Function

XGBoost’s objective function at the \(m\)-th iteration includes a regularization term \(\Omega(h_m)\):

\[\mathcal{L}^{(m)} = \sum_{i=1}^n L\left(y_i, \hat{y}_i^{(m-1)} + h_m(x_i)\right) + \Omega(h_m)\]

The regularization term for a tree \(h_m\) with \(T\) leaves and leaf weights \(w\) is defined as:

\[\Omega(h_m) = \gamma T + \frac{1}{2}\lambda \|w\|_2^2\]

where \(\gamma\) penalizes the number of leaves (acting as the minimum loss reduction required to make a split) and \(\lambda\) is the L2 regularization coefficient on leaf weights; an L1 penalty \(\alpha \|w\|_1\) on leaf weights can also be added.
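
To connect this regularizer to the split gain used below, here is a brief sketch of the standard derivation from the XGBoost paper [42]. Taking a second-order Taylor expansion of \(L\) around \(\hat{y}_i^{(m-1)}\), with \(g_i\) and \(h_i\) the first and second derivatives of the loss, and \(G_j = \sum_{i \in I_j} g_i\), \(H_j = \sum_{i \in I_j} h_i\) the sums over the instances \(I_j\) falling in leaf \(j\), the objective becomes (up to a constant):

\[\mathcal{L}^{(m)} \approx \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T\]

Minimizing over each leaf weight gives \(w_j^* = -\dfrac{G_j}{H_j + \lambda}\) and the structure score

\[\mathcal{L}^{(m)} \approx -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T\]

whose change under a candidate split yields the gain formula in the next subsection.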

XGBoost - Tree Pruning and Handling Missing Values

XGBoost limits tree growth with a maximum depth parameter and a minimum child weight parameter (the minimum sum of instance hessians required in a child), and it prunes candidate splits using the gain:

\[\text{gain} = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma\]

A split is kept only if the gain is positive. \(G_L, G_R\) are the sums of gradients and \(H_L, H_R\) the sums of hessians in the left and right child.
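
The gain is straightforward to compute once the gradient and hessian sums are known; a minimal sketch follows (a hypothetical helper, not XGBoost's actual internal API):

```python
# Split gain as defined above (sketch). lam is the L2 penalty on leaf weights,
# gamma the minimum loss reduction required for a split to be worthwhile.
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_L, H_L) + leaf_score(G_R, H_R)
                  - leaf_score(G_L + G_R, H_L + H_R)) - gamma

# A candidate split is kept only when split_gain(...) > 0.
```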

For missing values, XGBoost learns an optimal default direction for each tree node. It computes gains for assigning instances with missing values to the left or right, and picks the direction with higher gain.

XGBoost - Weighted Quantile Sketch and Sparsity-aware Split Finding

To find candidate split points, XGBoost builds weighted quantile sketches of feature values. For a dataset \(\{(x_1, h_1), \ldots, (x_n, h_n)\}\) with hessians as weights, it proposes roughly \(1/\epsilon\) candidate points such that each bucket between adjacent candidates carries approximately \(\epsilon \sum_{k=1}^n h_k\) of the total weight.
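
A rough in-memory sketch of hessian-weighted candidate selection for a single feature (the real algorithm uses a mergeable, approximate sketch data structure suited to distributed computation; the helper below is hypothetical and for intuition only):

```python
# Cut one feature column into buckets of roughly equal hessian weight (sketch).
import numpy as np

def weighted_quantile_candidates(feature_values, hessians, eps=0.1):
    order = np.argsort(feature_values)
    x, h = feature_values[order], hessians[order]
    cum_w = np.cumsum(h) / h.sum()               # cumulative hessian weight in (0, 1]
    targets = np.arange(eps, 1.0, eps)           # about 1/eps candidate quantiles
    idx = np.clip(np.searchsorted(cum_w, targets), 0, len(x) - 1)
    return np.unique(x[idx])                     # candidate split points
```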

For sparse data, XGBoost visits only the non-missing entries during split finding: it sorts them and performs a linear scan to choose the best split, and the default direction for instances with missing values is determined by the gain as described earlier.


References

[1] towardsdatascience.com: Ensemble Learning: Bagging and Boosting Explained
[2] datascience.stackexchange.com: What’s the Difference Between Bagging and Boosting?
[3] projectpro.io: Bagging vs Boosting in Machine Learning: A Comprehensive Guide
[4] linkedin.com: How Do You Choose Between Bagging and Boosting for Your Machine Learning Model?
[5] upgrad.com: Bagging vs Boosting: Key Differences and When to Use What
[6] aiacceleratorinstitute.com: Boosting and Bagging: Powerful Ensemble Methods in Machine Learning
[7] geeksforgeeks.org: Bagging vs Boosting in Machine Learning
[8] baeldung.com: Bagging, Boosting, and Stacking: Ensemble Models in Machine Learning
[9] corporatefinanceinstitute.com: What is Bagging (Bootstrap Aggregation)?
[10] datatrained.com: Bagging and Boosting: Ensemble Learning Techniques Explained
[11] towardsdatascience.com: Ensemble Methods: Bagging, Boosting, and Stacking
[12] stats.stackexchange.com: What is the Difference Between Bagging, Boosting and Stacking in Machine Learning?
[13] pickl.ai: Bagging vs Boosting in Machine Learning: Which One to Choose?
[14] kaggle.com: Bagging vs Boosting: A Comparative Analysis
[15] cse.chalmers.se: An Intuitive Explanation of Gradient Boosting
[16] towardsdatascience.com: The Intuition Behind Gradient Boosting & XGBoost
[17] datacamp.com: A Guide to The Gradient Boosting Algorithm
[18] wikipedia.org: Gradient Boosting
[19] towardsdatascience.com: XGBoost (Extreme Gradient Boosting): How to Improve on Regular Gradient Boosting
[20] youtube.com: Gradient Boost Part 1: Regression Main Ideas
[21] machinelearningmastery.com: A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
[22] towardsdatascience.com: All You Need to Know About Gradient Boosting Algorithm (Part 1: Regression)
[23] machinelearningmastery.com: How to Develop an Extreme Gradient Boosting Ensemble in Python
[24] geeksforgeeks.org: XGBoost (Extreme Gradient Boosting)
[25] explained.ai: Gradient Boosting Explained
[26] neptune.ai: Gradient Boosted Decision Trees: A Complete Guide
[27] machinelearningmastery.com: How to Configure the Gradient Boosting Algorithm
[28] geeksforgeeks.org: Gradient Boosting in Machine Learning
[29] kdnuggets.com: An Intuitive Ensemble Learning Guide to Gradient Boosting
[30] neptune.ai: XGBoost: Everything You Need to Know
[31] nvidia.com: What is XGBoost?
[32] wikipedia.org: XGBoost
[33] towardsdatascience.com: Boosting Algorithms: Gradient Boosting in Python
[34] towardsdatascience.com: XGBoost: An Intuitive Explanation
[35] python.plainenglish.io: The Complete XGBoost Therapy with Python
[36] machinelearningmastery.com: How to Develop a Gradient Boosting Machine Ensemble in Python
[37] towardsdatascience.com: Understanding Gradient Boosting from Scratch with a Small Dataset
[38] geeksforgeeks.org: XGBoost for Regression
[39] datascience.stackexchange.com: Correct Theoretical Regularized Objective Function for XGB/LGBM Regression Task
[40] statlect.com: Gradient Boosting
[41] paperspace.com: Gradient Boosting for Classification: A Beginner’s Guide
[42] arxiv.org: XGBoost: A Scalable Tree Boosting System
[43] towardsdatascience.com: The Notorious XGBoost
[44] bradleyboehmke.github.io: Gradient Boosting Machines
[45] towardsdatascience.com: All You Need to Know About Gradient Boosting Algorithm (Part 1: Regression)
[46] datascience.stackexchange.com: Why Do We Use Gradients Instead of Residuals in Gradient Boosting?
[47] datacamp.com: A Guide to The Gradient Boosting Algorithm
[48] neptune.ai: XGBoost: Everything You Need to Know
[49] fastruby.io: Introduction to Gradient Boosting
[50] machinelearningplus.com: An Introduction to Gradient Boosting Decision Trees
[51] towardsdatascience.com: The Intuition Behind Gradient Boosting & XGBoost
[52] linkedin.com: How Does XGBoost Really Work?
[53] deep-and-shallow.com: The Gradient Boosters III: XGBoost
[54] sandiego.edu: Supervised Machine Learning with Gradient Boosting

Based on a chat with claude-3-opus on perplexity.ai

Written on April 8, 2024