AIreflections#6 - bias-variance tradeoff

Let us revisit the bias-variance tradeoff - a topic of timeless intrigue in statistics and machine learning.

The bias-variance tradeoff is a fundamental concept in machine learning that refers to the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:

  • Bias is the error introduced by approximating a complex real-world problem with a simpler model. Models with high bias tend to underfit the training data, making overly simplistic assumptions [2] [4] [11].

  • Variance is the error introduced by the model’s sensitivity to small fluctuations in the training set. Models with high variance tend to overfit the training data, modeling the random noise instead of the intended outputs [2] [4] [11]. A minimal sketch contrasting the two failure modes follows this list.
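As a concrete illustration, here is a minimal sketch assuming a sine-shaped ground truth, Gaussian noise, and polynomial regression models (these specifics are illustrative choices, not taken from the discussion above). A degree-1 fit underfits (high bias), while a degree-15 fit chases the noise and overfits (high variance):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples from a smooth, known ground-truth function
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x.reshape(-1, 1), y

X_train, y_train = make_data(30)
X_test, y_test = make_data(200)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train MSE={mean_squared_error(y_train, model.predict(X_train)):.3f}  "
          f"test MSE={mean_squared_error(y_test, model.predict(X_test)):.3f}")
# Typically: degree 1 has similar (high) train and test error -> underfitting;
# degree 15 has near-zero train error but much larger test error -> overfitting.
```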

Relationship Between Bias and Variance

There is an inverse relationship between bias and variance [2] [6]:

  • Models with low bias tend to have high variance. They are more complex and flexible, so they fit the training data very closely but don’t generalize well to new data (overfitting) [2] [4] [10].

  • Models with high bias tend to have low variance. They are simpler, so they may underfit the training data but they are less sensitive to the noise and fluctuations in it [2] [4] [10].

The goal is to find the sweet spot that balances bias and variance and minimizes the total error [6] [11]. Too simple a model has high bias and underfits, while too complex a model has high variance and overfits [4] [13]. The optimal model complexity is somewhere in between.

Visualizing the Tradeoff

The bias-variance tradeoff can be visualized by plotting bias, variance, and total error against model complexity on a single plot; the simulation sketch after the list below produces exactly this kind of figure.

  • As model complexity increases, bias decreases but variance increases.
  • The goal is to achieve low bias and low variance. The optimal model complexity minimizes the total error.
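A rough way to produce such a plot is to simulate many training sets from a known ground truth and, for each level of model complexity (here, polynomial degree), estimate the squared bias and variance of the predictions over a grid of test points. The sketch below assumes a sine ground truth, Gaussian noise, and polynomial regression as the model family, all illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)           # true function
sigma = 0.3                                   # noise standard deviation
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
degrees = range(1, 11)
n_sets, n_train = 200, 30

bias2, variance = [], []
for d in degrees:
    preds = np.empty((n_sets, len(x_test)))
    for i in range(n_sets):                   # many independent training sets
        x = rng.uniform(0, 1, n_train).reshape(-1, 1)
        y = f(x).ravel() + rng.normal(0, sigma, n_train)
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
        preds[i] = model.fit(x, y).predict(x_test)
    mean_pred = preds.mean(axis=0)
    bias2.append(np.mean((mean_pred - f(x_test).ravel()) ** 2))
    variance.append(preds.var(axis=0).mean())

total = np.array(bias2) + np.array(variance) + sigma ** 2
plt.plot(degrees, bias2, label="bias$^2$")
plt.plot(degrees, variance, label="variance")
plt.plot(degrees, total, label="total expected error")
plt.xlabel("model complexity (polynomial degree)")
plt.ylabel("error")
plt.legend()
plt.show()
```

As complexity grows, the bias² curve falls while the variance curve rises, and the total error is minimized somewhere in between.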

Implications for Different ML Algorithms

The bias-variance tradeoff manifests differently in different ML algorithms [2] [4]:

  • Linear algorithms like linear regression tend to have high bias but low variance
  • Non-linear algorithms like decision trees tend to have low bias but high variance
  • Deep learning models are more complex and can often reduce bias without increasing variance as much, but they still exhibit some tradeoff [1] [10]

Tuning hyperparameters like regularization strength, tree depth, or number of neighbors can adjust the bias-variance balance for a given algorithm [2] [4].
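One way to estimate these quantities empirically is the bias_variance_decomp helper from the mlxtend library (the tool behind the decomposition page cited in [28]). The sketch below is illustrative: the synthetic dataset and estimator settings are assumptions, not taken from the text, and mlxtend treats the noisy test labels as ground truth, so its bias term also absorbs the noise:

```python
import numpy as np
from mlxtend.evaluate import bias_variance_decomp
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (500, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "deep tree (max_depth=None)": DecisionTreeRegressor(random_state=0),
    "shallow tree (max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=0),
}
for name, est in models.items():
    loss, bias, var = bias_variance_decomp(
        est, X_tr, y_tr, X_te, y_te, loss="mse", num_rounds=100, random_seed=0)
    print(f"{name:28s} loss={loss:.3f} bias^2={bias:.3f} variance={var:.3f}")
# Typically: the linear model shows higher bias / lower variance, the unpruned
# tree lower bias / higher variance, and limiting max_depth shifts the balance.
```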

Mitigating the Tradeoff

Some techniques help manage the tradeoff and, in some cases, reduce error from both sources:

  • Using more training data reduces variance without impacting bias [11] [12]
  • Cross-validation helps assess generalization performance and tune model complexity [12]
  • Ensemble methods like bagging can reduce variance by averaging many models, while boosting primarily targets bias [12] (a brief bagging sketch follows this list)
  • Regularization techniques constrain model complexity to reduce overfitting [4] [12]
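To make the bagging claim concrete, the sketch below (in the spirit of the scikit-learn comparison cited in [27]; the data-generating function and model settings are illustrative assumptions) estimates the prediction variance of a single unpruned tree versus a bagged ensemble by refitting each on many independently drawn training sets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # true function
x_test = np.linspace(0, 1, 50).reshape(-1, 1)
n_sets, n_train, sigma = 200, 50, 0.3

def prediction_variance(make_model):
    """Average variance of predictions across independent training sets."""
    preds = np.empty((n_sets, len(x_test)))
    for i in range(n_sets):
        x = rng.uniform(0, 1, n_train).reshape(-1, 1)
        y = f(x).ravel() + rng.normal(0, sigma, n_train)
        preds[i] = make_model().fit(x, y).predict(x_test)
    return preds.var(axis=0).mean()

print("single tree :", prediction_variance(lambda: DecisionTreeRegressor(random_state=0)))
print("bagged trees:", prediction_variance(
    lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0)))
# The bagged ensemble usually shows markedly lower prediction variance.
```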

However, the irreducible error due to noise in the data provides a lower bound on the total error that cannot be overcome [16].

In summary, the bias-variance tradeoff is a key concept to understand when developing ML models. The goal is to find the right model complexity to minimize both bias and variance and achieve the best generalization performance. Different algorithms have different tradeoffs, but techniques like cross-validation, regularization, and ensembles can help strike the right balance.

Mathematical Derivation of the Bias-Variance Decomposition

Let \(f(x)\) be the true underlying function we are trying to learn. We assume \(f\) is fixed but unknown. Let \(y\) be the observed target variable which is related to \(f(x)\) by:

\[y = f(x) + \epsilon\]

where \(\epsilon\) is random noise with mean zero and variance \(\sigma^2_\epsilon\). That is:

\[\mathbb{E}[\epsilon] = 0, \quad \mathbb{E}[\epsilon^2] = \sigma^2_\epsilon\]

Let \(\hat{f}(x)\) be the function learned by our model from a finite training set \(\mathcal{D}\). Note that \(\hat{f}\) is a random variable, since it depends on the randomness in \(\mathcal{D}\).

The expected squared prediction error of \(\hat{f}\) at a point \(x\) is:

\[\mathbb{E}_{\mathcal{D},\epsilon}\left[(y - \hat{f}(x))^2\right] = \mathbb{E}_{\mathcal{D},\epsilon}\left[(f(x) + \epsilon - \hat{f}(x))^2\right]\]

Expanding the square and using linearity of expectation:

\[\begin{align*} \mathbb{E}_{\mathcal{D},\epsilon}\left[(f(x) + \epsilon - \hat{f}(x))^2\right] &= \mathbb{E}_{\mathcal{D},\epsilon}\left[f(x)^2 + \epsilon^2 + \hat{f}(x)^2 + 2f(x)\epsilon - 2f(x)\hat{f}(x) - 2\epsilon\hat{f}(x)\right] \\ &= f(x)^2 + \mathbb{E}[\epsilon^2] + \mathbb{E}_\mathcal{D}[\hat{f}(x)^2] + 2f(x)\mathbb{E}[\epsilon] - 2f(x)\mathbb{E}_\mathcal{D}[\hat{f}(x)] - 2\mathbb{E}_\mathcal{D}[\hat{f}(x)]\mathbb{E}[\epsilon] \\ &= f(x)^2 + \sigma^2_\epsilon + \mathbb{E}_\mathcal{D}[\hat{f}(x)^2] - 2f(x)\mathbb{E}_\mathcal{D}[\hat{f}(x)] \end{align*}\]

In the last step, we used the fact that \(\mathbb{E}[\epsilon]=0\) and that \(\epsilon\) is independent of \(\hat{f}\), so \(\mathbb{E}_{\mathcal{D},\epsilon}[\epsilon\hat{f}(x)] = \mathbb{E}_\mathcal{D}[\hat{f}(x)]\,\mathbb{E}[\epsilon] = 0\).

Now, let’s add and subtract \(\mathbb{E}_\mathcal{D}[\hat{f}(x)]^2\) to get:

\[\begin{align*} \mathbb{E}_{\mathcal{D},\epsilon}\left[(y - \hat{f}(x))^2\right] &= f(x)^2 + \sigma^2_\epsilon + \mathbb{E}_\mathcal{D}[\hat{f}(x)^2] - 2f(x)\mathbb{E}_\mathcal{D}[\hat{f}(x)] + \mathbb{E}_\mathcal{D}[\hat{f}(x)]^2 - \mathbb{E}_\mathcal{D}[\hat{f}(x)]^2 \\ &= \sigma^2_\epsilon + \left(\mathbb{E}_\mathcal{D}[\hat{f}(x)]^2 - 2f(x)\mathbb{E}_\mathcal{D}[\hat{f}(x)] + f(x)^2\right) + \left(\mathbb{E}_\mathcal{D}[\hat{f}(x)^2] - \mathbb{E}_\mathcal{D}[\hat{f}(x)]^2\right) \\ &= \sigma^2_\epsilon + \left(\mathbb{E}_\mathcal{D}[\hat{f}(x)] - f(x)\right)^2 + \mathbb{E}_\mathcal{D}\left[\left(\hat{f}(x) - \mathbb{E}_\mathcal{D}[\hat{f}(x)]\right)^2\right] \end{align*}\]

The three terms in the final expression are:

  1. \(\sigma^2_\epsilon\) is the irreducible error due to the noise in the data. This cannot be reduced by any model.

  2. \(\left(\mathbb{E}_\mathcal{D}[\hat{f}(x)] - f(x)\right)^2\) is the squared bias, the amount by which the average prediction over all possible training sets differs from the true value.

  3. \(\mathbb{E}_\mathcal{D}\left[\left(\hat{f}(x) - \mathbb{E}_\mathcal{D}[\hat{f}(x)]\right)^2\right]\) is the variance, the expected squared deviation of \(\hat{f}(x)\) around its mean.

Therefore, we have decomposed the expected prediction error into three parts: irreducible error, squared bias, and variance:

\[\mathbb{E}_{\mathcal{D},\epsilon}\left[(y - \hat{f}(x))^2\right] = \sigma^2_\epsilon + \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x))\]

This is the bias-variance decomposition. It shows that to minimize the expected prediction error, we need to simultaneously minimize both the bias and variance. However, there is usually a tradeoff between bias and variance - models with high bias tend to have low variance and vice versa. The art of machine learning is to find the sweet spot that balances bias and variance to minimize the total prediction error.
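The identity can also be checked numerically. The sketch below assumes a quadratic ground truth, a deliberately misspecified straight-line fit, and a particular noise level and sample size (all illustrative choices); it estimates each term by Monte Carlo at a single point \(x_0\) and compares their sum with the directly estimated expected squared error:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x ** 2                  # true function
sigma, n_train, n_sets = 0.5, 20, 20_000
x0 = 0.8                              # evaluate the decomposition at a single point

preds = np.empty(n_sets)              # f_hat(x0) for each training set
errors = np.empty(n_sets)             # (y - f_hat(x0))^2 with a fresh noisy y at x0
for i in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, 1)    # misspecified: straight-line fit
    preds[i] = slope * x0 + intercept
    y0 = f(x0) + rng.normal(0, sigma)         # independent noisy observation at x0
    errors[i] = (y0 - preds[i]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("E[(y - f_hat)^2]       ≈", errors.mean())
print("sigma^2 + bias^2 + var ≈", sigma ** 2 + bias2 + variance)
# The two numbers should agree up to Monte Carlo error.
```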


References

[1] reddit.com: Do Successful Models Defy the Bias-Variance Tradeoff?
[2] machinelearningmastery.com: Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning
[3] towardsdatascience.com: Bias, Variance and How They Are Related to Underfitting, Overfitting
[4] serokell.io: Bias-Variance Tradeoff Explained
[5] datascience.stackexchange.com: Relation Between Underfitting vs High Bias and Low Variance
[6] kdnuggets.com: Understanding the Bias-Variance Trade-off in 3 Minutes
[7] machinelearningcompass.com: Bias and Variance in Machine Learning
[8] elitedatascience.com: Bias-Variance Tradeoff: Intuitive Explanation
[9] geeksforgeeks.org: Underfitting and Overfitting in Machine Learning
[10] towardsdatascience.com: Examples of Bias-Variance Tradeoff in Deep Learning
[11] towardsdatascience.com: Understanding the Bias-Variance Tradeoff
[12] mastersindatascience.org: The Difference Between Bias and Variance
[13] javatpoint.com: Bias and Variance in Machine Learning
[14] geeksforgeeks.org: Bias-Variance Trade-Off in Machine Learning
[15] cs.cornell.edu: Bias-Variance Decomposition
[16] wikipedia.org: Bias–Variance Tradeoff
[17] mlu-explain.github.io: Bias-Variance Tradeoff
[18] shiksha.com: Bias and Variance in Machine Learning
[19] pnas.org: Reconciling Modern Machine Learning Practice and the Bias-Variance Trade-Off
[20] towardsai.net: Bias-Variance Decomposition 101: A Step-by-Step Computation
[21] cs.toronto.edu: Lecture 5 - Decision Trees & Bias-Variance Decomposition
[22] jmlr.org: Bias–Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods
[23] allenkunle.me: Bias-Variance Decomposition
[24] stanford.edu: Bias-Variance Analysis
[25] cs.toronto.edu: Lecture 2 - Linear Regression & Bias-Variance Decomposition
[26] oregonstate.edu: Bias–Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods
[27] scikit-learn.org: Single Estimator versus Bagging: Bias-Variance Decomposition
[28] rasbt.github.io: Bias-Variance Decomposition
[29] stats.stackexchange.com: Understanding Bias-Variance Tradeoff Derivation
[30] berkeley.edu: Linear Regression and the Bias-Variance Tradeoff
[31] cmu.edu: Bias-Variance Decomposition
[32] cornell.edu: Bias-Variance Tradeoff and Ridge Regression
[33] oregonstate.edu: Bias-Variance Tradeoff
[34] towardsdatascience.com: Bias and Variance for Model Assessment
[35] wikipedia.org: Bias–Variance Tradeoff
[36] towardsdatascience.com: The Bias-Variance Tradeoff
[37] stats.stackexchange.com: Bias-Variance Tradeoff with SVMs
[38] youtube.com: 12 - Bias-Variance Tradeoff

Based on a chat with claude-3-opus on perplexity.ai

Written on April 8, 2024