Understanding the Bias-Variance Trade-off in Machine Learning
Chapter 1: Introduction to Bias and Variance
In the realm of machine learning, we gather data and construct models using training datasets. These models are then applied to test data—data that the model has not encountered before—to make predictions. The primary goal is to minimize prediction errors. While we aim to reduce training errors, our main concern lies in the test error or prediction error, which is influenced by both bias and variance.
The bias-variance trade-off is essential for addressing the following challenges:
- Preventing underfitting and overfitting.
- Ensuring consistency in predictions.
Let’s delve into the concepts that underlie the bias-variance trade-off.
Section 1.1: The Role of Model Complexity
Before we explore the bias-variance trade-off, it's important to understand how training error and prediction error vary with increased model complexity.
Imagine we have data points that represent a relationship between X and Y. The true relationship, or function, between these variables is represented as f(X). This function remains unknown:
Y = f(X) + ε
where ε represents irreducible noise in the observations.
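As a concrete illustration (the sine-shaped true function, noise level, and sample size below are assumptions chosen for this sketch, not something given above), such data can be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical true function f(X); the sine shape is an illustrative assumption.
    return np.sin(2 * np.pi * x)

n = 30
X = rng.uniform(0, 1, size=n)        # inputs
eps = rng.normal(0, 0.2, size=n)     # irreducible noise ε
Y = true_f(X) + eps                  # observations: Y = f(X) + ε
```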
Our task is to construct a model that accurately portrays the relationship between X and Y.
Input → Model → Output
The learning algorithm processes the input and generates a function that illustrates the relationship:
Input → Learning Algorithm → f̂(X)
For instance, in linear regression, the learning algorithm finds the best-fit line by minimizing a least-squares cost function; this can be done in closed form (Ordinary Least Squares) or iteratively with gradient descent.
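As a minimal sketch of that procedure (the data, learning rate, and iteration count are illustrative assumptions), gradient descent can fit the intercept and slope by repeatedly stepping against the gradient of the mean squared error:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=50)
Y = 1.0 + 3.0 * X + rng.normal(0, 0.1, size=50)   # roughly linear data (assumed for illustration)

w0, w1 = 0.0, 0.0   # intercept and slope
lr = 0.1            # learning rate (assumed value)

for _ in range(2000):
    err = (w0 + w1 * X) - Y
    # Gradients of the mean squared error with respect to w0 and w1.
    w0 -= lr * 2 * err.mean()
    w1 -= lr * 2 * (err * X).mean()

print(w0, w1)   # should approach the closed-form OLS solution, roughly (1.0, 3.0)
```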
Consider a dataset split into training and test data:
- Training data: Used to build the model.
- Test data: Used for making predictions based on the established model.
Let’s analyze four models developed from the training data, with assumptions about how Y relates to X.
Simple Model (Degree 1):
y = f̂(x) = w0 + w1x
The fitted line diverges significantly from the data points, leading to high fitting or training error.
Degree 2 Polynomial:
y = f̂(x) = w0 + w1x + w2x²
The curve follows the data more closely than the straight line, and the training error drops.
Degree 5 Polynomial:
y = f̂(x) = w0 + w1x + … + w5x⁵
The fit tightens further, reducing the training error again.
Complex Model (Degree 20):
y = f̂(x) = w0 + w1x + … + w20x²⁰
The fitted curve aligns perfectly with all data points, resulting in minimal training error. However, this model tends to memorize the data, including noise, rather than generalizing. Consequently, it performs poorly on unseen validation data, a phenomenon known as overfitting.
When we predict with these four models on validation data, the prediction errors will vary.
Next, let's visualize the relationship between training error and prediction error against model complexity (in terms of polynomial degree).
From the graph, it’s evident that as model complexity increases (from degree 1 to degree 20), training error decreases. However, prediction error initially decreases and then starts to rise as the model becomes overly complex.
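The sketch below reproduces this pattern numerically; the sine-shaped true function, noise level, and sample sizes are assumptions made for illustration, and the very high-degree fit may trigger a conditioning warning:

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)   # assumed true function

# Training and test data drawn from Y = f(X) + ε.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 2, 5, 20):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Typically, the training MSE keeps falling as the degree grows, while the test MSE bottoms out at a moderate degree and then climbs again, mirroring the curve described above.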
This illustrates the trade-off between training error and prediction error. At the low-complexity end we observe high bias, while at the high-complexity end we observe high variance. Thus, finding the ideal model complexity involves balancing bias and variance.
Section 1.2: Defining Bias and Variance
Bias
Let’s denote f(x) as the true model and f̂(x) as its estimate. The bias is defined as:
Bias(f̂(x)) = E[f̂(x)] - f(x)
This metric indicates the discrepancy between the expected value of the estimate and the true function. To estimate E[f̂(x)], we fit f̂(x) with a fixed model form (e.g., a degree-1 polynomial) on many random samples drawn from the training data and average the resulting fits.
In the following plot, the orange curve represents the average of the complex models (degree 20) fitted to different random samples, while the green line represents the average of the simple models (degree 1).
From this, it is clear that simple models exhibit high bias, as their average deviates significantly from the true function, while complex models show low bias because their average stays close to the true function.
Variance
Variance reflects how an individual estimate f̂(x) varies around the model's expected value E[f̂(x)]:
Variance(f̂(x)) = E[(f̂(x) - E[f̂(x)])²]
Complex models tend to have higher variance since minor changes in training samples can lead to substantial differences in f̂(x). In contrast, simple models maintain relatively consistent estimates even with slight alterations to the training sample, generalizing the underlying pattern.
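To make both definitions concrete, the following sketch (again assuming a sine-shaped true function and a fixed noise level) refits a simple and a complex polynomial on many freshly drawn training samples and estimates the squared bias and variance of their predictions over a grid of inputs:

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(2 * np.pi * x)   # assumed true function

x_grid = np.linspace(0.05, 0.95, 50)   # inputs at which bias and variance are evaluated
n_repeats, n_train = 200, 30

for degree in (1, 20):
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        # Draw a fresh training sample and refit a model of the same fixed form.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, 0.2, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_grid)
    avg_pred = preds.mean(axis=0)                        # estimate of E[f̂(x)]
    bias_sq = np.mean((avg_pred - true_f(x_grid)) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                # average variance across refits
    print(f"degree {degree:2d}: bias^2 {bias_sq:.3f}, variance {variance:.3f}")
```

The degree-1 model should show the larger squared bias, and the degree-20 model the larger variance.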
Therefore, we can summarize:
- Simple models typically have high bias and low variance.
- Complex models tend to have low bias but high variance.
Chapter 2: Expected Prediction Error
The expected prediction error (EPE) decomposes into three components:
- Bias
- Variance
- Noise (Irreducible Error)
The formula for expected prediction error is:
EPE = Bias² + Variance + Irreducible Error
Using the model f̂(x), we predict the value of a new data point (x, y) not included in the training data. The expected mean square error can be expressed as:
EPE = E[(y - f̂(x))²] = Bias(f̂(x))² + Variance(f̂(x)) + Var(ε)
From this equation, it is evident that error is contingent upon both bias and variance.
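As a rough numerical check of this decomposition (the true function, noise level, evaluation point, and degree-2 model below are all assumptions for the sketch), a Monte Carlo estimate of E[(y - f̂(x))²] at a single input should approximately equal bias² plus variance plus the noise variance:

```python
import numpy as np

rng = np.random.default_rng(4)
noise_sd = 0.2                        # assumed noise level; irreducible error = noise_sd ** 2
x0 = 0.3                              # a single new input point (assumed)

def true_f(x):
    return np.sin(2 * np.pi * x)      # assumed true function

n_repeats, n_train = 2000, 30
preds = np.empty(n_repeats)
for i in range(n_repeats):
    # Refit the same degree-2 model on a fresh training sample, then predict at x0.
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, noise_sd, n_train)
    preds[i] = np.polyval(np.polyfit(x, y, 2), x0)

y_new = true_f(x0) + rng.normal(0, noise_sd, n_repeats)   # fresh noisy observations at x0
epe = np.mean((y_new - preds) ** 2)                       # Monte Carlo estimate of E[(y - f̂(x))²]
bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()
print(epe, bias_sq + variance + noise_sd ** 2)            # the two quantities should roughly agree
```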
The following observations can be made:
- High bias correlates with high prediction error.
- High variance also leads to increased prediction error.
Key Takeaways
- Simple models are characterized by high bias and low variance, often leading to underfitting.
- Complex models display low bias and high variance, frequently resulting in overfitting.
- Ideally, the best fit model will achieve low bias and low variance.
Thank you for reading! If you’re interested in more tutorials, feel free to follow me on Medium, LinkedIn, or Twitter.
The first video, "Machine Learning Fundamentals: Bias and Variance," provides a comprehensive overview of these concepts, explaining their importance in model performance.
The second video, "Bias Variance Trade-off Clearly Explained!! Machine Learning Tutorials," offers a clear breakdown of the bias-variance trade-off and its implications for model training and evaluation.