Estimating the bias-variance decomposition
Computing the confidence interval of predictions
Conformal prediction theory
A/B Testing
Machine Learning is a bit more than just training models and predicting on new data! There is a strong statistical component that we often ignore. Let’s have a little fun and apply some statistical theory to infer interesting behaviors from our models and data!
Estimating the bias-variance decomposition
The formalism
In supervised learning, we have an input x, and we are trying to learn its relation to a variable y = f(x). The assumption is that there are statistical fluctuations around that value:

y = f(x) + ϵ

where ϵ is a random noise term. Every time we observe x, we get y with a fluctuation ϵ.
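For instance, repeatedly observing the same x yields different values of y. Here is a minimal sketch, with a made-up ground-truth function f purely for illustration:

import numpy as np

rng = np.random.default_rng(42)
f = lambda x: 2 * x + 1  # hypothetical ground-truth function
x = 3.0
# Five observations of the same x give five different y values
y_observations = f(x) + rng.normal(0, 0.5, size=5)
print(y_observations)  # five values fluctuating around f(3.0) = 7.0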
The bias-variance decomposition breaks down a model's expected prediction error into three components:
Bias: The error caused by simplified assumptions in the model. High bias means the model is underfitting - it's too simple to capture the underlying patterns.
Example: Using linear regression for highly non-linear data would have high bias.
Variance: The error from the model's sensitivity to fluctuations in the training data. High variance means the model is overfitting - it's too complex and captures noise in the training data.
Example: A decision tree grown to full depth would have high variance as it perfectly fits training data.
Irreducible Error σ²: The inherent noise in the problem that can't be reduced by any model.
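To make the first two failure modes concrete, here is a minimal sketch (on made-up sine-shaped data, an assumption for illustration) comparing the train and test errors of an underfitting linear model and an overfitting full-depth tree:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical non-linear data: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), DecisionTreeRegressor()):
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(type(model).__name__, train_mse, test_mse)

# The linear model has a large error on both sets (high bias); the full-depth
# tree has ~0 train error but a noticeably larger test error (high variance).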
Mathematically, for a given point x, the expected mean squared error can be decomposed as:

E[(y − g(x))²] = Bias²(x) + Var[g(x)] + σ²

Where:

g(x) is our model's prediction

y is the true value

σ² is the irreducible error

Here, the expectation E[ . ] is taken over all possible training sets. The formula for the Bias is:

Bias(x) = E[g(x)] − E[y | x]

It represents the shift between the average output of the model E[g(x)] and the average value of y when we observe x. Since y = f(x) + ϵ and E[ϵ] = 0, we have

Bias(x) = E[g(x)] − f(x)
The formula for the variance is:

Var[g(x)] = E[( g(x) − E[g(x)] )²]

It represents how much the predictions vary when we change the training set.

The formula for the irreducible error σ² is:

σ² = Var[ϵ]

It represents the variance Var[ϵ] of the random noise ϵ.
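To see why the decomposition holds, write y = f(x) + ϵ and expand the squared error; the cross terms vanish because E[ϵ] = 0 and ϵ is independent of the training set:

E[(y − g(x))²] = E[(f(x) − g(x) + ϵ)²]
= E[(f(x) − g(x))²] + σ²
= (f(x) − E[g(x)])² + E[(g(x) − E[g(x)])²] + σ²
= Bias²(x) + Var[g(x)] + σ²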
Kohavi and Wolpert's holdout procedure

One of my favorite papers is Estimating bias and variance from data. It lists a set of strategies to estimate the bias and variance from data. One of those strategies is Kohavi and Wolpert's holdout procedure. The idea is to split the original data into a training set and a test set, and then to repeatedly sample subsets of the training set to create similar but different sub-training sets.

Then, we can train a machine learning model on each of the sub-training sets and predict on the test set. From the predictions, we can estimate the bias and variance.
To better understand the strategy, let’s implement a simple version of it! The following function captures how we sample predictions on the test set:
from sklearn.model_selection import train_test_split
import numpy as np

def kohavi_wolpert_process(X, y, learner, train_size=0.7, n_trials=200):
    # we generate the test and super-training sets
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y,
        train_size=train_size,
        random_state=123
    )
    predictions = np.zeros((n_trials, len(y_test)))
    m = len(X_pool) // 2
    for i in range(n_trials):
        # For each trial, we sample N / 2 examples without replacement
        indices = np.random.choice(len(X_pool), size=m, replace=False)
        X_train = X_pool[indices]
        y_train = y_pool[indices]
        # For each sub-training set, we train a model.
        model = learner()
        model.fit(X_train, y_train)
        # For each model, we get the predictions on the test set.
        predictions[i] = model.predict(X_test)
    return predictions, y_test
Let's now write a function that computes the estimates of the bias and variance. The bias is

Bias(x) = E[g(x)] − f(x)

So, for each sample i in the test set, we are going to compute:

Bias²(xᵢ) ≈ ( (1/N) Σⱼ gⱼ(xᵢ) − yᵢ )²

where N is the number of sub-training sets and gⱼ is the model trained on the j-th one. Notice that this estimate is itself biased because we assume that yᵢ = f(xᵢ), i.e., ϵ = 0. We only have one instance of yᵢ for each xᵢ, so we have to make that approximation.

Since the expected mean squared error is decomposed as:

E[(y − g(x))²] = Bias²(x) + Var[g(x)] + σ²

we are going to compute the average of the squared-bias estimates over all the samples:

Bias² ≈ (1/M) Σᵢ Bias²(xᵢ)

where M is the number of test samples.
def estimate_bias(predictions, true_labels):
    # We compute the mean prediction for each test sample
    mean_predictions = np.mean(predictions, axis=0)
    # We compute the estimated squared bias, averaged over the test set
    bias = np.mean((mean_predictions - true_labels) ** 2)
    return bias
To estimate the variance, we can use the typical unbiased estimate:

Var(xᵢ) ≈ (1/(N − 1)) Σⱼ ( gⱼ(xᵢ) − ḡ(xᵢ) )²

where ḡ(xᵢ) = (1/N) Σⱼ gⱼ(xᵢ). Once we have computed the estimate for each sample, we can compute the average over all the samples:

Var ≈ (1/M) Σᵢ Var(xᵢ)
def estimate_variance(predictions):
    # Unbiased variance across trials (ddof=1), averaged over the test set
    variance = np.mean(np.var(predictions, axis=0, ddof=1))
    return variance
Now, we can measure the Bias and Variance for a Linear Regression model, for example:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_friedman1

X, y = make_friedman1(
    n_samples=10000,
    n_features=100,
    noise=0.5
)

# We run the Kohavi-Wolpert process for `LinearRegression`
predictions_LR, y_test = kohavi_wolpert_process(X, y, LinearRegression)

# Compute the bias and variance
bias_LR = estimate_bias(predictions_LR, y_test)
variance_LR = estimate_variance(predictions_LR)

bias_LR, variance_LR
> 6.377, 0.096
We can do the same for a Decision Tree:
from sklearn.tree import DecisionTreeRegressor
# We run the Kohavi-Wolpert process for DecisionTreeRegressor
predictions_tree, y_test = kohavi_wolpert_process(X, y, DecisionTreeRegressor)
# Compute the bias and variance
bias_tree = estimate_bias(predictions_tree, y_test)
variance_tree = estimate_variance(predictions_tree)
bias_tree, variance_tree
> 2.339, 4.929
As expected, linear regression is a high-bias / low-variance model, whereas a decision tree is a low-bias / high-variance model!
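Because our bias estimate compares the average prediction against the noisy labels yᵢ rather than the unknown f(xᵢ), the irreducible error σ² is silently absorbed into the bias term. A quick sanity check (a sketch reusing the variables above) is therefore that the average per-trial test MSE should roughly equal the sum of the two estimates:

# Average squared error across all trials and test samples
mse_tree = np.mean((predictions_tree - y_test) ** 2)
# Should be close to bias + variance (sigma^2 is already inside the bias term)
print(mse_tree, bias_tree + variance_tree)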
Valentini and Dietterich's out-of-bag technique
There are a few problems with Kohavi and Wolpert's holdout procedure:
It requires splitting the data into two parts, and only part of the data is ever used for testing.
The fixed test set might not be representative of the whole dataset. The results could be sensitive to the particular train-test split chosen and the estimates might not generalize well.
The train sets are small subsets of the overall data, so the trained model may not be representative of models trained on the full training set.
This is what Valentini and Dietterich's out-of-bag technique attempts to address. Instead of splitting the data into train and test sets, we are going to create bootstrap samples to train our models and predict on the remaining samples (the out-of-bag (OOB) samples). With bootstrap samples, on average, ~63% of the original data will be contained in each sample, and ~37% of the data will be OOB (the probability that a given sample is never drawn is (1 − 1/n)ⁿ ≈ 1/e ≈ 0.37).
In the end, we are going to have multiple predictions for each sample in the data.
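As a quick sketch (with an arbitrary sample size of 10,000), we can check this ~63% / ~37% split empirically:

import numpy as np
from sklearn.utils import resample

n = 10000
boot = resample(np.arange(n), replace=True, n_samples=n, random_state=0)
in_bag_fraction = len(np.unique(boot)) / n
print(in_bag_fraction)      # ~0.632
print(1 - in_bag_fraction)  # OOB fraction, ~0.368 ≈ 1/e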
Here is the function to sample the predictions:
from sklearn.utils import resample

def valentini_dietterich_process(X, y, learner, n_trials=200):
    n_samples = len(X)
    # all the indices corresponding to each sample in the data
    all_indices = list(range(n_samples))
    # empty lists to store the predictions
    predictions = [[] for _ in all_indices]
    for _ in range(n_trials):
        # For each trial, we resample the indices with replacement
        bootstrap_indices = resample(
            all_indices,
            replace=True,
            n_samples=n_samples
        )
        # we get the indices of the OOB samples
        oob_indices = list(set(all_indices) - set(bootstrap_indices))
        # we train the learner on the bootstrap samples
        X_boot = X[bootstrap_indices]
        y_boot = y[bootstrap_indices]
        model = learner()
        model.fit(X_boot, y_boot)
        # we predict on the OOB samples
        X_oob = X[oob_indices]
        prediction = model.predict(X_oob)
        # we store the predictions
        for idx, pred in zip(oob_indices, prediction):
            predictions[idx].append(pred)
    return predictions
We need to adapt our functions to estimate the Bias and Variance!
def estimate_bias(predictions, true_labels):
    # Mean OOB prediction for each sample in the data
    mean_predictions = [np.mean(pred) for pred in predictions]
    # Squared bias, averaged over all samples
    bias = np.mean([
        (pred - y) ** 2 for pred, y
        in zip(mean_predictions, true_labels)
    ])
    return bias

def estimate_variance(predictions):
    # Variance of the OOB predictions for each sample, averaged over all
    # samples (population variance here; ddof=1 would match the holdout version)
    variance = np.mean([np.var(pred) for pred in predictions])
    return variance
And we can run the process to estimate those quantities:
predictions_LR = valentini_dietterich_process(X, y, LinearRegression)
bias_LR = estimate_bias(predictions_LR, y)
variance_LR = estimate_variance(predictions_LR)
predictions_tree = valentini_dietterich_process(X, y, DecisionTreeRegressor)
bias_tree = estimate_bias(predictions_tree, y)
variance_tree = estimate_variance(predictions_tree)
bias_LR, variance_LR, bias_tree, variance_tree
> 6.206, 0.062, 2.036, 4.251
This leads to similar values and to the same conclusions about the level of bias and variance in models like linear regression and decision trees.
Computing the confidence interval of predictions
“How confident are you in this specific prediction?” Most Machine Learning engineers couldn't care less about this question! We know the prediction is correct a certain percentage of the time in general, but we often cannot say much about a specific prediction. That is fine when we predict personalized sales conversion, but what about when lives are at stake? For example, how confident are the predictions of the risk of heart disease in patients?