**With class imbalance, the odds are not in your favor. Or are they? Working with imbalanced data is not always an issue, and it is not clear that balancing the dataset actually improves predictive performance! We cover:**

- **Are imbalanced datasets bad?**
- **Why is AUC biased for imbalanced data?**
- **How to optimally sample imbalanced data?**
- **Learn more about imbalanced datasets: articles, YouTube videos, GitHub repositories**

## Are imbalanced datasets bad?

Working with imbalanced data is not always an issue! I often see Data Scientists overly concerned about it, adjusting the target variable distribution without a second thought. I am not saying rebalancing cannot have a positive effect, but it is not always clear that it does. Are you sure your model actually gets better when you adjust the distribution? Even if you measure predictive performance on the unweighted test set, the readings might not be fully comparable.

By balancing a dataset, you modify the distribution of the predicted probabilities. A dataset with a 10% - 90% distribution of the binary target variable will likely lead to a model with an average predicted probability <p> = 0.1, whereas a dataset with a 50% - 50% distribution should lead to a model closer to <p> = 0.5. Because the Precision, Recall, and Accuracy metrics are usually computed by thresholding the inferred probabilities at 0.5, two models trained on those two different distributions will yield vastly different readings for those metrics. What would the readings be if you thresholded at 0.1 instead? A probabilistic metric like cross-entropy does not suffer from that thresholding effect.
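To make this concrete, here is a minimal sketch with invented scores: the same ranking of samples, once on the original probability scale and once inflated as a rebalanced model would produce, evaluated at two thresholds.

```python
# Hypothetical scores for a 10%-90% problem (2 positives out of 10 samples).
# A model trained on rebalanced 50%-50% data would output roughly larger
# probabilities for the same samples; we fake that by scaling by 5.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p_imbalanced = [0.40, 0.35, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03]
p_balanced = [min(0.99, 5 * p) for p in p_imbalanced]

def recall_at(scores, labels, threshold):
    # Fraction of true positives recovered when thresholding at `threshold`
    tp = sum(1 for p, c in zip(scores, labels) if c == 1 and p > threshold)
    return tp / sum(labels)

# At the conventional 0.5 threshold, the imbalanced model "misses" every
# positive while the rebalanced one catches them all -- yet they rank the
# samples identically. The metric changed, not the model's ranking.
print(recall_at(p_imbalanced, labels, 0.5))  # 0.0
print(recall_at(p_balanced, labels, 0.5))    # 1.0
# Threshold at the base rate instead, and the imbalanced model looks fine:
print(recall_at(p_imbalanced, labels, 0.1))  # 1.0
```

The numbers here are made up for illustration, but the effect is general: any threshold-based metric is sensitive to where the probability mass sits.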

AUC is a metric more robust to changes in the target distribution; however, it tends to overestimate the predictive performance of models on extremely imbalanced datasets. This is because AUC considers the True Positive Rate (TPR) and the False Positive Rate (FPR) on the same scale, but for very imbalanced probability distributions, the TPR will be measured as ~1 for most probability thresholds, shooting the AUC up to unrealistic values close to 1. The Precision-Recall AUC is a much more actionable metric for highly imbalanced data, as it does not suffer from this problem. See the following articles for more information.
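A small synthetic illustration of this bias (the data is invented for the example): with 1000 negatives and only 2 positives, a model can score a near-perfect ROC AUC while its precision at any useful threshold is poor.

```python
def roc_auc(pos_scores, neg_scores):
    # ROC AUC equals the Mann-Whitney statistic: the probability that a
    # random positive is ranked above a random negative (ties count 0.5).
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.9]                  # the 2 positive samples
neg = [0.95] * 10 + [0.1] * 990   # 1000 negatives, 10 of them scored high

print(roc_auc(pos, neg))  # 0.99 -- looks like a near-perfect model

# Yet at a threshold that recovers both positives, precision is terrible,
# because even a tiny FPR means many false positives in absolute terms:
threshold = 0.5
tp = sum(p > threshold for p in pos)
fp = sum(n > threshold for n in neg)
print(tp / (tp + fp))  # 2 / 12 ~ 0.17
```

Only 1% of the negatives outrank the positives, so the FPR (and hence the ROC AUC) barely notices them; precision, which compares false positives to the tiny positive class, collapses.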

When I was working on Ads Ranking problems, we had a lot of data for a very imbalanced problem, so we downsampled the majority class to make the computations more manageable. But we were not looking for a perfect balance, as imbalanced data does not generally lead to worse models. We recalibrated the probabilities to make sure the Click-Through Rate was equal to the average predicted probability <p> by using the simple formula:

*q = p / (p + (1 - p) / s)*

where *p* is the probability predicted by the model trained on downsampled data, *q* is the recalibrated probability, and *s* is the negative sampling rate.
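A quick sanity check of the recalibration formula *q = p / (p + (1 − p) / s)* on a synthetic population (numbers invented for the example): downsample negatives, then verify that recalibration recovers the original base rate exactly.

```python
def recalibrate(p, s):
    # Map a probability learned on negative-downsampled data back to the
    # original scale; s is the fraction of negatives that was kept.
    return p / (p + (1 - p) / s)

ctr = 0.01  # true Click-Through Rate: 1% positives
s = 0.1     # keep 10% of the negatives

# After downsampling, the positive rate the model sees is inflated:
p_downsampled = ctr / (ctr + (1 - ctr) * s)
print(p_downsampled)                  # ~0.092, far from the true 1%

# Recalibrating maps it back to the true CTR:
print(recalibrate(p_downsampled, s))  # 0.01
```

The round trip is exact for the base rate, which is why the average recalibrated probability matches the true CTR.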

Lastly, is it still necessary to emphasize how useless SMOTE is for imbalanced datasets compared to simply reweighting the loss function? SMOTE oversamples the minority class(es) to increase the information given to the model about them. It is computationally very costly and mathematically equivalent to changing the weights given to the different classes (e.g., sample_weight). The one case where drastically changing the weights can be risky is when training a neural network: if the batches are too small, you risk gradient explosions because the per-batch loss will have a higher variance. Other than that use case, why would you choose SMOTE over simply adjusting the weights in the loss function?!
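The equivalence claim is easy to verify for plain oversampling (duplicating minority samples, which is the degenerate case of SMOTE without interpolation): duplicating a sample k times gives exactly the same loss as weighting it by k. A minimal sketch with made-up predictions:

```python
import math

def weighted_log_loss(scores, labels, weights):
    # Weighted binary cross-entropy, normalized by the total weight
    total_w = sum(weights)
    return -sum(w * (c * math.log(p) + (1 - c) * math.log(1 - p))
                for p, c, w in zip(scores, labels, weights)) / total_w

labels = [1, 0, 0, 0]
scores = [0.6, 0.2, 0.3, 0.1]

# (a) reweight the single positive 3x in the loss (sample_weight style)
loss_weighted = weighted_log_loss(scores, labels, [3, 1, 1, 1])

# (b) physically duplicate the positive 3x (plain oversampling)
labels_os = [1, 1, 1, 0, 0, 0]
scores_os = [0.6, 0.6, 0.6, 0.2, 0.3, 0.1]
loss_oversampled = weighted_log_loss(scores_os, labels_os, [1] * 6)

print(loss_weighted == loss_oversampled or
      abs(loss_weighted - loss_oversampled) < 1e-12)  # identical losses
```

SMOTE additionally interpolates between minority neighbors rather than duplicating, but the information fed to the loss is essentially the same as upweighting, at a much higher computational cost.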

## Why is AUC biased for imbalanced data?

When I discovered the ROC AUC metric, I thought it was the perfect metric! It is normalized to the [0,1] range, it is a probabilistic metric (kind of!) and it is robust to imbalance in the target classes. I once had a very imbalanced dataset, and I kept getting amazing predictive performance as measured by ROC AUC. That is when I discovered the limitations of that metric!

The ROC curve plots the True Positive Rate (TPR) as a function of the False Positive Rate (FPR). We have:

*TPR = TP / (TP + FN)*

*FPR = FP / (FP + TN)*

The curve is drawn by ranking the probability scores of the model and computing those rates for each value of the score used as a threshold to assign the classes. A low threshold (e.g. *p = 0.1*: predict 1 if *p > 0.1*, 0 otherwise) will lead to a high number of TP but also a high number of FP. A high threshold (e.g. *p = 0.9*: predict 1 if *p > 0.9*, 0 otherwise) will lead to a low number of TP but also a low number of FP. In fact, at *p = 0*, we have:

*TPR = 1, FPR = 1*

And for *p = 1*, we have:

*TPR = 0, FPR = 0*
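These two endpoints of the ROC curve can be checked directly; a minimal sketch with made-up labels and scores:

```python
def roc_point(scores, labels, threshold):
    # Compute (FPR, TPR) for a given decision threshold: predict 1 if p > threshold
    tp = sum(1 for p, c in zip(scores, labels) if c == 1 and p > threshold)
    fn = sum(1 for p, c in zip(scores, labels) if c == 1 and p <= threshold)
    fp = sum(1 for p, c in zip(scores, labels) if c == 0 and p > threshold)
    tn = sum(1 for p, c in zip(scores, labels) if c == 0 and p <= threshold)
    return fp / (fp + tn), tp / (tp + fn)

labels = [1, 0, 1, 0, 0]
scores = [0.8, 0.4, 0.6, 0.2, 0.1]

print(roc_point(scores, labels, 0.0))  # (1.0, 1.0): everything predicted positive
print(roc_point(scores, labels, 1.0))  # (0.0, 0.0): everything predicted negative
```

Every ROC curve therefore starts at (0, 0) and ends at (1, 1), whatever the model or the class balance.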

Now what happens in the case of imbalanced classes? We have *N ≫ P* (many more negative samples than positive ones). As a consequence, the probability score distribution will tend to be skewed toward smaller values: most of the scores will be small, and most of the positive samples will sit on the high side of the probability score. Let's consider the following label (c) and resulting score (p) lists:
