Bias and Variance
Bias and Variance in ML (Figure 9) are fundamental concepts, and expert practitioners usually have a deep understanding of bias/variance-related topics.
In the deep learning era there is less discussion of the bias/variance trade-off, simply because there is less of a trade-off: the concepts of bias and variance remain central, but their trade-off is no longer as important.
The reason is that in the pre-deep-learning era you could usually reduce bias at the cost of increasing variance, or vice versa, but it was generally not possible to reduce just one of the two. In deep learning, as long as you can train a bigger network (more layers or hidden units) you will generally reduce bias without impacting variance (if you regularize properly), and as long as you can get more data you will generally reduce variance without impacting bias.
Identify bias/variance from subset error
When only two features are present we can just look at the model (Figure 9) and identify situations of high bias (panel A) or high variance (panel C).
When many features are present we can no longer visualize the model but we can employ some metrics that will help us identify these problems.
Suppose you have a classifier that should identify cat pictures. So $y=1$ for a picture of a cat and $y=0$ for any other picture.
Suppose you fit your model on the training set and then measure the error on both the training set and development set and obtain the error as in the table below.
Error on | case 1 | case 2 | case 3 | case 4 |
---|---|---|---|---|
train set | 1% | 15% | 15% | 0.5% |
dev set | 11% | 16% | 30% | 1% |
Assuming that a person would have an error $\approx 0%$ and that the train and dev sets are drawn from the same distribution:
- case 1 is a case of high variance
- case 2 is a case of high bias
- case 3 is a case of high bias AND high variance (the worst scenario)
- case 4 is a case of low bias and low variance (the best scenario)
It is important to notice that we diagnosed bias and variance under the assumption that the optimal error, also called Bayes error (ML33), is $\approx 0%$. If the Bayes error were instead $\approx 15%$, we would say that case 2 is a case of low bias and low variance.
The difference between Bayes error and training set error is sometimes called avoidable bias, and the objective is usually to reduce this gap between training error and Bayes error. In the same way, we can define the variance as the gap between the training set error and the dev set error.
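As a minimal illustration in code, these two gaps can be computed directly from the three error estimates (here reusing case 2 from the table above with an assumed Bayes error of 15%):

```python
def bias_variance_diagnosis(bayes_error, train_error, dev_error):
    """Return (avoidable bias, variance) from three error rates in [0, 1]."""
    avoidable_bias = train_error - bayes_error  # gap to the best achievable error
    variance = dev_error - train_error          # gap due to imperfect generalization
    return avoidable_bias, variance

# Case 2 from the table, assuming a Bayes error of ~15%
avoidable_bias, variance = bias_variance_diagnosis(0.15, 0.15, 0.16)
print(f"avoidable bias: {avoidable_bias:.1%}, variance: {variance:.1%}")
# -> avoidable bias: 0.0%, variance: 1.0%
```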
Human level performance
Deep-learning applications are often compared with human-level performance. First, because ML has become so advanced that it can perform as well as, or better than, humans. Second, because it turns out that the model-building workflow is much more efficient when the algorithm is trying to do something that humans can also do. In those settings, it becomes natural to compare with human-level performance.
Typically, when designing a ML model, performance will rapidly increase at first, approach (and possibly surpass) human-level performance, and then asymptotically approach an ideal performance, called Bayes optimal error: the error of the best theoretical function mapping $x \to y$, which can never be surpassed.

Interestingly, the performance increase usually slows down considerably after surpassing human-level performance. This happens for (at least) two reasons:
- Human level performance is, for many tasks, very close to Bayes optimal error
- So long as ML is worse than humans, there are certain things that we can do to improve performance:
  - get labeled data from humans
  - gain insight from manual error analysis (why did a person get this right?)
  - better analysis of bias/variance (ML25)
Define human level performance
The first step towards defining the human-level error for a certain task is to decide what purpose its definition should serve.
Suppose you are developing a model for the classification of radiology scans, and you find the following typical errors:
- common person: 3% error
- common doctor: 1% error
- experienced radiologist: 0.7% error
- team of experienced radiologists: 0.5%
If, as is often the case, you are using human-level error as a proxy for Bayes error, you would choose option 4 (0.5%) as your human-level performance. If instead you are developing a model for research purposes and you want to demonstrate that it is usable in a real-world environment, you would probably choose option 2 (1%) as human-level performance.
Manual Error analysis
Suppose you are training a cat classifier which has 90% accuracy (10% error), and you notice that some dogs are incorrectly classified as cats. Should you try to make your cat classifier perform better on dogs?
In order to answer this question, it is often advisable to manually look at the data and estimate the maximum performance gain you could obtain from a specific action; this is called error analysis.
In this case, you would extract around 100 misclassified examples from the dev set and count how many of them are dogs. If 5% of them are dogs, the best you can hope for by fixing the dog problem is to lower your error rate from 10% to 9.5%. However, if 50% of the misclassified examples are dogs, then you may be able to reduce your error from 10% to 5%.
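Written as a quick ceiling calculation with the numbers from this example:

$$\text{error}_{\text{after fixing one category}} \geq \text{error}_{\text{current}} \times (1 - \text{fraction of errors in that category})$$

$$10\% \times (1 - 0.05) = 9.5\% \qquad\qquad 10\% \times (1 - 0.50) = 5\%$$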
In this example we analyzed a single problem (dogs misclassified as cats) but, while you sort through the misclassified examples of your dev set, you can evaluate multiple options in parallel. For example, you may notice that some misclassified examples are great cats (e.g. lions, panthers) and some others are blurry images (a small counting sketch follows the table below).
Image | Dog | Great Cats | Blurry |
---|---|---|---|
1 | x | | |
2 | x | | |
3 | x | x | |
$\vdots$ | | | |
% of total | 8% | 43% | 61% |
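A minimal sketch of this bookkeeping (the tags and entries below are hypothetical; in practice there would be one entry per misclassified dev example): the fraction of errors in a category bounds the gain obtainable by fixing that category.

```python
from collections import Counter

# Hypothetical tags for the misclassified dev set examples.
# Each example can belong to several categories at once.
misclassified = [
    {"dog"},
    {"great_cat"},
    {"great_cat", "blurry"},
    {"blurry"},
    # ... one entry per misclassified example
]

counts = Counter(tag for tags in misclassified for tag in tags)
n = len(misclassified)
dev_error = 0.10  # overall dev set error rate

for tag, count in counts.most_common():
    fraction = count / n
    ceiling = dev_error * (1 - fraction)  # best error achievable by fully fixing this category
    print(f"{tag:10s} {fraction:5.1%} of errors -> error could drop to {ceiling:.2%} at best")
```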
Special case of incorrectly labeled examples
Deep learning algorithms are quite robust to random (or near-random) errors in the training set, but they are less robust to systematic errors. When it comes to incorrectly labeled examples in the dev set, it is often a good idea to keep a separate count for them during manual error analysis.
The decision on whether or not to go through your entire dev set and correct the incorrectly labeled examples follows the same kind of evaluation as the other types of error found in manual error analysis. Since the goal of the dev set is to help you select between different models, if you no longer trust the performance measured on your dev set, then by all means you should correct the labels.
Image | Dog | Great Cats | Blurry | Incorrectly labeled |
---|---|---|---|---|
1 | | | | |
2 | | | | x |
3 | | | | x |
$\vdots$ | | | | |
% of total | 8% | 43% | 61% | 6% |
A few important caveats are:
- Apply the same process to your dev and test sets to make sure that they continue to come from the same distribution
- Consider examining examples your algorithm got right as well as ones it got wrong.
- You don’t need to apply the label corrections to the train set if you apply them to your dev/test set; the train and dev/test data may then come from slightly different distributions.
Mismatched train and dev/test set
In the deep learning era, the need for data is such that sometimes you will use whatever data you have, regardless of whether it comes from the same distribution as the dev/test set. Training on a different distribution than the dev/test set is possible, but it requires some considerations.
Suppose we are training a model that needs to classify pictures uploaded by users. These pictures are few in number and are typically blurry and shot by non-professional photographers. On the other hand, you have access to a lot of professionally shot, high-resolution pictures from the web.
Your final objective is for your model to perform well on user-uploaded pictures, and you find yourself facing a dilemma:
- you have a small dataset whose distribution is the one your model needs to perform well on
- you have a large dataset, with a different distribution, on which you could train your model
Option 1: Merge the datasets
One option would be to merge the two datasets, randomly reshuffle them, and then split them into training, dev and test sets.
- Advantage: This approach has the advantage that the train and dev/test set will all come from the same distribution.
- Disadvantage: The huge disadvantage is that a big portion of the dev set will be composed of high-quality web pictures. This means that model selection (driven by the results on the dev set) will end up optimizing for a distribution different from the one we actually care about.
Option 2: Test on the target dataset
Split your target dataset (the user-uploaded pictures) into two portions of appropriate size: one portion goes into the training set, the other becomes the dev/test sets. All of the high-resolution web pictures go into the training set (a possible split is sketched after the list below).
- Advantage: You evaluate your model on the target distribution, so model selection optimizes exactly for the distribution we care about.
- Disadvantage: The training distribution is now different from the dev/test distribution; however, in the long run this way of splitting the data usually gives better performance.
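As anticipated above, here is a minimal sketch of this split; the dataset sizes (200,000 web images, 10,000 user images) and file names are made up:

```python
import random

# Hypothetical data: 200k high-quality web images, 10k user-uploaded images.
web_images = [f"web_{i}.jpg" for i in range(200_000)]
user_images = [f"user_{i}.jpg" for i in range(10_000)]

random.seed(0)
random.shuffle(user_images)

# Option 2: all web images go into training; the user images are split so that
# the dev and test sets contain ONLY the distribution we care about.
train_set = web_images + user_images[:5_000]
dev_set = user_images[5_000:7_500]
test_set = user_images[7_500:]

print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```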
Bias and Variance with mismatched data
Suppose we are in a situation in which our training set comes from a different distribution than our dev/test set, and let’s assume that the human-level error is $\approx 0%$. We measure:
- Training error: 1%
- Dev error: 10%
In a setting where the train and dev sets come from the same distribution, this would be a case of high variance. But when the train and dev sets come from different distributions, this is no longer obvious: the 9% increase in error might be due to variance, but it could also be due to the different distributions of the train and dev sets. In order to decouple these two problems, we define an additional dataset, called the training-dev set, which has the same distribution as the training set but is not used for training.
Before training we define 4 datasets:
- Training set
- Training-dev set (same distribution as the training set but excluded from training)
- Dev set (different distribution than training set)
- Test set (if needed)
Suppose we find:
Error | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 |
---|---|---|---|---|---|
Training error | 1% | 1% | 10% | 10% | 5% |
Training-dev error | 9% | 1.5% | 11% | 11% | 7% |
Dev error | 10% | 10% | 12% | 20% | 4% |
Let’s take Case 1. We observe that the error increases by 8% from the training data to the training-dev data. Since there is no difference in distribution between the two datasets, and the only difference is that the model hasn’t been trained on the training-dev data, we conclude that the model suffers from high variance.
Let’s take Case 2. We observe that the error increases by only 0.5% from the training data to the training-dev data. We conclude that we have low variance and that the remaining difference in performance must be a data-mismatch problem.
Let’s take Case 3. Remember that the human-level error (our proxy for Bayes error) is $\approx 0%$. We observe almost no difference in error between the datasets, but a large gap between Bayes error and training error. This is an avoidable-bias problem.
Let’s take Case 4. We observe a large gap between Bayes error and training error, and a large gap between training-dev error and dev error. We deduce that this model has two issues: an avoidable-bias problem and a data-mismatch problem.
Let’s take Case 5. We observe that the dev error is smaller than the training error. We deduce that the training data is harder for the model than the dev data.
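The same reasoning can be written as a small helper (a sketch; the error values below are taken from Case 4 of the table, with human-level error used as the Bayes-error proxy):

```python
def error_decomposition(bayes_error, train_error, train_dev_error, dev_error):
    """Split the observed errors into the three gaps discussed above."""
    return {
        "avoidable_bias": round(train_error - bayes_error, 4),   # model vs. best achievable error
        "variance": round(train_dev_error - train_error, 4),     # same distribution, unseen data
        "data_mismatch": round(dev_error - train_dev_error, 4),  # distribution shift from train to dev
    }

# Case 4 from the table, with human-level (Bayes proxy) error ~= 0%
print(error_decomposition(0.00, 0.10, 0.11, 0.20))
# -> {'avoidable_bias': 0.1, 'variance': 0.01, 'data_mismatch': 0.09}
```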
A more general formulation
Error measured on | Training distribution | Dev distribution |
---|---|---|
Humans | general human-level error | ? (human-level error on the dev distribution) |
Examples the NN has trained on | training error | ? (error on dev-distribution data included in the training set) |
Examples the NN hasn’t trained on | training-dev error | dev/test error |
We never discussed the two cells marked with a question mark, but measuring the errors in those two settings can give us useful information about the next steps to take.
Addressing data mismatch
There are no completely systematic strategies to address data-mismatch problems, but there are some things we can try.
The first thing we can do is to carry out a manual error analysis on the dev set to understand how it differs from the training set (e.g. blurry images, noisy audio).
Once we have gained insight into how the train and dev sets differ, we can try to make the training data more similar to the dev data; alternatively, we can collect more training data from the same (or a similar) distribution as the dev set.
One way to make the training data more similar to the dev data is artificial data synthesis: combine clean training data with noise resembling the kind of noise you might encounter in the dev set. There is one big caveat with artificial data synthesis: if we have a large amount of clean training data but only a small amount of noise data, and we apply the noise to all the training data by repeating it over and over, there is a chance that the model will overfit that specific noise and fail to generalize to all possible noise of that category (e.g. blur).
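A minimal sketch of the idea for audio data (all names, shapes and the noise_level parameter are made up; a real pipeline would use an audio library): the important detail is sampling from as many different noise clips and offsets as possible instead of looping a single one.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_noisy_example(clean, noise_clips, noise_level=0.1):
    """Mix a clean waveform with a randomly chosen segment of a random noise clip.

    Drawing a different clip and offset every time reduces the risk of the
    model overfitting to one specific noise recording.
    """
    noise = noise_clips[rng.integers(len(noise_clips))]
    start = rng.integers(max(1, len(noise) - len(clean)))
    segment = noise[start:start + len(clean)]
    segment = np.pad(segment, (0, len(clean) - len(segment)))  # pad if the clip is short
    return clean + noise_level * segment

# Hypothetical data: clean 1-second clips and longer noise recordings at 16 kHz.
clean_clips = [rng.standard_normal(16_000) for _ in range(100)]
noise_clips = [rng.standard_normal(160_000) for _ in range(20)]

synthetic_train = [synthesize_noisy_example(c, noise_clips) for c in clean_clips]
print(len(synthetic_train), synthetic_train[0].shape)  # 100 (16000,)
```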
Basic recipe for correct training
This is a basic recipe to apply when training a model: check for avoidable bias (training error vs. human level) and, if needed, get a bigger network; check for variance (training-dev or dev error vs. training error) and, if needed, get more data or regularize; check for data mismatch (dev error vs. training-dev error) and, if needed, make the training data more similar to the dev data; then iterate.
Training is an iterative process
Getting the training process right on the first try is almost impossible, even with a lot of experience in a specific field. As hinted in the previous paragraph, iterating over different options is fundamental. It is very important not to overthink the model, especially in the first iterations. A rule of thumb for moving quickly towards good results is to build the first system quickly, then iterate:
- Set up dev/test set and metric
- Build the first system quickly
- Use Bias/Variance analysis and Error analysis to prioritize the next steps and iterate
Orthogonalization
In training a ML algorithm, it is important to know what to tune in order to achieve a certain effect. A state in which each hyper-parameter tunes exactly one aspect of the model is a state of perfect orthogonalization, and it is the state we would like to achieve.
In machine learning, the effects that you want to orthogonalize are:
- Fit training set well on cost function (for some applications this means approaching human-level performance): e.g. tune by getting a bigger network
- Fit dev set well on cost function: e.g. tune by getting a bigger training set
- Fit test set well on cost function: e.g. tune by getting a bigger dev set
- Perform well in the real world: e.g. tune by changing the cost function
An action that doesn’t fit well with orthogonalization is early stopping (ML27), since it tries to simultaneously tune train set and dev set performance.
Satisficing and Optimizing metrics
We have talked in ML17 about the importance of having a single real-number evaluation metric. However, it is not always easy to combine all the desired properties of a model in a single metric. In those cases it is useful to set satisficing and optimizing metrics.
Let’s say that for an image classifier we care about the classification accuracy and about the running time. Suppose we have three classifiers as in the table below.
Classifier | Accuracy | Running time |
---|---|---|
A | 90% | 80ms |
B | 92% | 95ms |
C | 95% | 1500ms |
We may set some rules that we want models to be subject to:
- maximize accuracy
- running time $\leq 100 \mathrm{ms}$
In this case, accuracy is an optimizing metric because we want it to be as good as possible, while running time is a satisficing metric because we only need it to stay below a given threshold, after which we don’t care about further improvements. In general, if we decide that we care about $N$ metrics, we should have 1 optimizing metric and $N-1$ satisficing metrics.
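A minimal sketch of this selection rule, using the (hypothetical) classifiers from the table above: filter by the satisficing metric first, then pick the best value of the optimizing metric.

```python
# (accuracy, running time in ms) for the three classifiers in the table above
classifiers = {"A": (0.90, 80), "B": (0.92, 95), "C": (0.95, 1500)}

MAX_RUNTIME_MS = 100  # satisficing metric: running time must be <= 100 ms

# Keep only the models that satisfy the constraint...
acceptable = {name: m for name, m in classifiers.items() if m[1] <= MAX_RUNTIME_MS}
# ...then optimize: pick the highest accuracy among them.
best = max(acceptable, key=lambda name: acceptable[name][0])

print(best)  # -> B
```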