Mini-batch gradient descent
When training a deep-learning model, your training set itself can create some problems during training.
For example, the MNIST training set consists of 60,000 images. Training an algorithm on the whole dataset at once would:
- require a lot of memory
- accumulate rounding errors
- slow down training
In order to prevent these problems, the training set is usually split into mini-batches. For example you might split a 5,000,000-example training set into 1,000 mini-batches (indexed by $t$) of 5,000 training examples each. You would have 1,000 pairs of matrices $X^{\{t\}}, Y^{\{t\}}$.
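As a minimal sketch of how such a split could be done (assuming examples are stored as columns of NumPy arrays; the function name `make_minibatches` and the shuffling step are illustrative additions, not something prescribed above):

import numpy as np

def make_minibatches(X, Y, m_b, seed=0):
    # Split (X, Y) into mini-batches of size m_b; examples are columns.
    rng = np.random.default_rng(seed)
    m = X.shape[1]                          # total number of training examples
    perm = rng.permutation(m)               # shuffle the examples before splitting
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [
        (X_shuf[:, t:t + m_b], Y_shuf[:, t:t + m_b])   # (X^{t}, Y^{t})
        for t in range(0, m, m_b)
    ]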
From here you would proceed by iterating over your 1,000 mini-batches in each training epoch. Below you can see some pseudocode representing the process, where I focus on epochs and mini-batches and remain less rigorous about the layers:
iteration = 0
for epoch in range(n_epochs):                # loop over the training epochs
    for t in range(n_minibatches):           # 1,000 mini-batches of size 5,000 each
        iteration += 1                       # one iteration per mini-batch
        a = forward_prop(w, x[t], b)         # forward pass on mini-batch t
        J[t] = compute_cost(a, y[t])         # cost of mini-batch t
        dw, db = backprop(x[t], y[t])        # gradients from mini-batch t
        w, b = update_weights(w, b, dw, db)  # gradient-descent step
Whereas in batch gradient descent you expect the value of the cost function $J$ to decrease monotonically with the number of iterations (if it does not, that is a signal of some error in the implementation of gradient descent), in mini-batch gradient descent $J$ can locally increase or decrease from one $t$ to the next, depending on how hard the examples in each mini-batch are. There should still be a downward trend as $t$ increases (Figure 60).
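As a minimal sketch of how this could be visualised (assuming matplotlib is available and that `costs` is a flat list holding the value of $J$ computed on each mini-batch, in the order the mini-batches were visited during training):

import matplotlib.pyplot as plt

plt.plot(costs)                              # one point per iteration (per mini-batch)
plt.xlabel("iteration (mini-batch)")
plt.ylabel("cost J")
plt.show()                                   # jagged curve with an overall downward trend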

Size of mini-batch
The size of the mini-batch ($m_b$) is a hyperparameter that you will need to choose.
- If $m_b=m$ we have classic batch gradient descent, where $(X^{\{1\}}, Y^{\{1\}}) = (X, Y)$. Batch gradient descent will bring our cost function steadily towards the minimum and to convergence.
- If $m_b=1$ we have an algorithm called stochastic gradient descent, where every example is its own mini-batch, $(X^{\{1\}}, Y^{\{1\}}) = (x^{(1)}, y^{(1)})$. Stochastic gradient descent will bring our cost function towards the minimum on average, but with many detours, and it will never truly converge (it keeps oscillating around the minimum).
- In practice $1 < m_b < m$, which gives the fastest learning. Typical mini-batch sizes are powers of 2, which tend to run faster because of how computer memory architectures work; commonly used sizes are $64, 128, 256, 512$ (see the short comparison after this list).
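To make the three regimes concrete, here is a small illustrative calculation using the 5,000,000-example figure from above (the exact numbers are only for illustration):

m = 5_000_000                          # training-set size from the example above
for m_b in (m, 1, 512):                # batch GD, stochastic GD, a typical mini-batch size
    b = m // m_b                       # full mini-batches per epoch (ignores a partial last batch)
    print(f"m_b = {m_b}: {b} parameter updates per epoch")

Batch gradient descent makes a single update per epoch, stochastic gradient descent makes 5,000,000, and $m_b=512$ gives 9,765 full mini-batches per epoch (plus one smaller final batch of 320 examples).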
Nomenclature
The introduction of mini-batches and epochs creates ambiguity in some terminology. To disambiguate, especially in deep learning:
- The dataset of size $m$ is split into $b$ mini-batches of size $m_b$
- Each iteration the algorithm visits one mini-batch
- Each epoch the algorithm visits the entire dataset (all the mini-batches)
When $m_b=m$ (batch gradient descent) the definitions of iteration and epoch coincide, which is why we will sometimes find epochs referred to as iterations.
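As a tiny worked example of this nomenclature, again with the figures used earlier (5,000,000 examples split into mini-batches of 5,000; the number of epochs is arbitrary):

m, m_b, n_epochs = 5_000_000, 5_000, 10
b = m // m_b                           # mini-batches per epoch -> 1000
iterations = n_epochs * b              # one iteration per mini-batch visited -> 10000
print(b, iterations)
# With m_b = m there is a single mini-batch, so one iteration is one epoch.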