Gradient descent

So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That’s where gradient descent comes in.

Imagine that we graph the cost function as a function of its parameters $\theta_0$ and $\theta_1$. We are not graphing $x$ and $y$ themselves, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put $\theta_0$ on the $x$ axis and $\theta_1$ on the $y$ axis, with the cost function on the vertical $z$ axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts such a setup.

*Figure: surface plot of the cost function $J(\theta_0,\theta_1)$ over the parameter range; the red point marks the minimum.*
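
A minimal sketch of how such a surface plot can be produced, with hypothetical toy data (`x`, `y` below are illustrative) and the non-deprecated `fig.add_subplot(projection='3d')` call:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])   # toy inputs (assumption)
y = np.array([2.0, 2.5, 3.5, 4.5])   # toy targets (assumption)
m = len(y)

# Grid over the parameter range of theta0 and theta1
theta0, theta1 = np.meshgrid(np.linspace(-2, 4, 100), np.linspace(-1, 3, 100))

# J(theta0, theta1) = 1/(2m) * sum_i (theta0 + theta1 * x_i - y_i)^2 on the grid
J = np.zeros_like(theta0)
for i in range(m):
    J += (theta0 + theta1 * x[i] - y[i]) ** 2
J /= 2 * m

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # replaces the deprecated fig.gca(projection='3d')
ax.plot_surface(theta0, theta1, J, cmap='viridis', alpha=0.8)

# Mark the grid minimum with a red point
i_min = np.unravel_index(np.argmin(J), J.shape)
ax.scatter(theta0[i_min], theta1[i_min], J[i_min], color='red', s=50)

ax.set_xlabel(r'$\theta_0$')
ax.set_ylabel(r'$\theta_1$')
ax.set_zlabel(r'$J(\theta_0,\theta_1)$')
plt.show()
```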

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red point shows the minimum in the graph.

The way we do this is by taking the derivative of our cost function. The derivative at a point is the slope of the tangent line to the function at that point, and it gives us a direction to move in. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter $\alpha$, which is called the learning rate.

For example, the distance between the ‘stars’ in the graph above represents a step whose size is determined by our parameter $\alpha$. A smaller $\alpha$ results in smaller steps and a larger $\alpha$ in larger steps. The direction in which each step is taken is determined by the partial derivatives of $J(\theta_0,\theta_1)$. Depending on where one starts on the graph, one can end up at different points: the image above shows two different starting points that end up in two different places.

The gradient descent algorithm consists of repeating until convergence:

\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\]

where $j=0,1$ represents the feature index number.

When the partial derivative with respect to $\theta_j$ is solved analytically, gradient descent becomes:

\[\begin{align} & \text{repeat until convergence } \big\lbrace \\ & \theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^m \left(h_\theta \left(x^{(i)}\right) - y^{(i)} \right)x_j^{(i)} \qquad \text{for } j := 0 \dots n \\ &\big\rbrace \end{align}\]

Within each iteration, one should simultaneously update the parameters $\theta_0, \theta_1,\dots,\theta_n$. Updating a specific parameter before computing the others in the same iteration would yield a wrong implementation.
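
As a sketch (not the course’s reference implementation), the simultaneous update can be made explicit with a temporary copy of $\theta$. Here `X` is assumed to be an $m \times (n+1)$ design matrix whose first column is all ones, and the function name is illustrative:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y        # h_theta(x^(i)) - y^(i) for all i, from the OLD theta
        theta_new = theta.copy()      # temporary copy enforces the simultaneous update
        for j in range(n):
            theta_new[j] = theta[j] - alpha / m * np.sum(errors * X[:, j])
        theta = theta_new             # all parameters replaced at once
    return theta
```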

When written in vectorized form, the simultaneous update is already encoded in the expression. The gradient is

\[\nabla_\theta J(\theta)=\frac{1}{m}X^T (X\theta-y)\]

so the update $\theta := \theta - \frac{\alpha}{m} X^T (X\theta-y)$ modifies all parameters at once, using a gradient computed entirely from the old $\theta$.
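A minimal sketch of the same loop in vectorized form; because the whole gradient is computed from the old $\theta$ before the assignment, no temporary copy is needed (the function name is again illustrative):

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.01, num_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # gradient = (1/m) * X^T (X theta - y), evaluated fully before theta changes,
        # so the simultaneous update is automatic
        theta = theta - alpha / m * (X.T @ (X @ theta - y))
    return theta
```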


Feature scaling

We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so it will oscillate inefficiently down to the optimum when the input variables have very different scales.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

\[-1 \le x_{(i)} \le 1\]

or

\[-0.5 \le x_{(i)} \le 0.5\]

These aren’t exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value of an input variable from its values, resulting in a new average of zero for that variable. To implement both of these techniques, adjust your input values as shown in this formula:

\[x_i := \dfrac{x_i - \mu_i}{s_i}\]

Where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max − min) or, alternatively, the standard deviation.

Note that dividing by the range and dividing by the standard deviation give different results. The quizzes in this course use the range; the programming exercises use the standard deviation. For example, if $x_i$ represents housing prices with a range of 100 to 2000 and a mean value of 1000, then:

\[x_i := \dfrac{price-1000}{1900}\]
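
A sketch of both techniques in NumPy, under the assumption that features are the columns of `X`; the helper name and the `use_std` switch are illustrative:

```python
import numpy as np

def normalize_features(X, use_std=True):
    mu = X.mean(axis=0)                                       # per-feature mean
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s   # keep mu and s to scale new inputs the same way

# Housing-price example from the text: range 100..2000, mean 1000
price = np.array([100.0, 1000.0, 2000.0])
print((price - 1000.0) / 1900.0)   # scaled to roughly [-0.5, 0.5]
```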

Learning rate

Make a plot with the number of iterations on the $x$-axis and the cost function $J(\theta)$ on the $y$-axis, evaluated after each iteration of gradient descent. If $J(\theta)$ ever increases, you probably need to decrease $\alpha$.

Automatic convergence test: declare convergence if $J(\theta)$ decreases by less than $E$ in one iteration, where $E$ is some small value such as $10^{−3}$. However, in practice it is difficult to choose this threshold value.

It has been proven that if learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
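
Below is a minimal sketch combining both checks, assuming the same design-matrix conventions as above; the function name and the `eps` default are illustrative. Note that if $J(\theta)$ ever increases, the right response is to decrease $\alpha$, not to declare convergence:

```python
import numpy as np

def gradient_descent_monitored(X, y, alpha=0.01, num_iters=10000, eps=1e-3):
    m, n = X.shape
    theta = np.zeros(n)
    history = []                                      # J(theta) per iteration
    for _ in range(num_iters):
        errors = X @ theta - y
        history.append((errors ** 2).sum() / (2 * m))  # current cost J(theta)
        theta = theta - alpha / m * (X.T @ errors)
        # declare convergence if J decreased by less than eps in one iteration
        # (the 0 <= guard avoids stopping when J actually increased)
        if len(history) > 1 and 0 <= history[-2] - history[-1] < eps:
            break
    return theta, history   # plot history against iterations to diagnose alpha
```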

To summarize:

- If $\alpha$ is too small: convergence is slow.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.

Normal equation

Gradient descent gives one way of minimizing $J$. Let’s discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In the “Normal Equation” method, we will minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$’s and setting them to zero. This allows us to find the optimal $\theta$ without iteration. The normal equation formula is given below:

\[\theta = (X^T X)^{-1}X^T y\]

There is no need to do feature scaling with the normal equation.
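
A one-line NumPy sketch of the formula; using `np.linalg.pinv` rather than `inv` is an implementation choice here (it also tolerates a singular $X^TX$), not something the text prescribes:

```python
import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```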

The following is a comparison of gradient descent and the normal equation:

| Gradient descent | Normal equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^TX$ |
| Works well when $n$ is large | Slow if $n$ is very large |

With the normal equation, computing the inversion has complexity $\mathcal{O}(n^3)$. So if we have a very large number of features, the normal equation will be slow. In practice, when $n$ exceeds $10\,000$ it might be a good time to switch from the normal equation to an iterative method.
