The Cost function for a single training example is called the Loss function $\mathcal{L}$. For the squared-error cost of linear regression:
\[\mathcal{L}\equiv\frac{1}{2}\left(h_\theta\left(x^{(i)}\right)-y^{(i)}\right)^2 \implies J(\theta)=\frac{1}{m}\sum_{i=1}^m\mathcal{L}\]
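As a quick numeric sketch (the predictions and labels here are made up for illustration), $J(\theta)$ is just the average of the per-example losses:

```python
import numpy as np

# Hypothetical predictions h_theta(x^(i)) and labels y^(i) for m = 3 examples
h = np.array([0.9, 0.2, 0.7])
y = np.array([1.0, 0.0, 1.0])

loss = 0.5 * (h - y) ** 2   # squared-error loss L for each example
J = loss.mean()             # J(theta) = (1/m) * sum of the losses
print(loss)                 # [0.005 0.02  0.045]
print(J)                    # ~0.0233
```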
Loss Function

For logistic regression, the Loss function for a single training example is
\[\begin{equation*} \mathcal{L}= \begin{cases} -\log(h_\theta(x)), & \text{if } y=1\\ -\log(1-h_\theta(x)), & \text{if } y=0 \end{cases} \end{equation*}\]

This cost function has some desirable properties (the sketch after the list illustrates them numerically):
- For $y=1$
  - $h_\theta(x)\to1 \implies \mathcal{L}\to0$. This is what we want: the cost gets increasingly smaller as $h_\theta(x)\to y$, and there is no cost when $h_\theta(x) = y = 1$
  - $h_\theta(x) \to 0 \implies \mathcal{L}\to\infty$. This captures the intuition that if $y=1$ but the model predicts $P(y=1\mid x;\theta)\to0$, the algorithm is penalized by a large cost
- For $y=0$
  - $h_\theta(x)\to1 \implies \mathcal{L}\to\infty$. There is a big cost associated with predicting 1 when $y=0$
  - $h_\theta(x)\to0 \implies \mathcal{L}\to0$. There is no cost associated with predicting 0 when $y=0$
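The following sketch evaluates the piecewise loss at a few probe values to show both limits:

```python
import numpy as np

def loss(h, y):
    # Piecewise log loss: -log(h) if y = 1, -log(1 - h) if y = 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

for h in (0.999, 0.5, 0.001):
    print(f"h={h}: loss(y=1)={loss(h, 1):.3f}, loss(y=0)={loss(h, 0):.3f}")
# h=0.999: loss(y=1)=0.001, loss(y=0)=6.908  -> tiny cost when right, huge when wrong
# h=0.001: loss(y=1)=6.908, loss(y=0)=0.001  -> mirrored behavior for y = 0
```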
Simplified cost function
Since $y\in\{0,1\}$ we can write the $\mathcal{L}$ function in a simpler way, compressing the two cases into one equation.
\[\mathcal{L}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))\]

When $y=1$:
\[\begin{align} \mathcal{L}(h_\theta(x),y)&=-1\cdot\log(h_\theta(x))-0\cdot\log(1-h_\theta(x))\\ &=-\log(h_\theta(x)) \end{align}\]

When $y=0$:
\[\begin{align} \mathcal{L}(h_\theta(x),y)&=-0\cdot\log(h_\theta(x))-1\cdot\log(1-h_\theta(x))\\ &=-\log(1-h_\theta(x)) \end{align}\]
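A small check (a sketch with arbitrary probe values) confirming that the compressed expression reproduces both branches of the piecewise definition:

```python
import numpy as np

def piecewise_loss(h, y):
    return -np.log(h) if y == 1 else -np.log(1 - h)

def simplified_loss(h, y):
    # One expression covering both cases: -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert np.isclose(piecewise_loss(h, y), simplified_loss(h, y))
```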
Cost function

Now that we have a more compact way of writing the Loss function, we can write the Cost function for the whole dataset:
\[\begin{align} J(\theta)&=\frac{1}{m}\sum_{i=1}^m\mathcal{L}\left(h_\theta \left(x^{(i)} \right),y^{(i)}\right)\\ &=-\frac{1}{m}\left[\sum_{i=1}^my^{(i)}\log\left(h_\theta(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] \end{align}\]

Although there are other cost functions that could be used, this one can be derived from statistics using the principle of maximum likelihood estimation.
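A direct, non-vectorized sketch of this cost function (assuming the sigmoid hypothesis $h_\theta(x)=g(\theta^Tx)$ that this section uses; a vectorized version appears later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta): the average log loss over the m training examples.

    X is an (m, n) design matrix, y an (m,) vector of 0/1 labels.
    """
    m = len(y)
    J = 0.0
    for i in range(m):
        h_i = sigmoid(X[i] @ theta)  # h_theta(x^(i))
        J += -y[i] * np.log(h_i) - (1 - y[i]) * np.log(1 - h_i)
    return J / m
```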
In order to estimate the parameters with this cost function, we have to find the values of $\theta$ that minimize $J(\theta)$.
Gradient descent
To minimize $J(\theta)$ we are going to use gradient descent
\[\begin{equation*} \theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta) \end{equation*} \label{eq:gdescent} \tag{1}\]

Where we repeat $\eqref{eq:gdescent}$ for every element $\theta_j$ of the parameter vector $\theta$, updating all the parameters simultaneously (i.e., only after all the new values have been computed).
Computing the partial derivative of $J(\theta)$ with respect to $\theta_j$ we have:
\[\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}\]

Plugging the derived term into the gradient descent update we obtain
\[\theta_j := \theta_j-\frac{\alpha}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}\]

This looks identical to gradient descent for linear regression; however, the definition of $h_\theta(x)$ has changed and is now $h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$
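A minimal sketch of the full loop (the learning rate and iteration count are arbitrary defaults, not from the text); note that every $\theta_j$ is computed from the same old $\theta$ before any component is overwritten:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeats theta_j := theta_j - (alpha/m) * sum_i (h_i - y_i) * x_j^(i)
    for every j, updating all components of theta simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example
        grad = np.array([np.mean((h - y) * X[:, j]) for j in range(n)])
        theta = theta - alpha * grad  # simultaneous update of every theta_j
    return theta
```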
Vectorization
Vectorized implementations of the cost function and of the gradient descent update are:
Cost function
\[\begin{align} &h = g(X\theta)\\ &J(\theta)=\frac{1}{m}\left(-y^T\log(h)-(1-y)^T\log(1-h)\right) \end{align}\]

Gradient descent
\[\theta:=\theta-\frac{\alpha}{m}X^T\left(g(X\theta)-\vec{y}\right)\]
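A NumPy transcription of the two vectorized formulas (a sketch assuming $X$ is the design matrix with one row per example and $g$ is the sigmoid):

```python
import numpy as np

def g(z):
    # Sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

def cost_vectorized(theta, X, y):
    h = g(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y)

def gradient_step(theta, X, y, alpha):
    # theta := theta - (alpha/m) * X^T (g(X theta) - y)
    return theta - (alpha / len(y)) * (X.T @ (g(X @ theta) - y))
```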
Advanced Optimization

By applying some concepts of optimization, we can fit the logistic regression parameters much more efficiently than with plain gradient descent and make the algorithm scale better to large datasets.
Until now we have chosen to use the gradient descent optimization algorithm. However, there are other, more sophisticated optimization algorithms, such as conjugate gradient, BFGS, and L-BFGS.
These algorithms, at the cost of being more complex, share a series of advantages (see the sketch after the list):
- They remove the need to manually pick a learning rate $\alpha$
- They are often faster than gradient descent
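A minimal sketch of this idea using SciPy's general-purpose optimizers (the toy dataset is made up for illustration; `method="L-BFGS-B"` works the same way):

```python
import numpy as np
from scipy.optimize import minimize

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = g(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / len(y)

def grad(theta, X, y):
    return X.T @ (g(X @ theta) - y) / len(y)

# Made-up toy data: the first column of X is the intercept term
X = np.array([[1.0, 0.5], [1.0, 2.3], [1.0, 1.0], [1.0, 1.8]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# No alpha to hand-tune: the optimizer picks its own step sizes
result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=grad, method="BFGS")
print(result.x)  # the fitted parameters theta
```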