Neural networks learn their own features
Let’s take the network from the previous lesson and focus on its last two layers.
What is left of this neural network is simply logistic regression, where the output unit (or logistic regression unit) builds the hypothesis $\hat{y}$
\[\hat{y} = \sigma \left(w_{10}^{[2]}a_0^{[1]}+w_{11}^{[2]}a_1^{[1]}+w_{12}^{[2]}a_2^{[1]}+ w_{13}^{[2]}a_3^{[1]} \right)\]

where the features fed into logistic regression are the values in $a^{[1]}$. And here resides the fundamental difference between neural networks and logistic regression: the features $a^{[1]}$ are themselves learned as functions of the input $x$, with another set of parameters $W^{[1]}$.
The neural network, instead of being constrained to feed the features $x$ to logistic regression, learns its own features $a^{[1]}$ to feed into logistic regression. Depending on the parameters $W^{[1]}$, it can learn complex features and produce a better hypothesis than you could obtain if you were constrained to use the features $x$, or even if you manually engineered higher-order polynomial features from $x$.
Neural networks can have different numbers and sizes of hidden layers, and the way a neural network's units are connected is called its architecture.
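As a minimal numpy sketch of the computation described above (the parameters here are random placeholders, not values from the lesson): the hidden layer maps $x$ to the learned features $a^{[1]}$, and the output unit then runs logistic regression on those features.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation sigma(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Two-layer pass: learn features a1 from x, then run logistic regression on a1."""
    a1 = sigmoid(W1 @ np.concatenate(([1.0], x)))       # learned features a^{[1]} (bias unit prepended)
    y_hat = sigmoid(W2 @ np.concatenate(([1.0], a1)))   # output unit: logistic regression on a^{[1]}
    return a1, y_hat

# Placeholder parameters: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # hidden-layer weights W^{[1]} (bias column included)
W2 = rng.normal(size=(1, 4))   # output-unit weights W^{[2]} over [1, a1, a2, a3]

a1, y_hat = forward(np.array([0.5, -1.2]), W1, W2)
print("a^{[1]} =", a1, " y_hat =", y_hat)
```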
How neural networks build complex non-linear functions
In this section we will explain how a neural network can build relatively complex non-linear functions.
Let’s take a non-linear classification example like that depicted below, where panel A is just a simplified version of panel B.
This is a non-linear classification example modeled by the logical XNOR function:

\[x_1 \; \text{XNOR} \; x_2 = \text{NOT} \; (x_1 \; \text{XOR} \; x_2)\]

Logical AND function
Let’s look at a neural network that can calculate the logical $\text{AND}$ function.
\[\begin{align} &x_1,x_2\in \lbrace 0,1 \rbrace\\ &y= x_1 \wedge x_2\\ &W^{[1]} = \begin{bmatrix}-30\\20\\20\end{bmatrix} \end{align}\]

so that
\[\hat{y} = \sigma(-30+20x_1+20x_2) \label{eq:h} \tag{1}\]

Since the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$ saturates near $0$ for large negative $z$ and near $1$ for large positive $z$, the output of $\eqref{eq:h}$ is
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(-30) \approx 0$ |
0 | 1 | $\sigma(-10) \approx 0$ |
1 | 0 | $\sigma(-10) \approx 0$ |
1 | 1 | $\sigma(10) \approx 1$ |
which is exactly the truth table of $x_1 \wedge x_2$.
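A quick numpy check of this table (a sketch, not code from the lesson), evaluating equation $\eqref{eq:h}$ on all four inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# AND unit from equation (1): y_hat = sigma(-30 + 20*x1 + 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_hat = sigmoid(-30 + 20 * x1 + 20 * x2)
    print(x1, x2, f"{y_hat:.6f}")   # ~0, ~0, ~0, ~1
```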
Logical OR function
The following network and table show instead $x_1 \vee x_2$, with weights $W^{[1]} = \begin{bmatrix}-10\\20\\20\end{bmatrix}$ so that $\hat{y} = \sigma(-10+20x_1+20x_2)$.
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(-10) \approx 0$ |
0 | 1 | $\sigma(10) \approx 1$ |
1 | 0 | $\sigma(10) \approx 1$ |
1 | 1 | $\sigma(30) \approx 1$ |
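A quick check of the OR unit $\hat{y} = \sigma(-10 + 20x_1 + 20x_2)$ (again a plain numpy sketch, not code from the lesson):

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# OR unit: y_hat = sigma(-10 + 20*x1 + 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, f"{sigmoid(-10 + 20 * x1 + 20 * x2):.6f}")   # ~0, ~1, ~1, ~1
```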
Logical NOT function
$x_1$ | $\hat{y}$ |
---|---|
0 | $\sigma(10) \approx 1$ |
1 | $\sigma(-10) \approx 0$ |
Logical (NOT $x_1$) AND (NOT $x_2$) function
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(10) \approx 1$ |
0 | 1 | $\sigma(-10) \approx 0$ |
1 | 0 | $\sigma(-10) \approx 0$ |
1 | 1 | $\sigma(-30) \approx 0$ |
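The two negation-based units follow the same pattern; their weights can be read off the $\sigma(\cdot)$ arguments in the two tables above, giving $\hat{y} = \sigma(10 - 20x_1)$ for NOT and $\hat{y} = \sigma(10 - 20x_1 - 20x_2)$ for (NOT $x_1$) AND (NOT $x_2$). A quick numpy check:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# NOT unit: y_hat = sigma(10 - 20*x1)
for x1 in (0, 1):
    print("NOT", x1, f"{sigmoid(10 - 20 * x1):.6f}")   # ~1, ~0

# (NOT x1) AND (NOT x2) unit: y_hat = sigma(10 - 20*x1 - 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("NOT-AND-NOT", x1, x2, f"{sigmoid(10 - 20 * x1 - 20 * x2):.6f}")   # ~1, ~0, ~0, ~0
```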
Logical XNOR function

Combining the previous units into a two-layer network, the hidden layer computes $a_1^{[1]} = x_1 \wedge x_2$ and $a_2^{[1]} = (\text{NOT} \; x_1) \wedge (\text{NOT} \; x_2)$, and the output unit computes $\hat{y} = a_1^{[1]} \vee a_2^{[1]}$, which is exactly $x_1 \; \text{XNOR} \; x_2$:
$x_1$ | $x_2$ | $a_1^{[1]}$ | $a_2^{[1]}$ | $\hat{y}$ |
---|---|---|---|---|
0 | 0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 |
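A sketch of the full network in numpy, wiring the three units together exactly as in the table above:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def xnor_net(x1, x2):
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)      # hidden unit 1: x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)       # hidden unit 2: (NOT x1) AND (NOT x2)
    y_hat = sigmoid(-10 + 20 * a1 + 20 * a2)   # output unit: a1 OR a2
    return a1, a2, y_hat

# Reproduces the XNOR table above
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a1, a2, y_hat = xnor_net(x1, x2)
    print(x1, x2, f"{a1:.2f}", f"{a2:.2f}", f"{y_hat:.2f}")
```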
Neural network multi-class classification
Multiclass classification in neural networks is an extension of the one-vs-all method. Let’s say that we want to build an image processing algorithm that can distinguish between four classes of vehicles. We will build a neural network with 4 output units, each of which models one of the output classes
\[h_\Theta(x) = \begin{bmatrix} P(y_1 \mid x, \Theta) \\ P(y_2 \mid x, \Theta) \\ P(y_3 \mid x, \Theta)\\ P(y_4 \mid x, \Theta) \end{bmatrix}\]

so that $\hat{y}_i$ can be one of the following
\[\hat{y}_i \approx \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\0\\0\\1\end{bmatrix}\]
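A minimal sketch of such an output layer (the shapes and random parameters below are illustrative placeholders, not values from the lesson): the network produces four sigmoid outputs, one per class, and the predicted class is the index of the largest one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, Theta1, Theta2):
    """Forward pass with 4 sigmoid output units, one per vehicle class."""
    a1 = sigmoid(Theta1 @ np.concatenate(([1.0], x)))    # hidden features
    h = sigmoid(Theta2 @ np.concatenate(([1.0], a1)))    # h_Theta(x): 4 class scores
    return h, int(np.argmax(h))                          # predicted class = largest output

# Placeholder shapes: 5 input features, 3 hidden units, 4 classes.
rng = np.random.default_rng(1)
Theta1 = rng.normal(size=(3, 6))
Theta2 = rng.normal(size=(4, 4))

h, c = predict(rng.normal(size=5), Theta1, Theta2)
print("h_Theta(x) =", h, "-> predicted class", c)
```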