Neural networks learn their own features
Let’s take the network from the previous lesson and focus on its last two layers.
What is left of this neural network is simply logistic regression, where the output unit (or logistic regression unit) builds the hypothesis $\hat{y}$
\[\hat{y} = \sigma \left(w_{10}^{[2]}a_0^{[1]}+w_{11}^{[2]}a_1^{[1]}+w_{12}^{[2]}a_2^{[1]}+ w_{13}^{[2]}a_3^{[1]} \right)\]

where the features fed into logistic regression are the values in $a^{[1]}$. And here resides the fundamental difference between neural networks and logistic regression: the features $a^{[1]}$ are themselves learned as functions of the input $x$, with another set of parameters $W^{[1]}$.
The neural network, instead of being constrained to feed the features $x$ to logistic regression, learns its own features $a^{[1]}$ to feed into logistic regression. Depending on the parameters $W^{[1]}$, it can learn complex features and produce a better hypothesis than you could obtain if you were constrained to use the features $x$, or even if you manually engineered higher-order polynomial features from $x$.
Neural networks can have different numbers and sizes of hidden layers, and the way a neural network's units are connected is called its architecture.
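As a minimal numpy sketch of the computation described above (the parameters here are random placeholders, not values from the lesson): the hidden layer maps $x$ to the learned features $a^{[1]}$, and the output unit then runs logistic regression on those features.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation sigma(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Two-layer pass: learn features a1 from x, then run logistic regression on a1."""
    a1 = sigmoid(W1 @ np.concatenate(([1.0], x)))       # learned features a^{[1]} (bias unit prepended)
    y_hat = sigmoid(W2 @ np.concatenate(([1.0], a1)))   # output unit: logistic regression on a^{[1]}
    return a1, y_hat

# Placeholder parameters: 2 inputs, 3 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # hidden-layer weights W^{[1]} (bias column included)
W2 = rng.normal(size=(1, 4))   # output-unit weights W^{[2]} over [1, a1, a2, a3]

a1, y_hat = forward(np.array([0.5, -1.2]), W1, W2)
print("a^{[1]} =", a1, " y_hat =", y_hat)
```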
How neural networks build complex non-linear functions
In this section we will explain how a neural network can build relatively complex non-linear functions.
Let’s take a non-linear classification example like that depicted below, where panel A is just a simplified version of panel B.
This is a non-linear classification example modeled by the logical XNOR function:

\[x_1 \; \text{XNOR} \; x_2 = \text{NOT} \; (x_1 \; \text{XOR} \; x_2)\]

Logical AND function
Let’s look at a neural network that can calculate the logical $\text{AND}$ function.
\[\begin{align} &x_1,x_2\in \lbrace 0,1 \rbrace\\ &y= x_1 \wedge x_2\\ &W^{[1]} = \begin{bmatrix}-30\\20\\20\end{bmatrix} \end{align}\]

so that
\[\hat{y} = \sigma(-30+20x_1+20x_2) \label{eq:h} \tag{1}\]

Since the sigmoid activation function $\sigma(z) = \frac{1}{1+e^{-z}}$ saturates near $0$ for large negative $z$ and near $1$ for large positive $z$, the output of $\eqref{eq:h}$ is
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(-30) \approx 0$ |
0 | 1 | $\sigma(-10) \approx 0$ |
1 | 0 | $\sigma(-10) \approx 0$ |
1 | 1 | $\sigma(10) \approx 1$ |
which is exactly the truth table of $x_1 \wedge x_2$.
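A quick numpy check of this table (a sketch, not code from the lesson), evaluating equation $\eqref{eq:h}$ on all four inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# AND unit from equation (1): y_hat = sigma(-30 + 20*x1 + 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_hat = sigmoid(-30 + 20 * x1 + 20 * x2)
    print(x1, x2, f"{y_hat:.6f}")   # ~0, ~0, ~0, ~1
```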
Logical OR function
The following network and table show instead $x_1 \vee x_2$, with weights $W^{[1]} = \begin{bmatrix}-10\\20\\20\end{bmatrix}$ so that $\hat{y} = \sigma(-10+20x_1+20x_2)$.
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(-10) \approx 0$ |
0 | 1 | $\sigma(10) \approx 1$ |
1 | 0 | $\sigma(10) \approx 1$ |
1 | 1 | $\sigma(30) \approx 1$ |
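A quick check of the OR unit $\hat{y} = \sigma(-10 + 20x_1 + 20x_2)$ (again a plain numpy sketch, not code from the lesson):

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# OR unit: y_hat = sigma(-10 + 20*x1 + 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, f"{sigmoid(-10 + 20 * x1 + 20 * x2):.6f}")   # ~0, ~1, ~1, ~1
```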
Logical NOT function
$x_1$ | $\hat{y}$ |
---|---|
0 | $\sigma(10) \approx 1$ |
1 | $\sigma(-10) \approx 0$ |
Logical (NOT $x_1$) AND (NOT $x_2$) function
$x_1$ | $x_2$ | $\hat{y}$ |
---|---|---|
0 | 0 | $\sigma(10) \approx 1$ |
0 | 1 | $\sigma(-10) \approx 0$ |
1 | 0 | $\sigma(-10) \approx 0$ |
1 | 1 | $\sigma(-30) \approx 0$ |
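The two negation-based units follow the same pattern; their weights can be read off the $\sigma(\cdot)$ arguments in the two tables above, giving $\hat{y} = \sigma(10 - 20x_1)$ for NOT and $\hat{y} = \sigma(10 - 20x_1 - 20x_2)$ for (NOT $x_1$) AND (NOT $x_2$). A quick numpy check:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# NOT unit: y_hat = sigma(10 - 20*x1)
for x1 in (0, 1):
    print("NOT", x1, f"{sigmoid(10 - 20 * x1):.6f}")   # ~1, ~0

# (NOT x1) AND (NOT x2) unit: y_hat = sigma(10 - 20*x1 - 20*x2)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("NOT-AND-NOT", x1, x2, f"{sigmoid(10 - 20 * x1 - 20 * x2):.6f}")   # ~1, ~0, ~0, ~0
```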
Logical XNOR function

Combining the previous units into a two-layer network, the hidden layer computes $a_1^{[1]} = x_1 \wedge x_2$ and $a_2^{[1]} = (\text{NOT} \; x_1) \wedge (\text{NOT} \; x_2)$, and the output unit computes $\hat{y} = a_1^{[1]} \vee a_2^{[1]}$, which is exactly $x_1 \; \text{XNOR} \; x_2$:
$x_1$ | $x_2$ | $a_1^{[1]}$ | $a_2^{[1]}$ | $\hat{y}$ |
---|---|---|---|---|
0 | 0 | 0 | 1 | 1 |
0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 0 | 1 |
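A sketch of the full network in numpy, wiring the three units together exactly as in the table above:

```python
import numpy as np
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def xnor_net(x1, x2):
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)      # hidden unit 1: x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)       # hidden unit 2: (NOT x1) AND (NOT x2)
    y_hat = sigmoid(-10 + 20 * a1 + 20 * a2)   # output unit: a1 OR a2
    return a1, a2, y_hat

# Reproduces the XNOR table above
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a1, a2, y_hat = xnor_net(x1, x2)
    print(x1, x2, f"{a1:.2f}", f"{a2:.2f}", f"{y_hat:.2f}")
```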
Neural network multi-class classification
Multiclass classification in neural networks is an extension of the one-vs-all method. Let’s say that we want to build an image processing algorithm that can distinguish between four classes of vehicles. We will build a neural network with 4 output units, each of which models one of the output classes
\[h_\Theta(x) = \begin{bmatrix} P(y_1 \mid x, \Theta) \\ P(y_2 \mid x, \Theta) \\ P(y_3 \mid x, \Theta)\\ P(y_4 \mid x, \Theta) \end{bmatrix}\]

so that $\hat{y}_i$ can be one of the following
\[\hat{y}_i \approx \begin{bmatrix}1\\0\\0\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\1\\0\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\0\\1\\0\end{bmatrix} \;, \; \begin{bmatrix}0\\0\\0\\1\end{bmatrix}\]
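A minimal sketch of such an output layer (the shapes and random parameters below are illustrative placeholders, not values from the lesson): the network produces four sigmoid outputs, one per class, and the predicted class is the index of the largest one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, Theta1, Theta2):
    """Forward pass with 4 sigmoid output units, one per vehicle class."""
    a1 = sigmoid(Theta1 @ np.concatenate(([1.0], x)))    # hidden features
    h = sigmoid(Theta2 @ np.concatenate(([1.0], a1)))    # h_Theta(x): 4 class scores
    return h, int(np.argmax(h))                          # predicted class = largest output

# Placeholder shapes: 5 input features, 3 hidden units, 4 classes.
rng = np.random.default_rng(1)
Theta1 = rng.normal(size=(3, 6))
Theta2 = rng.normal(size=(4, 4))

h, c = predict(rng.normal(size=5), Theta1, Theta2)
print("h_Theta(x) =", h, "-> predicted class", c)
```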