Sequence Models

Models like Recurrent Neural Networks (RNNs) have transformed many fields, such as speech recognition and natural language processing. Many different types of sequence problems can be treated as supervised learning problems.

Notation

Suppose you want a sequence model that extracts character names from a sentence $x$ and classifies each word as a character name (1) or not (0). This is an example of a Natural Language Processing task.

x:	Rand is a friend of Perrin and they both live in the twin rivers
y:	1    0  0 0      0  1      0   0    0    0    0  0   0    0

The input is a sequence of 14 words ($T_x=14$). They are represented by a set of 14 features, which we refer to as $x^{\langle 1 \rangle}, \dots ,x^{\langle 14 \rangle}$ or, more generally, as $x^{\langle t \rangle}$, where $t$ implies that they are temporal sequences, regardless of whether the sequentiality is actually in the time dimension or not.

Similarly, the elements of the output sequence are referred to as $y^{\langle t \rangle}$ and the output sequence has length $T_y$. In this example $T_x = T_y = 14$ ($y^{\langle 1 \rangle}, \dots,y^{\langle 14 \rangle}$), but the two can be different, i.e. the lengths of the input and output sequences don't need to match. For example, a sentiment analysis network has $T_x \geq 1$ and $T_y = 1$.

Each training example $x^{(i)}$ is a sequence of input features. To refer to feature $t$ of training example $i$ we use $x^{(i)\langle t \rangle}$. Each training example might have a different input sequence length $T_x$; to refer to the input sequence length of training example $i$ we use $T_x^{(i)}$. Similarly, $y^{(i)\langle t \rangle}$ refers to the $t$-th element in the output sequence of the $i$-th training example and $T_y^{(i)}$ refers to the length of the output sequence of the $i$-th training example.
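As a concrete illustration of this notation, here is a minimal Python sketch with a made-up toy dataset (and 0-based indexing, unlike the 1-based indexing of the formulas) showing how $x^{(i)\langle t \rangle}$, $T_x^{(i)}$, $y^{(i)\langle t \rangle}$ and $T_y^{(i)}$ map onto nested lists:

```python
# Toy training set of two tokenized sentences with different lengths
# (sentences and labels are made up for illustration only).
X = [
    ["rand", "is", "a", "friend", "of", "perrin"],              # x^(1)
    ["perrin", "lives", "in", "the", "twin", "rivers", "too"],  # x^(2)
]
Y = [
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
]

i, t = 0, 5            # Python is 0-based: i=0 is the 1st example, t=5 the 6th word
print(X[i][t])         # x^(i)<t>  -> "perrin"
print(len(X[i]))       # T_x^(i)   -> 6
print(Y[i][t])         # y^(i)<t>  -> 1
print(len(Y[i]))       # T_y^(i)   -> 6
```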

In the example of Natural Language Processing, one of the simplest representations of individual words $x^{\langle t \rangle}$ is based on a vocabulary (or dictionary) $V$. The vocabulary $V$ is the set of all the words that can be used in a representation. Once a dictionary is defined, one way to represent a word, called the one-hot representation, is with a vector of the same size as the dictionary whose only non-zero value is at the index of the word in the vocabulary.

\[V = \begin{bmatrix} \text{a} \\ \vdots \\ \text{is} \\ \vdots \\ \text{Rand} \\ \vdots \\ \text{zulu} \end{bmatrix} \begin{matrix} 1 \\ \vdots \\ 2100 \\ \vdots \\8000 \\ \vdots \\ 10{,}000 \end{matrix} \qquad x^{\langle 1 \rangle}= \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \qquad x^{\langle 2 \rangle}= \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ \vdots \\ 0 \end{bmatrix}\]

In this representation each $x^{\langle t \rangle}$ is a one-hot vector, and the task is, given this representation of $x$, to learn a mapping to the target output $y$. This is treated as a supervised learning problem trained on labeled data $(x,y)$.
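The following minimal sketch (assuming a toy vocabulary of a few words rather than the 10,000-word dictionary in the example above) builds the one-hot vector for a given word:

```python
import numpy as np

# Toy vocabulary; a real dictionary would contain ~10,000 words.
# The position of a word in this list is its index in the vocabulary V.
vocabulary = ["a", "and", "friend", "is", "of", "rand", "rivers", "the", "twin"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the one-hot column vector representing `word`."""
    vector = np.zeros((len(vocabulary), 1))
    vector[word_to_index[word.lower()]] = 1
    return vector

# x<1> for the example sentence: all zeros except a 1 at the index of "Rand".
x1 = one_hot("Rand")
print(x1.ravel())   # [0. 0. 0. 0. 0. 1. 0. 0. 0.]
```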

Recurrent Neural Network Model

The task defined above is to produce a mapping $x \to y$, where $x$ is a sequence of $T_x$ one-hot vectors and $y$ is a sequence of $T_y$ output values. With a regular neural network we would run into a number of problems: inputs and outputs can have different lengths in different examples, and features learned at one position in the sequence would not be shared with other positions.

An RNN has none of these problems. When modelling this data with a recurrent neural network we would take the first word of a sentence, $x^{\langle 1 \rangle}$, and feed it into a layer of the RNN. This layer would predict whether the first word is a person's name or not, producing the output $\hat{y}^{\langle 1 \rangle}$. In the second step, the second word $x^{\langle 2 \rangle}$ is passed to the layer together with the activation of the previous step $a^{\langle 1 \rangle}$, to produce $\hat{y}^{\langle 2 \rangle}$. At each time-step the RNN takes the input $x^{\langle t \rangle}$ and the activation of the previous time-step $a^{\langle t-1 \rangle}$ to produce a prediction $\hat{y}^{\langle t \rangle}$. The first time-step is usually fed a vector of zeros (or sometimes random values) as $a^{\langle 0 \rangle}$.

Figure 125. Two equivalent representations of a recurrent neural network (RNN) model. In the left panel (A), each box represents a time step. One element of the input sequence $x^{\langle t \rangle}$ is fed to a hidden layer, which takes as an additional input the activations of the previous step $a^{\langle t-1 \rangle}$. Each step produces as output a vector $y^{\langle t \rangle}$. In the right panel (B) the same process is represented as the layer being fed the input $x$ and weighting it with a set of weights $W$ to produce the output $y$, in a loop over each time step $t$.

In an RNN a single set of parameters ($W_{ax}$) governs the connection from $x^{\langle t \rangle}$ to the hidden layer at every time-step. Similarly, a single set of parameters ($W_{aa}$) governs the connection from one time-step to the next, and a single set of parameters ($W_{ya}$) governs the connection from the hidden layer to the output $\hat{y}^{\langle t \rangle}$.

A weakness of the RNN architecture depicted in Figure 125 is that it only uses information that appears earlier in the sequence to make predictions. For example, when predicting $\hat{y}^{\langle 3 \rangle}$ it doesn't use information from $x^{\langle 4 \rangle}, \dots, x^{\langle T_x \rangle}$. Suppose we have the sentences:

He said, Teddy Roosevelt was a great president
He said, Teddy bears are on sale!

In order to decide whether the word Teddy is a person's name, it would be much more useful to also consider information from the following words rather than only information from the first two words.

This is addressed by the Bidirectional RNN (BRNN), whereas the simpler unidirectional RNN architecture considers only information from earlier time-steps.

Forward propagation

The calculations performed in the RNN architecture discussed up to this point start with the all-zero vector $a^{\langle 0 \rangle}$. The computation then proceeds with $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$, and so on.

\[\begin{split} &a^{\langle 0 \rangle} = \vec{0}\\ &a^{\langle 1 \rangle} = g\left(W_{aa}a^{\langle 0 \rangle} + W_{ax}x^{\langle 1 \rangle} + b_a\right) \\ &\hat{y}^{\langle 1 \rangle} =g\left(W_{ya}a^{\langle 1 \rangle} + b_y \right ) \end{split}\]

Where $W_{ax}$ is the matrix of parameters that maps the input $x$ to the activation $a$; $W_{ya}$ is the matrix of parameters that maps the activation $a$ to the output $\hat{y}$; $W_{aa}$ is the matrix of parameters that maps the activation of the previous step to the activation of the current step. The activation functions $g$ applied to $a$ and $\hat{y}$ are usually different: a $\tanh$ or $\text{ReLU}$ is used for the activations $a$, while a sigmoid (for binary classification) or softmax (for multiclass classification) is used for the output $\hat{y}$. More generally, at each time step:

\[\begin{split} &a^{\langle t \rangle} = g\left(W_{aa}a^{\langle t-1 \rangle} + W_{ax}x^{\langle t \rangle} + b_a\right) \\ &\hat{y}^{\langle t \rangle} =g\left(W_{ya}a^{\langle t \rangle} + b_y \right ) \end{split}\]

Or, in a slightly simplified form:

\[\begin{split} &a^{\langle t \rangle} = g\left(W_a \left [a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right ] + b_a\right) \\ &\hat{y}^{\langle t \rangle} =g\left(W_{y}a^{\langle t \rangle} + b_y \right ) \end{split}\]

Where $W_{a} = \left[W_{aa} \vert W_{ax} \right]$ is a horizontal stack of the two matrices and $\left [a^{\langle t-1 \rangle}, x^{\langle t \rangle} \right ]$ is a vertical stack of the two vectors.
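As a rough sketch of these equations (using NumPy, with made-up illustrative dimensions and randomly initialized weights, not a full implementation), the forward pass over a sequence and the equivalence between the full form and the stacked-matrix form can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y = 9, 5, 1          # illustrative sizes: vocabulary, hidden units, output
W_ax = rng.normal(scale=0.01, size=(n_a, n_x))
W_aa = rng.normal(scale=0.01, size=(n_a, n_a))
W_ya = rng.normal(scale=0.01, size=(n_y, n_a))
b_a, b_y = np.zeros((n_a, 1)), np.zeros((n_y, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def rnn_step(x_t, a_prev):
    """One forward time-step: compute a<t> and y_hat<t> from x<t> and a<t-1>."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = sigmoid(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# Forward propagation over a sequence of T_x random one-hot vectors.
T_x = 14
xs = [np.eye(n_x)[[rng.integers(n_x)]].T for _ in range(T_x)]   # random one-hot columns
a_t = np.zeros((n_a, 1))                                        # a<0> = vector of zeros
for x_t in xs:
    a_t, y_hat_t = rnn_step(x_t, a_t)

# Simplified form: W_a = [W_aa | W_ax] applied to the stacked vector [a<t-1>, x<t>].
W_a = np.hstack([W_aa, W_ax])
a_prev, x_t = np.zeros((n_a, 1)), xs[0]
a_stacked = np.tanh(W_a @ np.vstack([a_prev, x_t]) + b_a)
assert np.allclose(a_stacked, rnn_step(x_t, a_prev)[0])
```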

Backpropagation through time

Backpropagation propagates back through the network the gradient of a loss function $\mathcal{L}$ that measures how far the prediction $\hat{y}$ is from the label $y$. In an RNN we have a step-wise loss function $\mathcal{L}^{\langle t \rangle}$, here the logistic regression loss, also called cross-entropy loss. The losses of the single steps are summed into a sequence-wise loss function $\mathcal{L}$:

\[\begin{split} & \mathcal{L}^{\langle t \rangle} \left(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle} \right) = -y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} - \left(1-y^{\langle t \rangle} \right) \log \left (1-\hat{y}^{\langle t \rangle} \right) \\ & \mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}^{\langle t \rangle} \left(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle} \right) \end{split}\]

Once the loss is computed it is propagated back to earlier time-steps of the RNN in a process referred to as backpropagation through time.
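As a small numerical sketch of the loss above (the probabilities and labels are toy values, not part of the original example):

```python
import numpy as np

def step_loss(y_hat_t, y_t):
    """Cross-entropy loss L<t> for one time-step (binary classification)."""
    return -(y_t * np.log(y_hat_t) + (1 - y_t) * np.log(1 - y_hat_t))

def sequence_loss(y_hats, ys):
    """Sequence-wise loss L: sum of the step-wise losses over the T_y steps."""
    return sum(step_loss(y_hat_t, y_t) for y_hat_t, y_t in zip(y_hats, ys))

# A 3-step toy sequence: predicted probabilities and true labels.
y_hats = [0.9, 0.2, 0.1]
ys     = [1,   0,   0]
print(sequence_loss(y_hats, ys))   # ≈ 0.43
```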

