Object detection

Object detection is one of the areas of computer vision that has improved the most in the last few years. The first step towards object detection is object localization.

Object localization

Image classification is the problem of predicting the class of an object shown in a picture. Image classification with localization produces a predicted label for the image and also finds the exact location of the labeled object within the image. Finding the location of an object means defining a bounding box that contains the recognized object. Finally, an object detection task detects multiple objects (possibly of different classes) in an image, together with all their locations.

To train a network on an object localization task we can build on the image classification architectures that we have seen so far. Suppose we have a network for an image classification task that needs to distinguish 3 classes: flower, leaf, background (neither of the two). We would have input images fed into a CNN with some convolutional layers and some final fully connected layers that terminate with a softmax layer with 3 output units. In order to train the network to also localize the classified object we need 4 additional output units: $b_x$, $b_y$, $b_h$, $b_w$. These 4 numbers parameterize the bounding box of the detected object.

png
Figure 102. Simplified architecture of a CNN trained on a task of object localization. It takes as input an image, whose representation passes through some convolutional layers and finally through some densely connected layers. The output layer has 4 units dedicated to parameterizing the bounding box for localization: $b_x$, $b_y$ (the coordinates of the midpoint of the bounding box) and $b_w$, $b_h$ (the width and height of the bounding box)
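To make this architecture concrete, here is a minimal PyTorch sketch of such a network. The backbone layers, the $224 \times 224$ input size and the hidden sizes are illustrative assumptions, not values taken from the figure; the only constraint kept from the text is that the output holds 3 class scores plus the 4 bounding box numbers.

```python
import torch
import torch.nn as nn

class LocalizationCNN(nn.Module):
    """Small convolutional backbone followed by fully connected layers whose
    output holds both the class scores (flower, leaf, background) and the
    4 bounding-box numbers b_x, b_y, b_h, b_w."""

    def __init__(self, n_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(32 * 56 * 56, 256), nn.ReLU())
        self.class_head = nn.Linear(256, n_classes)  # fed to a softmax
        self.bbox_head = nn.Linear(256, 4)           # b_x, b_y, b_h, b_w

    def forward(self, x):                            # x: (batch, 3, 224, 224)
        h = self.fc(self.backbone(x))
        return self.class_head(h), self.bbox_head(h)
```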

A CNN trained on a localization task needs a labeled training set with the four parameters of the bounding box defined for each example. In particular, the label vector $y$ will be:

\[y= \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \end{bmatrix}\]

where $p_c$ indicates whether the picture contains an object (either 1 or 0), $b_x, b_y, b_w, b_h$ are the parameters of the bounding box and $c_1$ and $c_2$ are 1 if the object in the picture is of class 1 or 2 respectively, and 0 otherwise. When $p_c=0$ we can ignore all other values of the label vector $y$, since the only thing we care to learn is that the image contains no object. This is reflected in the loss function $\mathcal{L}(\hat{y},y)$:

\[\mathcal{L}(\hat{y},y)= \begin{cases} \sum_{i=1}^{7}(\hat{y}_i - y_i)^2 & \text{if } y_1 = 1 \\ (\hat{y}_1 - y_1)^2 & \text{if } y_1 = 0 \\ \end{cases}\]

where $y_1 = p_c$ and the sum runs over all 7 components of the label vector. Here the squared error loss is used for all labels for simplicity; in practice, squared error is used for the bounding box parameters and $p_c$, while log loss is usually employed for the 2 class labels, although squared error would probably work just fine.
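The following is a minimal sketch of this loss in PyTorch, under the label layout above (index 0 is $p_c$, followed by the 4 bounding box parameters and the 2 class indicators); the example labels and predictions are made-up values for illustration.

```python
import torch

def localization_loss(y_hat, y):
    """Squared error over all 7 components when an object is present
    (y_1 = p_c = 1), squared error on p_c alone when it is absent."""
    p_c = y[..., 0]
    sq_err = (y_hat - y) ** 2
    loss_obj = sq_err.sum(dim=-1)  # p_c, b_x, b_y, b_h, b_w, c_1, c_2
    loss_noobj = sq_err[..., 0]    # only the p_c component matters
    return torch.where(p_c == 1, loss_obj, loss_noobj).mean()

y_object = torch.tensor([1.0, 0.5, 0.7, 0.3, 0.4, 1.0, 0.0])    # class-1 object
y_noobject = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # no object
y_hat = torch.tensor([0.9, 0.48, 0.72, 0.33, 0.38, 0.8, 0.2])

print(localization_loss(y_hat, y_object), localization_loss(y_hat, y_noobject))
```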

Landmark detection

By expanding object localization to a more general case, we can train a neural network to output the $x,y$ coordinates of relevant points (landmarks) in an image. Suppose we are building a face recognition application and we want to identify some important features of a face, such as the eyes and the mouth. To train a network on the task of recognizing these landmarks, the label vector $y$ would contain the coordinates of the eyes and of the mouth.

png
Figure 103. Some face landmarks drawn in red over a grayscale face picture

Obviously, training a neural network to output landmark coordinates requires a labeled dataset, in which each landmark has to be laboriously and precisely annotated in every picture.

In Figure 103 we can see a typical set of landmarks for face-oriented tasks. Another example of landmark detection not applied to face pictures is pose detection, where landmarks usually correspond to body joints (e.g. shoulder, knee) that, connected together, form a sort of skeleton of a person.
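As an illustration of how such labels can be laid out, the sketch below builds a label vector for $K$ landmarks: one unit indicating that a face is present, followed by an $(x, y)$ pair per landmark. The landmark names and coordinate values are assumptions for illustration, with coordinates normalized to $[0, 1]$ relative to the image size.

```python
# Hypothetical landmark annotations for one training image.
landmarks = {
    "left_eye":  (0.32, 0.41),
    "right_eye": (0.67, 0.40),
    "mouth":     (0.50, 0.78),
}

y = [1.0]                          # a face is present
for lx, ly in landmarks.values():
    y.extend([lx, ly])             # one (x, y) pair per landmark

print(y)  # [1.0, 0.32, 0.41, 0.67, 0.4, 0.5, 0.78] -> 1 + 2*K output units
```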

Object detection

A convnet can be trained on object detection using a technique called sliding window detection. Suppose we want to train an algorithm for a self-driving car, which should be able to detect multiple types of objects and multiple instances of the same object in a picture (figure below).

png
Figure 104. A frame of a video-feed with multiple cars correctly localized by an object detection algorithm

In order to obtain a neural network capable of object detection, we need to train it on a training set where each example is a picture of a closely cropped car (figure below) and the label simply tells whether the picture contains a car or not.

png
Figure 105. Training set for an object detection algorithm with closely cropped images of the object of interest

The next step would be to train a CNN that takes as input a picture and tells whether the picture contains a car ($\hat{y}=1$) or not ($\hat{y}=0$). Once trained, this CNN is used in a sliding window detection system: the CNN is fed the portion of the image bound by a box (window) that slides over the image from left to right and from top to bottom. While in Figure 107 a rather large stride is shown, in reality the stride is small enough that every portion of the image that could contain a car is passed to the CNN. The process of sliding the window over the whole image is then repeated with windows of different sizes.

Figure 107. A sliding window system with a box (window) sliding through an image used as a bounding box for producing crops of the image that cover its entirety. These crops are fed into a specialized CNN that evaluates the presence $y=1$ or absence $y=0$ of an object.
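A naive implementation of this procedure might look like the following sketch; the window sizes, stride, score threshold and the `model` callable (assumed to map a crop to a probability) are illustrative assumptions.

```python
def sliding_window_detect(image, model, window_sizes=(64, 96, 128), stride=8):
    """Crop every window position at every window size, run the classifier
    CNN on each crop independently, and keep the windows scored as cars.
    `image` is assumed to be an (H, W, 3) array."""
    H, W = image.shape[:2]
    detections = []
    for size in window_sizes:                            # repeat per window size
        for top in range(0, H - size + 1, stride):       # slide top to bottom
            for left in range(0, W - size + 1, stride):  # and left to right
                crop = image[top:top + size, left:left + size]
                if model(crop) > 0.5:                    # one forward pass per crop
                    detections.append((left, top, size, size))
    return detections
```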

An approach like the one described above has a huge computational cost. In general, sliding window detection systems are very computationally heavy, since the model has to be run on every crop produced by the sliding window; with a stride small enough to give an acceptable granularity and an adequate number of window sizes, the number of predictions becomes very large. When the sliding window technique was introduced, the models run on each crop were mostly simple linear classifiers, so the computational cost could be contained, but running a CNN this many times would take too much time for a real-time application such as a self-driving system. In order to overcome this issue, a convolutional implementation of the sliding window is employed in place of the traditional implementation.
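To get a rough sense of scale (all the numbers below are illustrative assumptions, not values from the text), a single frame can easily require thousands of independent CNN evaluations:

```python
# Count of crops for a 600 x 400 frame with stride 8 and three window sizes.
H, W, stride = 400, 600, 8
window_sizes = (64, 96, 128)

n_crops = sum(
    ((H - s) // stride + 1) * ((W - s) // stride + 1) for s in window_sizes
)
print(n_crops)  # 7520 independent forward passes for one frame
```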

Convolutional implementation of the sliding window

The classic implementation of the sliding window would be too slow if the function applied to each window is a CNN. The convolutional implementation of the sliding window is much more efficient. To build towards it, let's explore how to convert fully connected layers into convolutional layers. Suppose we have a CNN as in panel A of the figure below, with some early convolutional layers and some final fully connected layers. In order to convert a fully connected layer into a convolutional representation, we use a number of filters equal to the number of units of that layer ($n_c^{[l]}$), each with the same width and height as its input volume, so that each filter performs a single step of convolution (the stride is thus irrelevant) over the whole input volume (figure below, panel B). The output volume of all these filters will have dimensions $1 \times 1 \times n_c^{[l]}$. By following this strategy, all fully connected layers can be converted to convolutional representations with the same number of units as their fully connected counterparts.

png
Figure 106. A classic CNN that takes as input a $14 \times 14 \times 3$ image, with a few early convolutional layers and some final fully connected layers (A); the same network with the fully connected layers converted into convolutional volumes with the same number of units as their fully connected counterparts (B); a slightly bigger input than anticipated ($16 \times 16 \times 3$) fed to the network of panel B, with the surplus shown as orange cells, and how the surplus propagates through the network (C)
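The sketch below illustrates the conversion of panels A and B; the $5 \times 5 \times 16$ input volume and the 400 units are illustrative assumptions, not values stated in the text. A fully connected layer with 400 units acting on a $5 \times 5 \times 16$ volume corresponds to 400 filters of size $5 \times 5 \times 16$, each applied in a single convolution step, producing a $1 \times 1 \times 400$ output volume.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 5, 5)                    # the 5 x 5 x 16 input volume

fc = nn.Linear(16 * 5 * 5, 400)                 # fully connected version: 400 units
fc_as_conv = nn.Conv2d(16, 400, kernel_size=5)  # convolutional version: 400 filters

print(fc(x.flatten(1)).shape)                   # torch.Size([1, 400])
print(fc_as_conv(x).shape)                      # torch.Size([1, 400, 1, 1])
```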

Suppose we now feed a slightly larger image ($16 \times 16 \times 3$) to the converted network (Figure 106, panel C). Each cell of the resulting output grid corresponds to one $14 \times 14$ window of the input: the top-left cell of the output volume corresponds to a window in the top-left corner of the input image; analogously, the top-right corner of the output volume corresponds to a window in the top-right corner of the input image; and so on.

We can visualize how the activation values of a $14 \times 14$ window propagate across the CNN layers in panel C of Figure 106. The blue window in the $16 \times 16$ input activates exactly the blue windows in all the other layers, up to the blue portion of the output layer. By moving the blue window to the right with a stride $s=2$ we produce the top-right cell of the output layer, and the same applies to the bottom-left and bottom-right blue windows, which produce the bottom-left and bottom-right output cells, respectively.

Instead of running the CNN $n$ times on $n$ windows of the input image and independently producing $n$ predictions, this implementation combines all forward propagations into a single computation: it shares the computation in the regions that are common to the $n$ windows and outputs all $n$ predictions at once.
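The sketch below mirrors this behaviour; the layer sizes follow the $14 \times 14 \times 3$ input of Figure 106, but the specific filter counts and the single car/not-car output unit are illustrative assumptions. Feeding the converted network a $14 \times 14$ image yields one prediction, while feeding it a $16 \times 16$ image yields a $2 \times 2$ grid of predictions, one per $14 \times 14$ window, in a single forward pass.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),    # 14x14x3 -> 10x10x16
    nn.MaxPool2d(2),                               # -> 5x5x16
    nn.Conv2d(16, 400, kernel_size=5), nn.ReLU(),  # "FC as conv" -> 1x1x400
    nn.Conv2d(400, 400, kernel_size=1), nn.ReLU(), # second "FC as conv"
    nn.Conv2d(400, 1, kernel_size=1),              # car / not-car score per window
)

print(net(torch.randn(1, 3, 14, 14)).shape)  # torch.Size([1, 1, 1, 1]) -> 1 window
print(net(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 1, 2, 2]) -> 4 windows
```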

In panel C of Figure 106, the stride with which the windows scan the input ($s=2$) is defined by the filter size $f=2$ of the max-pooling layer.

One of the problems of this approach is that the position of the bounding boxes is not going to be very accurate. This problem can be addressed with an approach called YOLO (You Only Look Once).
