In this post we show how to encode the position and scale of an object into a segmentation map. Deep convolutional neural networks can then be trained to produce such segmentation masks for input images. We can then decode the bounding boxes of objects from these masks using simple (classical) computer-vision algorithms. We use these principles to build a face detector (face detection makes for an effective demonstration, and there are plenty of annotated datasets available). The basic idea of our approach is illustrated in the following figure.
The idea is to place a diagonal line segment (a "slash") across each face within the image. The location of the segment (its centroid) corresponds to the location of the face. The length of the segment specifies the size of the face. This simple encoding scheme is easily decoded into face bounding boxes at runtime (any line-segment finding algorithm can be used). Of course, there are potential problems. For example, in a crowded scene, two line segments coming from two different faces could get merged into a single one. However, we do not expect such issues in the common scenarios for which our system is intended.
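The encoding and decoding steps described above can be sketched with NumPy alone. This is a minimal illustration, not the code used for the post: the function names, the `(row, col, size)` face representation, the orientation of the slash, and the use of a flood fill to find segments are all assumptions made for the example.

```python
import numpy as np

def encode_faces(shape, faces):
    """Draw one diagonal segment per face into a binary mask.

    faces: list of (row, col, size), where (row, col) is the face center and
    size is the side of its square bounding box in pixels (assumed format).
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for r, c, s in faces:
        half = s // 2
        for d in range(-half, half + 1):
            rr, cc = r - d, c + d  # "/" orientation: row falls as column grows
            if 0 <= rr < shape[0] and 0 <= cc < shape[1]:
                mask[rr, cc] = 1
    return mask

def decode_faces(mask):
    """Recover (row, col, size) for every 8-connected component of the mask."""
    seen = np.zeros(mask.shape, dtype=bool)
    faces = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        # Flood fill to collect all pixels of this segment.
        stack, pixels = [(r0, c0)], []
        seen[r0, c0] = True
        while stack:
            r, c = stack.pop()
            pixels.append((r, c))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < mask.shape[0] and 0 <= cc < mask.shape[1]
                            and mask[rr, cc] and not seen[rr, cc]):
                        seen[rr, cc] = True
                        stack.append((rr, cc))
        rows, cols = np.array(pixels).T
        # Centroid gives the face location, extent gives the face size.
        size = max(rows.max() - rows.min(), cols.max() - cols.min())
        faces.append((int(round(rows.mean())), int(round(cols.mean())), int(size)))
    return faces

mask = encode_faces((100, 120), [(30, 40, 20), (70, 90, 40)])
decoded = decode_faces(mask)
```

At runtime the network outputs per-pixel scores rather than a clean binary mask, so the scores would first be thresholded (for example at 0.5, which is again an assumption) before a decoder like this one is applied. The merged-segment problem mentioned above shows up here directly: two touching segments form one connected component and are decoded as a single face.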
Modern object detectors are based on deep learning. One of the most successful such methods is Faster R-CNN, which routinely tops many benchmarks (e.g., on the COCO dataset). In our opinion, the main drawback of this approach is the complexity of its implementation (it has a lot of "moving" parts). For example, training the Faster R-CNN system requires tuning four different loss functions simultaneously. These loss functions model the quality of:
These loss functions have different scales and thus require hand-picked multipliers when they are summed to form the final loss that is optimized with SGD. The best multipliers might depend on the dataset.
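Written as a formula, the final loss is a weighted sum of the four terms. The symbols are our own notation, introduced only for illustration: $L_i$ denotes the $i$-th loss and $\lambda_i$ its hand-picked multiplier.

$$
L_{\text{total}} = \sum_{i=1}^{4} \lambda_i L_i
$$

Each $\lambda_i$ is a hyperparameter, so a detector of this kind has four extra knobs to tune on top of the usual ones such as the learning rate.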
Other object detection frameworks, such as YOLO and SSD, suffer from similar issues.
In this post we describe a simple object-detection method based on semantic segmentation, which is, in our opinion, much simpler to implement and tune than typical object detectors. We showcase the proposed method by developing a system for detecting human faces. Of course, other detection tasks could also be tackled in this manner, as long as you can find a dataset with annotated bounding boxes and have a sufficiently large neural network for extracting features from the images (face detection is a relatively simple task so we can get away with a small network).
Of course, we do not claim that our approach can result in detection accuracy comparable or superior to modern object-detection pipelines. Actually, our educated guess is that it most likely would not (thoroughly verifying this hypothesis is out of the scope of this blog post). As already mentioned, we are mainly focused on implementation simplicity.
We use the standard semantic segmentation pipeline to solve our problem.
Semantic segmentation is the process of labeling each pixel within the image with a class tag from a predefined set of classes. This is useful when a detailed understanding of the image is required. One example is traffic scenes: knowing which pixels represent the road lets navigation software propose a driving route. Another obvious example comes from medical image processing: locating tumors (or other signs of disease) to help doctors decide which treatment to apply.
Of course, the idea is to automate image segmentation through the use of machine-learning algorithms. Earlier systems achieved this using random forests, as in the influential Kinect body-part detector by the Microsoft Research team. Nowadays, deep, fully convolutional networks dominate the area.
To solve the segmentation task, we require our model to output a prediction of the same (or approximately the same) spatial size as the input image. We call such problems dense prediction problems (besides segmentation, others include, for example, optical-flow calculation and depth prediction). Particularly effective architectures for dense prediction follow the so-called U-Net design, described by Ronneberger et al. This design follows the usual encoder-decoder architecture, but each upsampling step also uses the feature maps computed in the corresponding earlier downsampling step. This is visualised in the following figure, which is taken directly from the original paper:
We use this design principle in our face-detection model. However, to achieve real-time detection rates, all the internal feature maps have $16$ channels. This makes the model quite small: it requires only around 100 kB of memory for storage. We name this model tiny U-Net due to these properties.
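Two aspects of this design can be sketched in a few lines of NumPy: how skip connections route encoder feature maps into the decoder, and why 16-channel layers keep the model near 100 kB. The sketch is not the actual tiny U-Net. The depth of 3, the pooling and upsampling operators, the random `mix` projection standing in for learned convolutions, the 3x3 kernel size, and float32 storage are all assumptions made for illustration.

```python
import numpy as np

CHANNELS = 16  # width of every internal feature map, as stated in the post

def downsample(x):
    """Halve the spatial size of a (C, H, W) map with 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double the spatial size by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def mix(x, out_channels, rng):
    """Stand-in for a learned convolution: a random 1x1 channel projection."""
    weights = rng.normal(size=(out_channels, x.shape[0]))
    return np.einsum("oc,chw->ohw", weights, x)

def unet_forward(x, depth=3, seed=0):
    """Trace a (C, H, W) input through an encoder-decoder with skip connections."""
    rng = np.random.default_rng(seed)
    skips = []
    for _ in range(depth):              # encoder: remember the map, then shrink
        skips.append(x)
        x = downsample(x)
    for skip in reversed(skips):        # decoder: grow, then merge with encoder map
        x = np.concatenate([upsample(x), skip], axis=0)  # skip connection
        x = mix(x, CHANNELS, rng)       # project 32 channels back to 16
    return x

def conv_params(c_in, c_out, k=3):
    """Number of weights plus biases in one k x k convolution layer."""
    return c_in * c_out * k * k + c_out

out = unet_forward(np.zeros((CHANNELS, 64, 64)))
bytes_per_layer = 4 * conv_params(CHANNELS, CHANNELS)  # float32 storage
layers_in_100kb = 100_000 // bytes_per_layer
```

Under these assumptions a 16-to-16 convolution has 2,320 parameters (about 9 kB), so a 100 kB budget holds roughly ten such layers. Decoder layers that consume concatenated maps have 32 input channels and cost about twice as much.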
As already explained in the introduction, we classify each pixel into one of the two classes:
See the figure from the introductory section for clarification.
To train the detection/segmentation network, we use the Caltech Web Faces dataset. The dataset contains 10,524 annotated frontal human faces in various resolutions and settings. The annotations contain the coordinates of the eyes, the nose and the center of the mouth for each frontal face. Some examples can be seen in the following GIF:
Modern pipelines for face alignment (e.g., this one) can use such data to locate facial features with great reliability. However, we are interested just in the location of each face. Thus, we use the provided data to generate a large training set of faces with accompanying bounding boxes, and, consequently, ground-truth segmentation masks which we use in the training process. The bounding boxes are computed by using the annotated facial features and simple anthropometric relations.
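The post does not spell out the anthropometric relations, so the following is only a hypothetical example of what such a rule could look like. The function name, the choice of landmarks, and the `scale` factor are assumptions, not the relations actually used to build the training set.

```python
import numpy as np

def face_box(left_eye, right_eye, mouth, scale=2.0):
    """Estimate a square face box (center_x, center_y, side) from landmarks.

    Assumed rule, for illustration only: the box is centered halfway between
    the eye midpoint and the mouth, and its side is `scale` times the
    distance between the eyes.
    """
    left_eye, right_eye, mouth = map(np.asarray, (left_eye, right_eye, mouth))
    eye_mid = (left_eye + right_eye) / 2.0
    center = (eye_mid + mouth) / 2.0
    side = scale * np.linalg.norm(right_eye - left_eye)
    return float(center[0]), float(center[1]), float(side)

# Landmarks as (x, y) pixel coordinates: eyes 40 px apart, mouth 40 px below them.
box = face_box((40, 50), (80, 50), (60, 90))
```

A box obtained this way has a center and a side length, which is what is needed to draw the ground-truth segment for a face.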
Even though our per-pixel classification task has two classes, the model outputs a feature map with a single channel. The values are constrained to the (0, 1) interval by applying a sigmoid to each element of the output feature map. Note that a more typical approach would be to produce two feature maps and then apply a softmax across the two channels at each pixel. That approach is useful in a more general setting, when there are multiple classes to segment.
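For two classes, the single-channel sigmoid and the two-channel softmax are equivalent, which the following NumPy check illustrates. The helper names and the random test logits are ours. The identity itself is standard: a softmax over the logits $(0, z)$ assigns probability $1 / (1 + e^{-z})$ to the second class.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 4))                       # single-channel logits

one_channel = sigmoid(z)                          # shape (4, 4)
two_channel = softmax(np.stack([np.zeros_like(z), z]))[1]  # prob. of class 1

same = np.allclose(one_channel, two_channel)
```

More generally, a two-channel softmax with logits $(a, b)$ gives the second class probability $\sigma(b - a)$, so the second output channel adds no expressive power in the binary case.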
To train our tiny U-Net, we use RMSProp with the learning rate set to $0.0001$. The gradient for each iteration is computed on a randomly sampled minibatch consisting of $32$ images. We train our model for around $500$ epochs (i.e., 500 passes over the dataset). This takes around one day on a machine with a modern GPU.
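For readers unfamiliar with RMSProp, here is a minimal sketch of its update rule applied to a toy quadratic. Only the learning rate of $0.0001$ comes from the post. The decay rate of 0.9, the `eps` constant, and the toy objective are assumptions, and deep-learning frameworks may implement slightly different variants.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-4, decay=0.9, eps=1e-8):
    """One RMSProp update: scale the step by a running RMS of past gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])       # toy parameters
cache = np.zeros_like(w)        # running average of squared gradients
for _ in range(1000):
    grad = 2.0 * w              # gradient of f(w) = ||w||^2
    w, cache = rmsprop_step(w, grad, cache)
```

Because the gradient is divided by its running magnitude, each parameter moves by roughly `lr` per step regardless of the gradient's scale. With such a small learning rate, many iterations are needed, which is consistent with the long training time mentioned above.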
The obtained results are encouraging given the size of the network and the amount of training data. See the following images for some examples.
Tiny U-Net is also capable of processing frames in real time. We showcase this in our in-browser demo. Note that tiny U-Net was not trained to detect non-frontal faces. Also, the demo should get faster as the speed of WebAssembly programs improves.