Since 2012, when AlexNet appeared and broke almost all the records in image classification contests, the landscape of the Computer Vision (CV) community has changed completely. The number of Computer Vision solutions based on convolutional neural networks (CNNs) has grown exponentially, and this trend seems to have held steady for the last couple of years.

Because professionally pretrained networks (e.g. Microsoft's ResNet or Google's Inception) are so easy to reuse, implementing and understanding CNNs for tasks like image classification and regression can be relatively easy even for a beginner Deep Learning researcher. But there is a variety of other, more sophisticated visual tasks that can be solved effectively using Deep Learning.

In this post, we will concentrate on tasks where we are interested not only in recognizing what is in an image, but also in automatically inferring the actual position of the detected objects: object detection, image segmentation and object instance segmentation (with masking).

Figure 1: An example of object detection using Faster R-CNN

Illustration of object detection with bounding boxes and classes

Object detection

The main aim of the object detection task is to detect every object (or set of objects) belonging to a predefined set of classes and to find the minimal sufficient bounding box around each object instance.

This problem is usually tackled by first producing a set of candidate region proposals and then classifying each of them using a neural network designed and trained specifically for this purpose. The most difficult part of this pipeline is finding reasonable proposals efficiently.

In the beginning, classic and relatively sophisticated CV techniques were used to find appropriate candidates. This slowly evolved into an end-to-end deep learning solution called Faster R-CNN, where CNNs are used both for finding proposals (the so-called Region Proposal Networks, RPNs) and for classification. It is worth mentioning that these two networks (the RPN and the network used for classification) share a large part of their parameters, which speeds up the training process and increases the accuracy of both the candidate region selection and the final prediction.
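To make the idea of region proposals more concrete, here is a minimal sketch of how candidate boxes (usually called "anchors") can be laid out over a feature map before an RPN scores them. The three scales and three aspect ratios follow the defaults described in the Faster R-CNN paper; the function name `generate_anchors`, the stride of 16 and the tiny 2 x 3 feature map are illustrative assumptions, and the network that scores and refines the anchors is left out entirely.

```python
from itertools import product
from math import sqrt


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x_min, y_min, x_max, y_max) in image pixels.

    One anchor is created per (feature-map cell, scale, aspect ratio).
    `ratio` is height / width; `scale` is the side of a square of equal area.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this feature-map cell, projected back onto the input image.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # Keep the area at scale**2 while changing the shape.
            w = scale / sqrt(ratio)
            h = scale * sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 cells * 9 anchors per cell = 54
```

In a real detector the RPN would assign each of these anchors an "objectness" score and a box correction, and only the best-scoring ones would be passed on to the classification stage.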

Figure 2: Different CNN architectures for image segmentation.

🔭 Object detection is used to track moving objects. It's useful in analyzing construction sites, city traffic and security camera footage. This is how it helps our client.

Image segmentation

In its most basic definition, image segmentation is a pixelwise classification: the algorithm assigns to every pixel one class from a predefined set of classes (or a probability distribution that measures how likely the pixel is to belong to each class). Usually an extra background class is added for pixels that belong to none of the other classes.
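The pixelwise classification described above can be sketched in a few lines of plain Python. The class names, the 2 x 2 "image" and its scores are made up for illustration; in practice the scores would come from a network, and the same logic would be written with NumPy or a deep learning framework.

```python
from math import exp

CLASSES = ["background", "road", "car"]  # hypothetical label set


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment(score_map):
    """Map an H x W grid of per-class scores to an H x W grid of class indices."""
    label_map = []
    for row in score_map:
        label_row = []
        for pixel_scores in row:
            probs = softmax(pixel_scores)
            # Pick the most probable class for this pixel.
            label_row.append(probs.index(max(probs)))
        label_map.append(label_row)
    return label_map


# A made-up 2 x 2 "image": each pixel holds one raw score per class.
scores = [
    [[2.0, 0.1, 0.3], [0.2, 3.0, 0.1]],
    [[0.1, 0.2, 2.5], [1.5, 0.4, 0.3]],
]
labels = segment(scores)
print(labels)  # [[0, 1], [2, 0]]
print([[CLASSES[i] for i in row] for row in labels])
```

Index 0 plays the role of the extra background class: a pixel ends up there whenever none of the "real" classes scores higher.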

The main difficulty of this task is that most existing CNN topologies favor increasing the semantic complexity of the deeper filters in a network over their actual spatial precision. This is mainly achieved by pooling/downsampling layers, which decrease the resolution of the feature maps. For regression and classification such an approach is reasonable, mainly because only the general features of an image need to be detected properly. If the network is to produce a full-resolution feature map, however, these downsampling techniques might result in really poor performance.
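A small, hedged illustration of why downsampling hurts spatial precision: the sketch below applies 2 x 2 max pooling (a common choice, assumed here) to a toy 4 x 4 feature map written as nested lists. The helper name `max_pool_2x2` is made up for this example.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2D list (H and W must be even)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        pooled_row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# A 4 x 4 map with a single strong activation at row 1, column 2.
fmap = [
    [0, 0, 0, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
once = max_pool_2x2(fmap)
twice = max_pool_2x2(once)
print(once)   # [[0, 9], [0, 0]]
print(twice)  # [[9]]
```

After two pooling steps the network still knows that the feature is present, which is enough for classification, but it can no longer tell which of the 16 original pixels it came from.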


To prevent that, a learned upsampling technique called deconvolution (also known as transposed convolution) is used. This method works like a convolution that upsamples its input, which makes it possible to increase the resolution of a feature map within the CNN data flow. Most of the time these upsampled feature maps are combined with earlier feature maps in the network, or even with the original input image, in order to obtain more accurate predictions. The most popular architectures that work in this manner are U-Nets and Fully Convolutional Networks (FCNs).
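As a rough sketch of this upsampling step, the following plain-Python function implements a single-channel transposed convolution with stride 2 and no padding, then pairs the result with an earlier full-resolution map. The all-ones kernel, the toy values and the use of concatenation for the skip connection (U-Net style; FCNs add the maps instead) are assumptions made for illustration. A trained network would learn the kernel weights.

```python
def transposed_conv2d(feature_map, kernel, stride=2):
    """Upsample a 2D list with a transposed convolution (no padding).

    Each input value "stamps" a scaled copy of the kernel onto the output;
    overlapping stamps are summed.
    """
    in_h, in_w = len(feature_map), len(feature_map[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(in_h):
        for c in range(in_w):
            for i in range(k_h):
                for j in range(k_w):
                    out[r * stride + i][c * stride + j] += feature_map[r][c] * kernel[i][j]
    return out


coarse = [[1.0, 2.0],
          [3.0, 4.0]]
# An all-ones 2 x 2 kernel simply repeats each value (nearest-neighbour upsampling).
kernel = [[1.0, 1.0],
          [1.0, 1.0]]
upsampled = transposed_conv2d(coarse, kernel)  # 2 x 2 -> 4 x 4

# Skip connection: pair every upsampled value with the value at the same
# position in an earlier, full-resolution feature map (channel concatenation).
earlier = [[0.1 * (r * 4 + c) for c in range(4)] for r in range(4)]
fused = [[(u, e) for u, e in zip(u_row, e_row)]
         for u_row, e_row in zip(upsampled, earlier)]

print(upsampled[0])                                 # [1.0, 1.0, 2.0, 2.0]
print(len(fused), len(fused[0]), len(fused[0][0]))  # 4 4 2
```

The fused map has full resolution and two values per position: a coarse but semantically rich one from the upsampling path and a spatially precise one from the earlier layer.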

Figure 3: An object instance segmentation with masking using Mask R-CNN.

Object instance segmentation with masking

This task might be considered a mixture of the previous two. Here the main purpose is to find a binary mask for every instance of an object from a predefined set of classes.
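To show what such an output might look like, here is a minimal sketch with two hypothetical instances of the same class, each with its own binary mask. The dictionary layout, the class name and the helper `mask_to_box` are assumptions for this example, not the output format of any particular library.

```python
def mask_to_box(mask):
    """Return the tightest (x_min, y_min, x_max, y_max) around the 1-pixels of a mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None  # empty mask, no object
    return (cols[0], rows[0], cols[-1], rows[-1])


# Two hypothetical instances of the same class ("car") in a 4 x 5 image.
instances = [
    {"label": "car", "mask": [[1, 1, 0, 0, 0],
                              [1, 1, 0, 0, 0],
                              [0, 0, 0, 0, 0],
                              [0, 0, 0, 0, 0]]},
    {"label": "car", "mask": [[0, 0, 0, 0, 0],
                              [0, 0, 0, 1, 1],
                              [0, 0, 0, 1, 1],
                              [0, 0, 0, 0, 1]]},
]

for inst in instances:
    print(inst["label"], mask_to_box(inst["mask"]))
# car (0, 0, 1, 1)
# car (3, 1, 4, 3)
```

Plain image segmentation would merge both cars into one "car" region, while object detection would return only the two boxes. Instance segmentation keeps one mask per object, and a bounding box can be derived from each mask.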

The current state-of-the-art solution for this problem, Mask R-CNN, is a brilliant mixture of the techniques used for both object detection and image segmentation. Whereas in Faster R-CNN the shared convolutional layers are the base for both the region proposal and the classification outputs, Mask R-CNN adds a further segmentation branch on top of them.

Thanks to this branch, the model provides not only a bounding box for each instance but also a full object mask. Furthermore, because training the additional segmentation branch also influences the shared convolutional base, it has a positive effect on the other output branches of the model. As a result, a model designed for object instance segmentation also achieved a new record in object detection on the highly recognized MS COCO 2016 dataset.


As you might have noticed from this brief summary, image recognition and regression are only the tip of the iceberg when it comes to successful deep learning applications. More and more real-life applications may experience a huge boost thanks to ongoing algorithmic and computational progress.

Moreover, because more and more solutions are designed in an end-to-end manner, the need for Deep Learning expertise is rapidly decreasing.
