As we all know, object detection is the task of locating objects in an image with bounding boxes. What if we want more precise information about an object? We could go beyond a rectangle (bounding box), perhaps to a polygon that fits the object more tightly. But that is still not the best we can do. The best representation labels every pixel inside the bounding box that actually belongs to the object. This task is called instance segmentation, because you segment the individual object instances.
In this guide, we look in depth at a state-of-the-art (SOTA) method for instance segmentation using deep learning. It is called Mask R-CNN (He et al., listed in the references), published by the Facebook AI Research (FAIR) team at ICCV 2017. The post assumes a basic understanding of deep learning and CNNs for object detection. For easier understanding, I'll use code samples in PyTorch, as it's pretty popular these days. The excellent Keras implementation is also listed in the references. This guide to instance segmentation with deep learning gives you detailed information about the human pose prediction, object detection, and instance segmentation shown in the image below.
The bounding boxes are the object detection labels, while the segmentation maps are the instance segmentation labels.
Mask R-CNN builds on Faster R-CNN, a very popular method for object detection. The authors add another head (branch) for the segmentation task, which brings the total number of branches to three: classification, bounding box regression, and segmentation. They also replace the ROIPooling step of Faster R-CNN with a new ROIAlign layer. We won't go into the details of Faster R-CNN in this post, but enough will be explained to understand Mask R-CNN.
The authors focus on a simple, basic network design to show the effectiveness of the idea. They reach SOTA without any complementary techniques (e.g. OHEM, multi-scale train/test). Such techniques could easily be added to improve accuracy further, but they are outside the scope of the paper.
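    # Faster R-CNN heads on RoI-pooled features of shape 512 x 7 x 7.
    # FC is assumed to be a fully connected layer helper with an optional ReLU.
    self.fc6 = FC(512 * 7 * 7, 4096)
    self.fc7 = FC(4096, 4096)
    self.score_fc = FC(4096, self.n_classes, relu=False)     # classification scores
    self.bbox_fc = FC(4096, self.n_classes * 4, relu=False)  # 4 box offsets per class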
Figure 3. Head Architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/Right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27], respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context (conv preserves spatial dimensions while deconv increases it). All convs are 3x3, except the output conv which is 1x1, deconvs are 2x2 with stride 2, and we use ReLU in hidden layers. Left: 'res5' denotes ResNet's fifth stage, which for simplicity we altered so that the first conv operates on a 7x7 RoI with stride 1 (instead of 14x14 / stride 2 as in [19]). Right: 'x4' denotes a stack of four consecutive convs.
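To make the caption's "numbers denote spatial resolution and channels" concrete, here is a small sketch that traces the shapes through the FPN mask branch using the standard output-size formulas for convolutions and transposed convolutions. The 14x14x256 input, the 80 output classes (as in COCO), and a padding of 1 for the 3x3 convs are assumptions made for illustration; the function names are hypothetical and not taken from any of the referenced repositories.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_fpn_mask_head(roi_size=14, channels=256, num_classes=80):
    """Return the (height, width, channels) shape after each layer.

    Assumed layout: four 3x3 convs, one 2x2 stride-2 deconv,
    and a final 1x1 conv that outputs one mask per class.
    """
    shapes = [(roi_size, roi_size, channels)]
    size = roi_size
    for _ in range(4):
        # 3x3 conv with (assumed) padding 1 preserves the spatial size.
        size = conv_out(size, kernel=3, stride=1, padding=1)
        shapes.append((size, size, channels))
    # 2x2 deconv with stride 2 doubles the spatial size.
    size = deconv_out(size, kernel=2, stride=2)
    shapes.append((size, size, channels))
    # 1x1 output conv changes only the channel count.
    size = conv_out(size, kernel=1)
    shapes.append((size, size, num_classes))
    return shapes


if __name__ == "__main__":
    for shape in trace_fpn_mask_head():
        print(shape)
```

Under these assumptions the trace ends at one 28x28 mask per class, which shows how the deconv is the only layer that changes the resolution.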
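The mask loss (L_mask) is again a cross-entropy loss. In the paper it is the average per-pixel binary cross-entropy, computed only on the mask predicted for the RoI's ground-truth class. The total loss is the sum L = L_cls + L_box + L_mask, and the network is trained on all three heads simultaneously.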
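As a rough illustration of the mask loss, the sketch below computes an average per-pixel binary cross-entropy on the mask channel of the ground-truth class and adds it to the other two losses. It is framework-free Python with hypothetical function names, written under the assumption that the mask head outputs one logit map per class for each RoI; a real implementation would use the tensor operations of PyTorch or Keras.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy for a single RoI.

    mask_logits: list of K logit maps (one m x m list of lists per class).
    gt_mask:     m x m list of lists with 0/1 entries.
    gt_class:    index of the ground-truth class; only that map contributes.
    """
    logits = mask_logits[gt_class]
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, gt_mask):
        for logit, target in zip(logit_row, target_row):
            total += bce_with_logits(logit, target)
            count += 1
    return total / count


def total_loss(l_cls, l_box, l_mask):
    """Multi-task loss: plain sum of the three head losses."""
    return l_cls + l_box + l_mask


if __name__ == "__main__":
    logits = [[[0.0, 0.0], [0.0, 0.0]],        # class 0: undecided everywhere
              [[50.0, -50.0], [50.0, -50.0]]]  # class 1: confident and correct
    target = [[1, 0], [1, 0]]
    print(mask_loss(logits, target, gt_class=0))
    print(mask_loss(logits, target, gt_class=1))
```

Because only the ground-truth class channel enters the loss, the masks of different classes do not compete with each other.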
Another important contribution is the ROIAlign layer, which replaces ROIPool from Faster R-CNN. ROIAlign does not round the (x/spatial_scale) fraction to an integer, as ROIPool does. Instead, it uses bilinear interpolation to compute the feature values at those floating-point locations. The same idea applies when the ROI is divided into output bins: ROIPooling quantizes the bin boundaries to integers, whereas ROIAlign keeps them as floating-point values.
For example, let's assume the ROI height and width are 54 and 167 respectively. The spatial scale is the image size divided by the feature map size (H/h, W/w); it is also called the stride in this context. Usually the image is square, so we use a single value.
Let's assume H=224 and h=14, which gives a spatial scale of 16. The dimensions of the corresponding region in the feature map are then 54/16 = 3.375 and 167/16 = 10.4375. ROIPool rounds these to 3 and 10, whereas ROIAlign keeps the fractional values 3.375 and 10.4375.
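The arithmetic for this example can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and rounding to the nearest integer is assumed for the ROIPool-style quantization.

```python
def roi_on_feature_map(roi_h, roi_w, image_size, feature_size):
    """Project an ROI's height and width onto the feature map.

    Returns the stride, the exact (ROIAlign-style) size, and the
    quantized (ROIPool-style) size obtained by rounding.
    """
    stride = image_size / feature_size           # e.g. 224 / 14 = 16
    exact = (roi_h / stride, roi_w / stride)     # kept as floats by ROIAlign
    quantized = (round(exact[0]), round(exact[1]))  # rounded by ROIPool
    return stride, exact, quantized


if __name__ == "__main__":
    stride, exact, quantized = roi_on_feature_map(54, 167, 224, 14)
    print(stride)     # 16.0
    print(exact)      # (3.375, 10.4375)
    print(quantized)  # (3, 10)
```

The gap between the exact and the quantized sizes is the misalignment that ROIAlign is designed to avoid.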
Similar logic goes into dividing the corresponding region into bins according to the ROIAlign output shape (e.g. 7x7). Code for this step can be found in the PyTorch implementations listed in the references.
The paper backs these statements with detailed explanations and ablation studies.
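The idea can be sketched in plain Python as follows. This is a simplified illustration, not the code from the referenced repositories: it handles a single-channel feature map stored as a list of lists, treats feature values as sitting at integer coordinates, and averages a small grid of bilinear samples per bin (the `samples` parameter is an assumption). As in the text above, `spatial_scale` is the stride, so image coordinates are divided by it.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a floating-point (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid range so the four neighbours always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, spatial_scale, out_size=7, samples=2):
    """Pool one ROI into an out_size x out_size grid without any rounding.

    roi is (x1, y1, x2, y2) in image pixels; spatial_scale is the stride.
    """
    x1, y1, x2, y2 = (c / spatial_scale for c in roi)  # floats, no rounding
    bin_h = (y2 - y1) / out_size                       # fractional bin sizes
    bin_w = (x2 - x1) / out_size
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            # Average a regular samples x samples grid inside the bin.
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        output.append(row)
    return output


if __name__ == "__main__":
    # 14x14 feature map whose value equals its column index.
    ramp = [[float(x) for x in range(14)] for _ in range(14)]
    pooled = roi_align(ramp, (16, 32, 183, 86), spatial_scale=16)
    print([round(v, 3) for v in pooled[0]])
```

With the ramp feature map, every output value equals the horizontal centre of its bin, which makes it easy to see that no position information is lost to quantization.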
[1] He, Kaiming, Georgia Gkioxari, Piotr Dollár and Ross B. Girshick. “Mask R-CNN.” *2017 IEEE International Conference on Computer Vision (ICCV)* (2017): 2980-2988.
[2] Ren, Shaoqing, Kaiming He, Ross B. Girshick and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” *IEEE Transactions on Pattern Analysis and Machine Intelligence* 39 (2015): 1137-1149.
[3] "Faster R-CNN, PyTorch", https://github.com/longcw/faster_rcnn_pytorch
[4] "Mask R-CNN, PyTorch", https://github.com/soeaver/Pytorch_Mask_RCNN
[5] Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.
[6] "Mask R-CNN, Keras", https://github.com/matterport/Mask_RCNN