
Instance Segmentation using Deep Learning


sri krishna
03rd May, 2018

Object detection is the task of detecting objects in an image in the form of bounding boxes. What if we wanted more precise information about an object? You could go beyond a rectangle (bounding box) to, say, a polygon that fits the object more tightly. But even that isn't the best representation. The best approach is to label every pixel inside the bounding box that actually belongs to the object. This task is called instance segmentation: segmenting the individual object instances.

In this guide, we are going to look in depth at a state-of-the-art (SOTA) method for instance segmentation using deep learning: Mask R-CNN [1], published by the Facebook AI Research (FAIR) team at ICCV 2017. The post assumes a basic understanding of deep learning and CNNs for object detection. For easier understanding, I'll be using code samples in PyTorch, as it's quite popular these days. An excellent Keras implementation is also listed in the references [6]. This guide will walk you through object detection and instance segmentation, starting from the kinds of labels shown in the image below.

The bounding boxes are object detection labels, while the segmentation maps are the instance segmentation labels.


Core Idea

Mask R-CNN builds on the very popular object detection method Faster R-CNN. The authors add another head (branch) for the segmentation task, bringing the total to three branches: classification, bounding-box regression, and segmentation. They also improve on the ROIPooling step in Faster R-CNN, proposing an ROIAlign layer in its place. We won't go into the details of Faster R-CNN in this post, but enough will be explained to understand Mask R-CNN.


Objective

The authors' focus is on using a simple, basic network design to show the effectiveness of the idea. They reach SOTA without any complementary techniques (e.g., OHEM, multi-scale train/test), which could easily be used to further improve accuracy but are outside the scope of the paper.


Backbones — ResNets, FPNs and Faster R-CNN

  • Mask R-CNN is a two-stage network for instance-level object understanding, just like Faster R-CNN. The first stage is a region proposal network (RPN), and the second stage is the combined object detection and segmentation network.
  • The first stage is identical to Faster R-CNN's. The RPN is proposed and explained in depth in the Faster R-CNN paper [2].
  • The second stage has two parts: (1) a feature extractor and (2) task-specific heads (branches).
  • The feature extractor, as the name suggests, is interchangeable and serves as a backbone for extracting features. A very popular choice used to be the VGG network [5], which was used in the Faster R-CNN paper a few years ago. Better feature extractors have come up since, namely ResNets and, more recently, Feature Pyramid Networks (FPNs), which build on ResNets. The details of these networks are beyond the scope of this post.
  • The task-specific heads are parallel networks that are trained together. A code sample, taken from the Faster R-CNN code in PyTorch [3], is shown below:
self.fc6 = FC(512 * 7 * 7, 4096)                         # shared fully connected layer (FC is the repo's linear + optional ReLU helper)
self.fc7 = FC(4096, 4096)                                # second shared fully connected layer
self.score_fc = FC(4096, self.n_classes, relu=False)     # classification head: one score per class
self.bbox_fc = FC(4096, self.n_classes * 4, relu=False)  # box regression head: 4 offsets per class
  • Here, fc6 and fc7 are plain fully connected layers, while score_fc and bbox_fc are the predictors for the classification scores and the bounding-box coordinates (or offsets), respectively. These are referred to as heads or branches. (Note that both predictors operate on the same features, which come from fc7.)
  • The loss is the sum of the classification loss (L_cls) and the bounding-box loss (L_box), where L_cls is a cross-entropy loss and L_box is a smooth L1 loss; a small sketch of combining them is shown right after this list.
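As a rough illustration of how those two losses combine in PyTorch, here is a minimal sketch; the ROI count, class count, and tensor contents are made up for the example:

import torch
import torch.nn as nn

cls_loss_fn = nn.CrossEntropyLoss()   # L_cls on the score_fc output
box_loss_fn = nn.SmoothL1Loss()       # L_box on the bbox_fc output

# Dummy predictions and targets: 8 ROIs, 21 classes (shapes are illustrative).
scores = torch.randn(8, 21, requires_grad=True)         # score_fc output
labels = torch.randint(0, 21, (8,))                     # ground-truth class per ROI
bbox_pred = torch.randn(8, 21 * 4, requires_grad=True)  # bbox_fc output: 4 offsets per class
bbox_target = torch.randn(8, 21 * 4)

loss = cls_loss_fn(scores, labels) + box_loss_fn(bbox_pred, bbox_target)
loss.backward()  # gradients flow into both heads from the summed loss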


Mask Head

  • One of the paper's main contributions is the addition of the mask head for the instance segmentation task. It is a fully convolutional network, unlike the other heads, which are FC layers.
  • The output of the segmentation task should be a segmentation map big enough to represent an object of average size. The network architecture, taken from the paper, is shown below.

Figure 3. Head architecture: We extend two existing Faster R-CNN heads [19, 27]. Left/right panels show the heads for the ResNet C4 and FPN backbones, from [19] and [27] respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers, as can be inferred from context (conv preserves the spatial dimension while deconv increases it). All convs are 3x3, except the output conv which is 1x1; deconvs are 2x2 with stride 2; and we use ReLU [30] in hidden layers. Left: 'res5' denotes ResNet's fifth stage, which for simplicity we altered so that the first conv operates on a 7x7 RoI with stride 1 (instead of 14x14 / stride 2 as in [19]). Right: 'x4' denotes a stack of four consecutive convs.

  • Let's take the FPN backbone for the explanation (similar logic applies to the ResNet backbone as well).
  • The ROI feature maps from the backbone are passed through a stack of four convolution layers with a constant number of feature maps (256), followed by a deconvolution layer (stride 2) that increases the spatial resolution from 14x14 to 28x28. The last (output) conv is a 1x1 convolution whose number of feature maps equals the number of classes.
  • Sample code to make the above concrete is linked below; it is PyTorch Mask R-CNN code taken from [4]. Batch normalization is a normalization layer used after most conv layers to help training converge faster and more stably.

https://gist.github.com/skrish13/e9bc482f18708ae10e5d9511fbae302b
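In case the gist doesn't render, here is a minimal, self-contained sketch of the mask branch just described; the channel counts follow the paper, while the class count and the batch-norm placement are assumptions for illustration:

import torch
import torch.nn as nn

class MaskHead(nn.Module):
    # Sketch of the FPN mask branch: four 3x3 convs (256 maps each),
    # a 2x2 stride-2 deconv, and a 1x1 output conv, one map per class.
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        convs = []
        for _ in range(4):  # the 'x4' stack of 3x3 convs
            convs += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                      nn.BatchNorm2d(256),
                      nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*convs)
        # 2x2 stride-2 deconv: upsamples 14x14 ROI features to 28x28
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 output conv: one mask channel per class
        self.mask_out = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):                  # x: (N, 256, 14, 14) ROI features
        x = self.convs(x)
        x = self.relu(self.deconv(x))
        return self.mask_out(x)            # (N, num_classes, 28, 28)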
The mask loss (L_mask) is a per-pixel binary cross-entropy applied to the mask channel of the ground-truth class (a sketch follows below). So the total loss is the sum of L_cls, L_box, and L_mask, and the network is trained on all three heads simultaneously.
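A rough sketch of that mask loss, again with assumed shapes, selecting only the ground-truth class channel:

import torch
import torch.nn.functional as F

N, K = 4, 81                                         # ROIs and classes (illustrative)
mask_logits = torch.randn(N, K, 28, 28, requires_grad=True)  # mask head output
gt_classes = torch.randint(0, K, (N,))               # ground-truth class per ROI
gt_masks = torch.randint(0, 2, (N, 28, 28)).float()  # binary 28x28 targets

# Only the channel of the ground-truth class is penalized, so the
# per-pixel sigmoids of different classes don't compete.
picked = mask_logits[torch.arange(N), gt_classes]    # (N, 28, 28)
L_mask = F.binary_cross_entropy_with_logits(picked, gt_masks)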

ROI Align

One of the paper's other important contributions is the ROIAlign layer, which replaces ROIPool (from Faster R-CNN). ROIAlign does not round the (x / spatial_scale) fraction to an integer the way ROIPool does. Instead, it uses bilinear interpolation to compute the feature values at those floating-point positions. The same idea applies when assigning spatial portions to output bins: the bin boundaries stay floating-point rather than being quantized to integers as in ROIPooling.

For example, let's assume the ROI height and width are 54 and 167, respectively. The spatial scale is basically image size / feature-map size (H/h, W/w); it is also called the stride in this context. Usually it is square, so we just use one number.

Let's assume H = 224 and h = 14, which gives a spatial scale of 16. The dimensions of the corresponding portion of the feature map are then:

  • ROIPool: 54/16, 167/16 → 3, 10 (the fractions are quantized to integers)
  • ROIAlign: 54/16, 167/16 = 3.375, 10.4375
  • With bilinear interpolation we can sample the feature values at exactly those fractional positions, so the remaining 0.375 × 16 and 0.4375 × 16 pixels are not lost.

Similar logic applies when dividing the region into the bins of the ROIAlign output shape (e.g., 7x7). The code example below is taken from [4].
https://gist.github.com/skrish13/4e10fb46017b7abf459d1eabe5967041
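For a quick experiment, torchvision also ships ready-made ops for both layers, so you can compare them directly; this small sketch uses a made-up feature map and an ROI matching the numbers above:

import torch
from torchvision.ops import roi_align, roi_pool

fmap = torch.randn(1, 256, 14, 14)        # backbone feature map, h = w = 14
# One ROI in image coordinates (batch_index, x1, y1, x2, y2):
# 167 wide and 54 tall, matching the example above.
rois = torch.tensor([[0.0, 10.0, 20.0, 177.0, 74.0]])

scale = 14 / 224                          # spatial scale = 1/16
pooled = roi_pool(fmap, rois, output_size=(7, 7), spatial_scale=scale)
aligned = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=scale,
                    sampling_ratio=2)
print(pooled.shape, aligned.shape)        # both: torch.Size([1, 256, 7, 7])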

Other Experiments

The paper provides extensive explanations and ablation studies supporting the following statements.

  • Multinomial vs. independent masks (softmax vs. sigmoid): the output of the mask head can be a K-way softmax over classes or K independent per-pixel sigmoid outputs. The paper shows that the independent (sigmoid) outputs outperform softmax; see the short sketch after this list.
  • Taking the class from the box head and having the mask branch predict only a binary mask (object or not), rather than classifying each pixel, makes the model easier to train, since the class information comes from the other branches.
  • Using an FCN (fully convolutional network) for the segmentation task gives a decent boost in accuracy, as expected: conv layers predict image masks much better than fully connected layers.
  • Using ROIAlign in place of ROIPool increases accuracy by a large margin.
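To make the first ablation concrete, here is a tiny sketch (made-up logits) contrasting the two output types:

import torch

logits = torch.randn(2, 21, 28, 28)     # 2 ROIs, K = 21 mask channels

# Multinomial masks: a softmax across the K channels forces the
# classes to compete for every pixel.
multinomial = logits.softmax(dim=1)

# Independent masks: a sigmoid per channel scores each class on its
# own; at training time only the ground-truth class channel is penalized.
independent = torch.sigmoid(logits)

print(multinomial[0, :, 0, 0].sum())    # ~1.0: classes share the pixel
print(independent[0, :, 0, 0].sum())    # unconstrained: no competition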
Hope this Instance Segmentation using Deep Learning tutorial gave you a good idea of how to perform instance segmentation using deep learning.

References

[1] He, Kaiming, Georgia Gkioxari, Piotr Dollár and Ross B. Girshick. “Mask R-CNN.” *2017 IEEE International Conference on Computer Vision (ICCV)* (2017): 2980-2988.
[2] Ren, Shaoqing, Kaiming He, Ross B. Girshick and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” *IEEE Transactions on Pattern Analysis and Machine Intelligence* 39 (2015): 1137-1149.
[3] "Faster R-CNN, PyTorch", https://github.com/longcw/faster_rcnn_pytorch
[4] "Mask R-CNN, PyTorch", https://github.com/soeaver/Pytorch_Mask_RCNN
[5] Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014).
[6] "Mask R-CNN, Keras", https://github.com/matterport/Mask_RCNN


sri krishna

Blog author
R and D at Paralleldots. Interests: DL, CV, Multimodality, Medical Imaging, ML.
