ShelfNet for Real-time Semantic Segmentation

  • 2018-11-27 20:57:48
  • Juntang Zhuang, Junlin Yang

Abstract

In this project, we present ShelfNet, a lightweight convolutional neural network for accurate real-time semantic segmentation. Different from the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder branch pairs with skip connections at each spatial level, which looks like a shelf with multiple columns. The shelf-shaped structure provides multiple paths for information flow and improves segmentation accuracy. Inspired by the success of recurrent convolutional neural networks, we use modified residual blocks where two convolutional layers share weights. The shared-weight block enables efficient feature extraction and model size reduction. We tested ShelfNet with ResNet50 and ResNet101 as the backbone respectively: they achieved 59 FPS and 42 FPS respectively on a GTX 1080Ti GPU with a 512×512 input image. ShelfNet achieved high accuracy: on PASCAL VOC 2012 test set, it achieved 84.2% mIoU with ResNet101 backbone and 82.8% mIoU with ResNet50 backbone; it achieved 75.8% mIoU with ResNet50 backbone on Cityscapes dataset. ShelfNet achieved both higher mIoU and faster inference speed compared with state-of-the-art real-time semantic segmentation models. We provide the implementation https://github.com/juntang-zhuang/ShelfNet.

 


Juntang Zhuang
Biomedical Engineering, Yale University
   Junlin Yang
Biomedical Engineering, Yale University

1 Introduction

Semantic segmentation is key to image understanding [8, 26] and is related to other tasks such as scene parsing, object detection and instance segmentation [21, 46]. The task is to assign each pixel a unique class label, so it can be viewed as a dense classification problem. Deep learning has achieved state-of-the-art performance in many computer vision tasks, such as image classification, semantic segmentation, object recognition, motion tracking and image captioning [18, 7, 15]. Recently, many convolutional neural networks (CNNs) have achieved remarkable results on semantic segmentation tasks, and the success of deep learning models in semantic segmentation has benefited a wide range of related fields, including medical image analysis and diagnosis [34], remote sensing [42], visual tracking and video surveillance [37] and autonomous driving [38].

Figure 1: Results of our method on PASCAL VOC validation dataset. Columns from left to right represent: input images, ground truth annotations, predictions from ShelfNet with ResNet50 backbone, predictions from ShelfNet with ResNet101 backbone. Both models are trained with MS COCO and PASCAL augmented dataset. Our method achieves both high accuracy and fast inference speed for real-time semantic segmentation.
Figure 2: Structure and modules of ShelfNet. (a) Structure of ShelfNet. A-D represent different spatial levels. Columns 1-4 represent different branches: 1 and 3 are called “encoder” (down-sample) branches; 2 and 4 are called “decoder” (up-sample) branches. Outputs from the ResNet backbone have 256, 512, 1024 and 2048 channels at levels A-D respectively; other tensors (of branches 1-4) at levels A-D have 64, 128, 256 and 512 channels respectively. The spatial sizes of output tensors are 1/4, 1/8, 1/16 and 1/32 of the input size at levels A-D respectively. The final output is linearly up-sampled to match the size of the input image. (b) Shared-weights residual block. Two conv layers share weights, and two batch-normalization layers have different weights.

However, the success of most deep learning models for semantic segmentation comes at the price of a heavy computational burden. Training CNNs on large datasets such as PASCAL VOC [8], Cityscapes [5] and ADE20K [46] typically takes several days on a single GPU, and the running time during the test phase is usually hundreds of milliseconds (ms) or more, which hinders their application to real-time processing tasks. In this project, we focus on the real-time semantic segmentation problem, and propose a method that achieves both fast running speed and high segmentation accuracy.

There has been much research on removing redundancy in deep neural networks for faster speed, such as pruning [10, 11, 14] and distillation [29, 17, 32]. However, most of these methods require pre-training a large model on a large dataset; furthermore, the resulting running speed is typically insufficient for real-time semantic segmentation. Another way to get faster running speed is to reduce the number of channels in the model, but this typically yields lower accuracy. Therefore, we aim to propose a new CNN architecture for accurate real-time segmentation.

Most state-of-the-art semantic segmentation models belong to the family of “encoder-decoder” structures, where the image is progressively down-sampled and then up-sampled. The up-sample process may or may not use skip connections from the down-sample process: FCN [24] and DeepLab [3] have an encoder-decoder structure without skip connections; U-Net [33] and DeepLab v3 [4] fuse multi-scale features from both the down-sample and up-sample processes to utilize both high-level and low-level features.

However, there has been little effort on networks that are different from a standard encoder-decoder structure. In this project, we propose ShelfNet with a structure like a multi-column shelf instead of one encoder-decoder pair, and demonstrate its superior performance in real-time semantic segmentation. We argue that the special structure of ShelfNet enables efficient information flow, and demonstrate its high accuracy and fast running speed on PASCAL VOC, PASCAL Context and Cityscapes datasets. Our main contributions are listed as follows:

1. We propose a multi-path convolutional neural network (ShelfNet, Fig. 2), which has a structure in the shape of a shelf with multiple columns. Different from the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder pairs, with skip connections at each spatial resolution level. The unique structure of ShelfNet greatly increases the number of paths from input to output, and improves information flow in the network.

2. We propose an efficient modification of residual block. Inspired by the success of recurrent convolutional neural network [1], we propose to use shared weights of two convolutional layers in a residual block. The shared-weights design enables more efficient feature extraction and reduces the size of the model. Furthermore, we add a drop-out layer between two convolutional layers in a residual block to avoid overfitting.

3. We validate the performance of ShelfNet on various benchmark datasets, including PASCAL VOC [8], PASCAL Context [26] and Cityscapes [5]. During the test phase on a GTX 1080Ti GPU with a 512×512 input image, ShelfNet achieves 59 FPS and 42 FPS with ResNet50 and ResNet101 backbones respectively. ShelfNet achieves both high running speed and high accuracy: on the PASCAL VOC 2012 test set, ShelfNet achieves a mean intersection over union (mIoU) of 84.2% with the ResNet101 backbone and 82.8% mIoU with the ResNet50 backbone; ShelfNet with the ResNet50 backbone achieves 75.8% mIoU on the Cityscapes test set, and 45.6% mIoU on the PASCAL Context test set.

2 Related Work

2.1 Semantic segmentation

Semantic segmentation has been a hot topic for many years. Before the recent rise of deep learning, early approaches mainly relied on handcrafted features such as HOG [6] and SIFT [25]. The features are then fed into classifiers such as SVM [16] and random forest classifiers. These methods cannot be trained end-to-end, and their performance relies heavily on the design of the handcrafted features. Since the resurgence of deep learning, especially fully convolutional neural networks (FCN) [24], deep learning models have been widely used for semantic segmentation. End-to-end training enables the neural network to learn complicated features automatically without handcrafted features, and to achieve much higher accuracy than traditional semantic segmentation algorithms.

Figure 3: Comparison between conv-pool strategy and dilated convolution. (a) Sizes of output tensors from ResNet. (b) Sizes of output tensors from ResNet with dilated convolution. Pixels colored in red are used for computation. Conv-pool strategy loses spatial resolution because of pooling, but dilated convolution generates a high-resolution tensor without pooling. Dilated convolution improves segmentation accuracy at the cost of a much larger computational burden, and is therefore not suitable for real-time segmentation.
Figure 4: ShelfNet (gray background, the structure is the same as Fig. 2) can be viewed as an ensemble of FCNs. A few examples of information flow paths are marked with different colors. Each path is equivalent to an FCN (except that there are pooling layers in ResNet backbone). The equivalence to an ensemble of FCN enables ShelfNet to perform accurate segmentation with a small neural network.

FCN has an encoder-decoder structure, where the image is gradually spatially down-sampled and then up-sampled to generate a segmentation map. Since the success of FCN, many convolutional neural networks with an encoder-decoder structure have emerged. U-Net [33] is a widely used model for medical image segmentation. It has an encoder-decoder structure with skip connections from the down-sample branch (encoder branch) to the up-sample branch (decoder branch) at different spatial resolutions; this design enables fusion of low-level and high-level features and helps to achieve high accuracy. RefineNet [19] also has an encoder-decoder structure, where the decoder has a special module called “Chained Residual Pooling” to enable multi-path information flow.

Both U-Net and RefineNet use a “convolution-pooling” strategy for the encoder. The pooling layer reduces spatial resolution and harms prediction accuracy. To overcome this problem, Chen et al. proposed DeepLab [3] based on dilated convolution, where the receptive field of the kernel is dilated as shown in Fig. 3. For example, a standard convolutional layer with a kernel size of 3 takes a 3×3 square (spatial) as input to compute a feature, while a dilated convolutional layer with a dilation rate of 2 takes a 5×5 square (spatial) as input, but only uses the pixels at the corners, edge centers, and center of the square (9 pixels in total). Instead of pooling, a dilated CNN gradually increases the dilation rate to increase the size of the receptive field, but does not shrink the size of the output tensor. Therefore, a dilated CNN has a better spatial resolution than the “conv-pool” strategy. For example, a standard ResNet shrinks image size to 1/32 of input size, while a dilated ResNet shrinks image size to 1/4 of input size. Other networks, such as PSPNet [43] and EncNet [41], are also based on dilated CNNs, but additionally consider scene parsing or context information for more accurate segmentation.
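The 3×3 versus 5×5 example above can be checked with a minimal Python sketch. The helper names (`footprint`, `dilated_taps`, `render`) are hypothetical and introduced only for illustration; the sketch assumes an odd kernel size and lists which input pixels a dilated kernel reads, relative to its center.

```python
def footprint(kernel_size, dilation):
    """Side length of the input square covered by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_taps(kernel_size=3, dilation=1):
    """Offsets (dy, dx) from the kernel center that are actually read.

    Assumes an odd kernel size so that the kernel has a center pixel.
    """
    half = kernel_size // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]


def render(kernel_size, dilation):
    """Draw the footprint as text rows: 'x' = pixel used, '.' = pixel skipped."""
    size = footprint(kernel_size, dilation)
    half = size // 2
    taps = set(dilated_taps(kernel_size, dilation))
    return ["".join("x" if (r - half, c - half) in taps else "."
                    for c in range(size))
            for r in range(size)]


if __name__ == "__main__":
    print(footprint(3, 1), footprint(3, 2))   # 3 and 5
    print("\n".join(render(3, 2)))            # corners, edge centers, center
```

With a dilation rate of 2 the footprint grows from 3×3 to 5×5 while the number of pixels read stays at 9, which is why the receptive field grows without extra weights.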

2.2 Real-time semantic segmentation

State-of-the-art semantic segmentation models suffer from a long running time. The success of DeepLab v3, PSPNet and EncNet, which are based on dilated CNNs, comes at the price of a heavy computational burden. Compared with the “conv-pool” strategy, a dilated CNN outputs a tensor with the same number of channels but a much larger spatial size, so the running speed is significantly reduced. Networks based on dilated CNNs are therefore not suitable for real-time semantic segmentation. Other strategies rely on refining the prediction with traditional models such as the conditional random field (CRF), which are also computationally expensive and hard to deploy on a GPU, and hence unsuitable for real-time applications.

There have been several approaches to real-time semantic segmentation that modify a large network into a light-weight version. For example, ICNet [44] is a modification of PSPNet that deals with multiple image scales. ICNet achieves 30 FPS on a 1024×2048 image with a 70.6% mIoU on the Cityscapes test set, but its robustness to low-resolution inputs is not extensively validated. Light-Weight RefineNet [27] is a modification of RefineNet, where the kernel sizes of some convolutional layers are reduced from 3×3 to 1×1. Similar to our strategy, Light-Weight RefineNet also reduces the channel number of outputs from ResNet for faster running speed. On the PASCAL VOC 2012 test set, Light-Weight RefineNet achieves 81.1% mIoU and 55 FPS with ResNet50 backbone, and achieves 82.7% mIoU and 32 FPS with ResNet152 backbone. Other real-time segmentation models achieve high running speed at the cost of accuracy: SegNet [2] achieves 40 FPS on a 360×480 image with 57.0% mIoU on Cityscapes, and ENet [30] achieves 20 FPS on a 1920×1080 image but only achieves 58.3% mIoU on Cityscapes. All models mentioned in this paragraph can be viewed as modifications of the encoder-decoder structure. In this project, we propose a network with multiple encoder-decoder pairs and skip connections at different spatial levels (e.g. A-D in Fig. 2), and demonstrate its superior performance over previous methods both in inference speed and segmentation accuracy.

3 Methods

3.1 Structure of ShelfNet

We propose ShelfNet, a multi-branch convolutional neural network for semantic segmentation, as shown in Fig. 2. Features at different spatial scales are named with letters A to D, and columns are named with numbers 1 to 4. We name columns 1 and 3 encoder branches (down-sample branches), and columns 2 and 4 decoder branches (up-sample branches). We use convolution with a stride of 2 in encoder branches, and transposed convolution with a stride of 2 in decoder branches. The number of channels is doubled and the spatial size is halved from a low level to a higher level (e.g., A to B). We use ResNet as the backbone for ShelfNet in this project. To reduce the number of channels for faster inference speed, we use a 1×1 convolutional layer followed by a batch-normalization layer and a ReLU layer to convert the number of channels from 256, 512, 1024, 2048 (outputs from backbone) into 64, 128, 256, 512 for levels A-D respectively. The output tensor from block A4 goes through a 1×1 convolutional layer and a softmax operation to generate predictions.
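As a worked example of the sizes described above, the following sketch tabulates the tensor shapes at levels A-D for a given input size. It uses only the channel counts and the 1/4 to 1/32 scale factors stated in the text and in Fig. 2; the function name `level_shapes` is hypothetical, and the sketch assumes the input height and width are divisible by 32.

```python
BACKBONE_CHANNELS = {"A": 256, "B": 512, "C": 1024, "D": 2048}  # ResNet outputs
SHELF_CHANNELS = {"A": 64, "B": 128, "C": 256, "D": 512}        # after 1x1 conv
STRIDES = {"A": 4, "B": 8, "C": 16, "D": 32}                    # 1/4 ... 1/32


def level_shapes(height, width):
    """Return {level: (backbone_shape, shelf_shape)} as (C, H, W) tuples.

    Assumes height and width are divisible by 32 (an assumption of this
    sketch, so that every level has an integer spatial size).
    """
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes sizes divisible by 32")
    shapes = {}
    for level, stride in STRIDES.items():
        h, w = height // stride, width // stride
        shapes[level] = ((BACKBONE_CHANNELS[level], h, w),
                         (SHELF_CHANNELS[level], h, w))
    return shapes


if __name__ == "__main__":
    for level, (backbone, shelf) in level_shapes(512, 512).items():
        print(level, "backbone", backbone, "-> shelf", shelf)
```

For the 512×512 input used in the speed measurements, level A holds 64-channel tensors of size 128×128 and level D holds 512-channel tensors of size 16×16.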

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU FPS
FCN [24] 76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4 62.5 46.8 71.8 63.9 76.5 73.9 45.2 72.4 37.4 70.9 55.1 62.2 -
DeepLabv2 [3] 84.4 54.5 81.5 63.6 65.9 85.1 79.1 83.4 30.7 74.1 59.8 79.0 76.1 83.2 80.8 59.7 82.2 50.4 73.1 63.7 71.6 -
CRF-RNN [45] 87.5 39.0 79.7 64.2 68.3 87.6 80.8 84.4 30.4 78.2 60.4 80.5 77.8 83.1 80.6 59.5 82.8 47.8 78.3 67.1 72.0 -
Deconvnet [28] 89.9 39.3 79.7 63.9 68.2 87.4 81.2 86.1 28.5 77.0 62.0 79.0 80.3 83.6 80.2 58.8 83.4 54.3 80.7 65.0 72.5 -
GCRF [36] 85.2 43.9 83.3 65.2 68.3 89.0 82.7 85.3 31.1 79.5 63.3 80.5 79.3 85.5 81.0 60.5 85.5 52.0 77.3 65.1 73.2 -
DPN [23] 87.7 59.4 78.4 64.9 70.3 89.3 83.5 86.1 31.7 79.9 62.6 81.9 80.0 83.5 82.3 60.5 83.2 53.4 77.9 65.0 74.1 -
Piecewise [20] 90.6 37.6 80.0 67.8 74.4 92.0 85.2 86.2 39.1 81.2 58.9 83.8 83.9 84.3 84.8 62.1 83.2 58.2 80.8 72.3 75.3 -
ResNet38 [39] 94.4 72.9 94.9 68.8 78.4 90.6 90.0 92.1 40.1 90.4 71.7 89.9 93.7 91.0 89.1 71.3 90.7 61.3 87.7 78.1 82.5 13
PSPNet [43] 91.8 71.9 94.7 71.2 75.8 95.2 89.9 95.9 39.3 90.7 71.7 90.5 94.5 88.8 89.6 72.8 89.6 64.0 85.1 76.3 82.6 11
EncNet [41] 94.1 69.2 96.3 76.7 86.2 96.3 90.7 94.2 38.8 90.7 73.3 90.0 92.5 88.8 87.9 68.7 92.6 59.0 86.4 73.4 82.9 12
ShelfNet50 (ss) 93.4 62.0 84.1 66.2 72.0 92.0 87.3 89.8 28.3 84.5 68.5 87.0 86.4 86.1 85.2 66.3 88.8 54.0 80.4 71.3 77.5 59
ShelfNet101 (ss) 92.6 64.0 85.4 71.1 76.6 94.3 90.5 94.1 36.0 91.9 71.6 91.3 91.7 89.2 86.1 69.7 92.6 58.1 86.7 73.4 81.1 42
ShelfNet50 94.0 63.2 86.1 68.9 73.3 93.6 87.7 91.5 31.4 87.1 67.9 89.5 88.8 86.2 85.5 69.9 88.5 56.1 82.4 72.3 79.0 59
ShelfNet101 93.6 64.2 86.9 69.7 76.2 93.4 90.5 94.4 37.0 91.7 71.1 91.2 91.5 88.9 86.2 72.7 92.6 58.5 85.8 72.4 81.1 42
Table 1: Results on PASCAL VOC test set without pre-training on COCO. ShelfNet with ResNet50 and ResNet101 as backbone are named ShelfNet50 and ShelfNet101 respectively. We implemented several models and measured the inference speed on a 512×512 input image with a single GTX 1080Ti GPU. Results from single-scale inputs are marked with (ss); all other results are tested with multi-scale inputs with scales in [0.5, 0.75, 1.0, 1.25, 1.5, 1.75].
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mIoU FPS
CRF-FCN [45] 90.4 55.3 88.7 68.4 69.8 88.3 82.4 85.1 32.6 78.5 64.4 79.6 81.9 86.4 81.8 58.6 82.4 53.5 77.4 70.1 74.7 -
Dilation8 [40] 91.7 39.6 87.8 63.1 71.8 89.7 82.9 89.8 37.2 84 63 83.3 89 83.8 85.1 56.8 87.6 56 80.2 64.7 75.3 -
DPN [23] 89 61.6 87.7 66.8 74.7 91.2 84.3 87.6 36.5 86.3 66.1 84.4 87.8 85.6 85.4 63.6 87.3 61.3 79.4 66.4 77.5 -
Piecewise [20] 94.1 40.7 84.1 67.8 75.9 93.4 84.3 88.4 42.5 86.4 64.7 85.4 89 85.8 86 67.5 90.2 63.8 80.9 73 78.0 -
DeepLabv2 [3] 92.6 60.4 91.6 63.4 76.3 95 88.4 92.6 32.7 88.5 67.6 89.6 92.1 87 87.4 63.3 88.3 60 86.8 74.5 79.7 -
RefineNet [19] 95 73.2 93.5 78.1 84.8 95.6 89.8 94.1 43.7 92 77.2 90.8 93.4 88.6 88.1 70.1 92.9 64.3 87.7 78.8 83.4 14
ResNet38 [39] 96.2 75.2 95.4 74.4 81.7 93.7 89.9 92.5 48.2 92 79.9 90.1 95.5 91.8 91.2 73 90.5 65.4 88.7 80.6 84.9 13
PSPNet [43] 95.8 72.7 95 78.9 84.4 94.7 92 95.7 43.1 91 80.3 91.3 96.3 92.3 90.1 71.5 94.4 66.9 88.8 82 85.4 11
DeepLabv3 [4] 96.4 76.6 92.7 77.8 87.6 96.7 90.2 95.4 47.5 93.4 76.3 91.4 97.2 91 92.1 71.3 90.9 68.9 90.8 79.3 85.7 8
EncNet [41] 95.3 76.9 94.2 80.2 85.2 96.5 90.8 96.3 47.9 93.9 80 92.4 96.6 90.5 91.5 70.8 93.6 66.5 87.7 80.8 85.9 12
ShelfNet50 (ss) 95.0 73.6 93.1 71.8 70.8 93.5 87.7 92.6 34.4 92.1 76.6 88.3 94.4 89.2 89.0 70.9 91.0 58.5 86.3 74.6 81.9 59
ShelfNet101 (ss) 94.9 74.2 94.3 74.6 82.8 95.3 90.8 91.9 32.6 88.8 78.5 88.2 93.9 91.8 89.4 69.8 91.5 60.1 88.5 77.4 83.1 42
ShelfNet50 95.62 71.47 94.2 72.4 74.3 94.1 88.4 92.6 35.6 93.9 77.8 88.2 95.5 89.7 88.7 71.3 91.4 61.6 87.9 77.1 82.8 59
ShelfNet101 95.41 73.87 94.9 75.7 83.2 96.3 91.2 93.9 35.3 90.0 79.4 90.2 94.2 92.8 90.1 73.2 92.3 64.5 88.0 77.5 84.2 42
Table 2: Results on PASCAL VOC test set with pre-training on COCO. Results from single-scale inputs are marked with (ss); all other results are tested with multi-scale inputs with scales in [0.5, 0.75, 1.0, 1.25, 1.5, 1.75].
Model RefineNet-101 RefineNet-152 RefineNet-LW-50 RefineNet-LW-101 RefineNet-LW-152 ShelfNet-50 ShelfNet-101
mIoU, % 82.4 83.4 81.1 82.0 82.7 82.8 84.2
FPS 17 14 53 37 29 59 42
Table 3: Results on PASCAL VOC test set. Comparison with state-of-the-art real-time semantic segmentation models.
Figure 5: Example predictions of ShelfNet on PASCAL Context dataset.
Figure 6: Results of ShelfNet on Cityscapes validation dataset. We demonstrate results of models trained with the fine-labeled dataset here.

3.2 ShelfNet as a chain of SegNets

SegNet [2] has a convolutional encoder-decoder structure with skip connections between down-sample and up-sample branches. Here we show that ShelfNet can be viewed as a chain of modified SegNets. Branches 3 and 4 in Fig. 2, taken alone, have a structure similar to SegNet, except that outputs from the down-sample branch and up-sample branch are summed in ShelfNet, but concatenated in SegNet. Ignoring the difference in the backbone, branches 1 and 2 can be viewed as another SegNet. The two sub-SegNets are connected at levels A-C between branches 2 and 3. Similar to the structure in Fig. 2, we can add another pair of down-sample and up-sample branches (denoted as branches 5 and 6) after branch 4, with skip connections between branches 4 and 5, to generate a more complicated ShelfNet.

3.3 ShelfNet as an ensemble of FCNs

ShelfNet can be viewed as an ensemble of FCNs. Veit et al. [35] argued that ResNet behaves like an ensemble of shallow networks, because the residual connections provide multiple paths for efficient information flow. Similarly, ShelfNet provides multiple paths of information flow. For ease of representation, we denote the backbone as column 0 and list a few example paths, as shown in Fig. 4: (1) A0→A1→A2→A3→A4 (blue line in Fig. 4); (2) A0→A1→A2→A3→B3→C3→C4→B4→A4 (green line in Fig. 4); (3) A0→B0→B1→B2→A2→A3→A4 (red line in Fig. 4); (4) A0→B0→C0→D0→D1→D2→C2→B2→B3→C3→C4→B4→A4 (orange line in Fig. 4). Each path can be viewed as a variant of FCN (except that there are pooling layers in the ResNet backbone). Therefore, ShelfNet has the potential to capture more complicated features and produce higher accuracy.

The effective number of FCN paths in ShelfNet is much larger than in SegNet. The total number of paths grows exponentially with the number of encoder-decoder pairs (e.g., columns 1 and 2, and columns 3 and 4, are two pairs) and the number of spatial levels (e.g., A to D in Fig. 2). Not considering the effective paths generated from residual connections in ResNet, for a SegNet with 4 spatial levels (A-D), the total number of FCN paths is 4; for a ShelfNet with the same spatial levels, the total number of FCN paths is 29. The special structure of ShelfNet greatly increases the number of effective FCN paths, thus generating higher segmentation accuracy.

3.4 Shared-weights residual block

Compared with SegNet, the larger effective number of FCN paths comes at the price of extra blocks. To reduce the model size and extract features more efficiently, we propose a modified residual block, as shown in Fig. 2 (b). The two convolutional layers in the same block share the same weights, but the two batch-normalization layers are different. The shared-weights design reuses the convolution weights, and resembles the recurrent convolutional neural network (RCNN) [1]. A drop-out layer is added between the two convolutional layers to avoid overfitting. The shared-weights residual block combines the strengths of skip connections, recurrent convolution and drop-out regularization, and has far fewer parameters than a standard residual block.
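To make the parameter saving concrete, here is a rough count under stated assumptions: 3×3 convolutions without bias, equal input and output channels, and two learnable parameters (scale and shift) per channel in each batch-normalization layer. The kernel size and the absence of bias are assumptions of this sketch rather than details given in the text, and the function names are hypothetical.

```python
def conv_params(channels, kernel=3):
    """Weights of a kernel x kernel conv with `channels` in and out, no bias."""
    return kernel * kernel * channels * channels


def bn_params(channels):
    """Learnable scale and shift of one batch-normalization layer."""
    return 2 * channels


def residual_block_params(channels, shared):
    """Parameters of a two-conv residual block.

    shared=True counts the convolution weights once (both conv layers reuse
    them) but still counts two separate batch-normalization layers.
    """
    num_convs = 1 if shared else 2
    return num_convs * conv_params(channels) + 2 * bn_params(channels)


if __name__ == "__main__":
    for channels in (64, 128, 256, 512):          # levels A-D
        standard = residual_block_params(channels, shared=False)
        shared = residual_block_params(channels, shared=True)
        print(channels, standard, shared, round(shared / standard, 3))
```

Because the convolution weights dominate, sharing them roughly halves the block size: at 64 channels the count drops from 73,984 to 37,120 under these assumptions.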

Model BaseNet mIoU, %
FCN-8s [24] - 37.8
CRF-RNN [45] - 39.3
ParseNet [22] - 40.4
Piecewise [20] - 43.3
DeepLab-v2 [3] Res101-COCO 45.7
RefineNet [19] Res101 47.1
RefineNet [19] Res152 47.3
EncNet [41] - 51.7
ShelfNet Res50 45.6
ShelfNet Res101 48.4
Table 4: Segmentation results on PASCAL-Context dataset
Method road swalk build. wall fence pole tlight sign veg. terrain sky person rider car truck bus train mbike bike mIoU
SegNet [2] - - - - - - - - - - - - - - - - - - - 56.1
ENet [30] - - - - - - - - - - - - - - - - - - - 68.3
ICNet [44] - - - - - - - - - - - - - - - - - - - 69.5
CRF-RNN [45] 96.3 73.9 88.2 47.6 41.3 35.2 49.5 59.7 90.6 66.1 93.5 70.4 34.7 90.1 39.2 57.5 55.4 43.9 54.6 62.5
FCN [24] 97.4 78.4 89.2 34.9 44.2 47.4 60.1 65 91.4 69.3 93.9 77.1 51.4 92.6 35.3 48.6 46.5 51.6 66.8 65.3
SiCNN+CRF [13] 96.3 76.8 88.8 40 45.4 50.1 63.3 69.6 90.6 67.1 92.2 77.6 55.9 90.1 39.2 51.3 44.4 54.4 66.1 66.3
DPN [23] 97.5 78.5 89.5 40.4 45.9 51.1 56.8 65.3 91.5 69.4 94.5 77.5 54.2 92.5 44.5 53.4 49.9 52.1 64.8 66.8
Dilation10 [12] 97.6 79.2 89.9 37.3 47.6 53.2 58.6 65.2 91.8 69.4 93.7 78.9 55 93.3 45.5 53.4 47.7 52.2 66 67.1
LRR [9] 97.7 79.9 90.7 44.4 48.6 58.6 68.2 72 92.5 69.3 94.7 81.6 60 94 43.6 56.8 47.2 54.8 69.7 69.7
DeepLab [3] 97.9 81.3 90.3 48.8 47.4 49.6 57.9 67.3 91.9 69.4 94.2 79.8 59.8 93.7 56.5 67.5 57.5 57.7 68.8 70.4
Piecewise [20] 98 82.6 90.6 44 50.7 51.1 65 71.7 92 72 94.1 81.5 61.1 94.3 61.1 65.1 53.8 61.6 70.6 71.6
PSPNet [43] 98.6 86.2 92.9 50.8 58.8 64 75.6 79 93.4 72.3 95.4 86.5 71.3 95.9 68.2 79.5 73.8 69.5 77.2 78.4
PSPNet+coarse [43] 98.6 86.6 93.2 58.1 63 64.5 75.2 79.2 93.4 72.1 95.1 86.3 71.4 96 73.5 90.4 80.3 69.9 76.9 80.2
ShelfNet50 98.1 82.4 91.4 46.8 52.2 58.3 68.8 74.0 92.9 69.6 95.0 82.4 60.7 94.8 53.6 65.3 56.23 60.3 72.2 72.4
ShelfNet50 (ss) 98.1 81.9 91.2 45.9 50.7 57.5 67.6 73.0 92.7 68.7 94.8 80.5 56.9 94.5 50.6 64.2 54.0 58.8 70.6 71.2
ShelfNet101 98.5 85.3 91.7 49.7 55.2 62.9 72.5 77.6 93.3 72.0 94.7 85.7 69.4 95.1 53.9 61.7 52.6 67.1 74.6 74.4
ShelfNet50 + coarse (ss) 98.4 84.5 92.2 46.4 58.4 62.4 71.9 76.3 93.2 71.1 95.2 84.2 66.1 95.3 57.4 71.6 57.7 65.1 74.4 74.8
ShelfNet50 + coarse 98.5 85.2 92.5 46.9 59.9 63.9 73.6 77.8 93.4 72.4 95.4 85.4 67.9 95.6 58.5 72.0 58.9 67.2 75.9 75.8
Table 5: Results on Cityscapes dataset. Models trained with both fine-labeled and coarse-labeled data are marked with “+coarse”; all other models are trained with the fine-labeled dataset only. Results from single-scale inputs are marked with (ss); all other results are tested with multi-scale inputs with scales in [0.5, 0.75, 1.0, 1.25, 1.5, 1.75].

4 Experiments and results

We carried out extensive experiments to validate the fast inference speed and high accuracy of ShelfNet on several different datasets, including PASCAL VOC 2012, PASCAL Context and Cityscapes. Performance is measured by mean intersection over union (mIoU). The training strategies are slightly different for different tasks; we report them in detail in the following section.
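For reference, mIoU can be sketched as follows. This is a minimal illustration on flattened label lists, not the evaluation code of the benchmark servers; in particular, skipping classes that appear in neither the prediction nor the ground truth is an assumption of this sketch, and the function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two equal-length label sequences.

    For each class c, IoU = |pred == c and target == c| / |pred == c or target == c|.
    Classes absent from both sequences are skipped (assumption of this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1   # pixel counts toward the predicted class ...
            union[t] += 1   # ... and toward the true class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    print(mean_iou(pred, target, num_classes=2))  # class 0: 1/2, class 1: 2/3
```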

4.1 PASCAL VOC 2012

4.1.1 Implementation details

PASCAL VOC 2012 [8] contains 20 object classes with one background class. PASCAL VOC dataset is split into a training set, a validation set and a test set, with 1464, 1449 and 1456 images respectively. We use the augmented PASCAL VOC dataset [13] containing 10582, 1449 and 1456 images for training, validation and test set. MS COCO [21] is also used as extra training data to generate higher accuracy.

All models are implemented with PyTorch [31] 0.4.1. We use a ResNet pretrained on ImageNet [7] as the backbone. We use ResNet with BottleNeck blocks instead of BasicBlock to capture deeper features and generate more accurate results. The learning rate is scheduled in the form $lr = baselr \times (1 - \frac{iter}{total\_iter})^{power}$, and cross-entropy loss is used. The weight decay is set to 1e-4. The model is first trained with the Stochastic Gradient Descent (SGD) optimizer on the MS COCO dataset for 30 epochs with a base learning rate of 0.01, then trained on the PASCAL augmented dataset for 50 epochs with a base learning rate of 0.01, and finally fine-tuned on the original PASCAL VOC dataset for 50 epochs with a base learning rate of 0.001. For data augmentation, the image is randomly flipped, scaled by a random factor between 0.5 and 2, and rotated by a random angle between -10 and 10 degrees. The image is cropped to 512×512 for training and the batch size is set to 12. The results are evaluated on the PASCAL evaluation server. We provide anonymous links to our results: [1] http://host.robots.ox.ac.uk:8080/anonymous/5NMB0K.html [2] http://host.robots.ox.ac.uk:8080/anonymous/LTORA3.html [3] http://host.robots.ox.ac.uk:8080/anonymous/J5UUKK.html [4] http://host.robots.ox.ac.uk:8080/anonymous/GBKDHP.html
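The polynomial learning-rate schedule described above can be sketched in a few lines. The text does not state the value of the exponent, so the default `power=0.9` below is only a commonly used value and an assumption of this sketch; the function name `poly_lr` is likewise hypothetical.

```python
def poly_lr(base_lr, iteration, total_iters, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / total_iters) ** power.

    power=0.9 is an assumed default; the source does not give the exponent.
    """
    if total_iters <= 0:
        raise ValueError("total_iters must be positive")
    if not 0 <= iteration <= total_iters:
        raise ValueError("iteration must lie in [0, total_iters]")
    return base_lr * (1.0 - iteration / total_iters) ** power


if __name__ == "__main__":
    total = 1000
    for it in (0, 250, 500, 750, 1000):
        print(it, poly_lr(0.01, it, total))
```

The rate starts at the base learning rate and decays smoothly to zero at the final iteration; the same function can be evaluated once per iteration and written into the optimizer's parameter groups.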

4.1.2 Results and analysis

Segmentation results are evaluated on the PASCAL evaluation server. Example results are shown in Fig. 1, where results of ShelfNet with ResNet50 and ResNet101 as the backbone are presented. The detailed results are summarized in Table 1 and Table 2. For a fair comparison, we implemented ShelfNet and several state-of-the-art segmentation models with PyTorch and measured their inference speed on a single GTX 1080Ti GPU. ShelfNet with ResNet50 backbone and ResNet101 backbone are named ShelfNet50 and ShelfNet101 for short. When trained only on the augmented PASCAL training set and fine-tuned on the original PASCAL VOC dataset, ShelfNet50 achieves an mIoU of 79.0% and ShelfNet101 achieves an mIoU of 81.1%. When trained on both the MS COCO and PASCAL datasets, ShelfNet50 and ShelfNet101 achieve 82.8% mIoU and 84.2% mIoU respectively. Compared to state-of-the-art semantic segmentation models such as PSPNet [43] and EncNet [41], ShelfNet achieves a comparable mIoU with a 4 to 5 times speed-up during inference (59 FPS for ShelfNet50 and 42 FPS for ShelfNet101, 11 FPS for PSPNet and 12 FPS for EncNet).

Lightweight-RefineNet [27] is based on RefineNet [19], and achieves the highest accuracy with fast inference speed in the literature. Comparisons between our ShelfNet and Lightweight-RefineNet are summarized in Table 3. ShelfNet with a ResNet50 backbone achieves higher accuracy (82.8%) than Lightweight-RefineNet with a ResNet152 backbone (82.7%) and RefineNet with a ResNet101 backbone (82.4%). Compared to RefineNet and Lightweight-RefineNet, the better performance of ShelfNet with a much smaller backbone validates the efficiency of the proposed shelf-like structure in feature extraction. Our ShelfNet with a ResNet101 backbone achieves the highest accuracy (84.2%) among all RefineNet and Lightweight-RefineNet models. Besides the higher accuracy, ShelfNet achieves faster inference speed than Lightweight-RefineNet with the same backbone.

4.2 PASCAL-Context

The PASCAL-Context dataset [26] provides dense labels for the whole image with 59 classes and a background class. There are 4,998 training images and 5,105 test images. The model is trained with the SGD optimizer for 80 epochs with cross-entropy loss. The base learning rate is set to 0.001. No extra training data is used in this experiment, and other hyper-parameters are the same as in section 4.1.

Examples of ShelfNet predictions on the PASCAL-Context test set are shown in Fig. 5. The detailed results are summarized in Table 4. DeepLab-v2 achieves 45.7% mIoU with MS COCO as extra training data, while our ShelfNet achieves 45.6% and 48.4% with ResNet50 and ResNet101 respectively without extra training data. RefineNet achieves 47.3% mIoU at the speed of 29 FPS, while our ShelfNet achieves 45.6% mIoU at 59 FPS with ResNet50 backbone, and 48.4% mIoU at 42 FPS with ResNet101 backbone. With the ResNet101 backbone, ShelfNet has both higher accuracy and faster running speed than RefineNet. EncNet achieves a higher mIoU of 51.7%; this is because EncNet uses dilated convolution and sacrifices inference speed. The inference speed of ShelfNet is 4 to 5 times faster than EncNet, as shown in Table 2. Overall, our ShelfNet achieves high mIoU with fast inference speed.

4.3 Cityscapes

Cityscapes [5] consists of images from 50 cities in different seasons, annotated with 19 categories. It contains 2975, 500 and 1525 fine-labeled images for training, validation and test respectively. More than 20,000 images with coarse annotations are also provided. In the training phase, images are augmented in the same way as in section 4.1 and cropped to 768×768. The model is trained on the coarse-labeled dataset for 30 epochs with a base learning rate of 0.01 and a batch size of 6, then trained on the fine-labeled dataset for 500 epochs with a base learning rate of 0.005. Other training parameters are the same as in section 4.1. All results are evaluated on the Cityscapes evaluation server.

The results are summarized in Fig. 6 and Table 5. ShelfNet with the ResNet50 backbone achieves 72.4% mIoU when trained with the fine-labeled dataset only, and 75.8% mIoU when trained with both the fine-labeled and coarse-labeled datasets. With ResNet101, ShelfNet achieves 74.4% mIoU when trained with the fine-labeled dataset only. ShelfNet achieves the second highest mIoU among the compared models. PSPNet achieves a higher mIoU than ShelfNet; however, ShelfNet is 4 to 5 times faster than PSPNet. PSPNet is therefore not suitable for real-time semantic segmentation, while ShelfNet achieves high mIoU at high speed. Compared with other real-time semantic segmentation models such as ENet [30] and ICNet [44], ShelfNet achieves a more than 6% higher mIoU. Considering the balance between inference speed and segmentation accuracy, ShelfNet performs the best for real-time semantic segmentation.
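For readers less familiar with the metric, the mIoU values reported in this section are per-class intersection-over-union scores averaged over classes. The sketch below computes mIoU from a confusion matrix in plain Python. It illustrates the standard definition, IoU = TP / (TP + FP + FN), and is not the evaluation server's code; the function name `mean_iou` and the choice to skip classes that appear in neither prediction nor ground truth are assumptions.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] is the number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                # class c missed
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp  # wrongly predicted as c
        denom = tp + fp + fn
        if denom > 0:  # assumption: skip classes absent from both labels and predictions
            ious.append(tp / denom)
    if not ious:
        return float("nan")
    return sum(ious) / len(ious)


# Toy two-class example: rows are ground truth, columns are predictions.
toy_confusion = [
    [8, 2],
    [1, 9],
]
print(mean_iou(toy_confusion))
```

In the toy example, class 0 has IoU 8/11 and class 1 has IoU 9/12, so the mean is about 0.739. In practice the confusion matrix is accumulated over all pixels of the evaluation set before the per-class scores are averaged.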

5 Conclusion

We proposed ShelfNet for real-time semantic segmentation, which has multiple pairs of encoder-decoder branches with skip connections between adjacent branches. This structure provides a much larger number of paths for information flow. We validated its high segmentation accuracy and fast running speed on three benchmark datasets. ShelfNet achieves segmentation accuracy comparable to state-of-the-art off-line models at 4 to 5 times faster inference speed. The implementation is available at https://github.com/juntang-zhuang/ShelfNet. We hope that ShelfNet will provide new insight into the shelf-shaped structure, and that our implementation will benefit work on semantic segmentation.

References

  • [1] M. Z. Alom et al. Inception recurrent convolutional neural network for object recognition. 2017.
  • [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
  • [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
  • [4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [7] J. Deng et al. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
  • [9] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, pages 519–534. Springer, 2016.
  • [10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. 2011.
  • [13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015.
  • [14] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
  • [15] K. He et al. Deep residual learning for image recognition. In CVPR, 2016.
  • [16] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
  • [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [18] Y. LeCun et al. Deep learning. Nature, 2015.
  • [19] G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
  • [20] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2016.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [22] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
  • [23] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1385, 2015.
  • [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
  • [27] V. Nekrasov, C. Shen, and I. Reid. Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272, 2018.
  • [28] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
  • [29] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
  • [30] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [33] O. Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [34] D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19:221–248, 2017.
  • [35] A. Veit et al. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, 2016.
  • [36] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellapa. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3233, 2016.
  • [37] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in neural information processing systems, pages 809–817, 2013.
  • [38] B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.
  • [39] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
  • [40] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [41] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [42] L. Zhang, L. Zhang, and B. Du. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, 2016.
  • [43] H. Zhao et al. Pyramid scene parsing network. In CVPR, 2017.
  • [44] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.
  • [45] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015.
  • [46] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017.