ShelfNet for Real-time Semantic Segmentation
In this project, we present ShelfNet, a lightweight convolutional neural network for accurate real-time semantic segmentation. Unlike the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder branch pairs with skip connections at each spatial level, which looks like a shelf with multiple columns. The shelf-shaped structure provides multiple paths for information flow and improves segmentation accuracy. Inspired by the success of recurrent convolutional neural networks, we use modified residual blocks in which two convolutional layers share weights. The shared-weight block enables efficient feature extraction and reduces model size. We tested ShelfNet with ResNet50 and ResNet101 as the backbone: they achieved 59 FPS and 42 FPS respectively on a GTX 1080Ti GPU with a 512×512 input image. ShelfNet achieved high accuracy: on the PASCAL VOC 2012 test set, it achieved 84.2% mIoU with a ResNet101 backbone and 82.8% mIoU with a ResNet50 backbone; it achieved 75.8% mIoU with a ResNet50 backbone on the Cityscapes dataset. ShelfNet achieved both higher mIoU and faster inference speed than state-of-the-art real-time semantic segmentation models. We provide the implementation at https://github.com/juntang-zhuang/ShelfNet.
Semantic segmentation is key to image understanding [8, 26], and is related to other tasks such as scene parsing, object detection and instance segmentation [21, 46]. The task of semantic segmentation is to assign each pixel a unique class label, and can be viewed as a dense classification problem. Deep learning has achieved state-of-the-art performance in many computer vision tasks, such as image classification, semantic segmentation, object recognition, motion tracking and image captioning [18, 7, 15]. Recently, many convolutional neural networks (CNNs) have achieved remarkable results on semantic segmentation tasks, and the success of deep learning models in semantic segmentation has benefited a wide range of related fields, including medical image analysis and diagnosis, remote sensing, visual tracking and video surveillance, and autonomous driving.
However, the success of most deep learning models for semantic segmentation comes at the price of a heavy computational burden. Training a CNN on a large dataset such as PASCAL VOC, Cityscapes or ADE20K typically takes several days on a single GPU, and the running time during the test phase is usually hundreds of milliseconds (ms) or more, which hinders their application in real-time processing tasks. In this project, we focus on the real-time semantic segmentation problem, and propose a method that achieves both fast running speed and high segmentation accuracy.
There has been much research on removing redundancy in deep neural networks for faster speed, such as pruning [10, 11, 14] and distillation [29, 17, 32]. However, most of these methods require pre-training a large model on a large dataset; furthermore, the resulting running speed is typically insufficient for real-time semantic segmentation. Another way to get faster running speed is to reduce the number of channels in the model; however, this typically yields lower accuracy. Therefore, we aim to propose a new CNN architecture for accurate real-time segmentation.
Most state-of-the-art semantic segmentation models belong to the family of “encoder-decoder” structures, where the image is progressively down-sampled and then up-sampled. The up-sample process may or may not use skip connections from the down-sample process: FCN and DeepLab have an encoder-decoder structure without skip connections; U-Net and DeepLab v3 fuse multi-scale features from both the down-sample and up-sample processes to utilize both high-level and low-level features.
However, there has been little effort on networks that are different from a standard encoder-decoder structure. In this project, we propose ShelfNet with a structure like a multi-column shelf instead of one encoder-decoder pair, and demonstrate its superior performance in real-time semantic segmentation. We argue that the special structure of ShelfNet enables efficient information flow, and demonstrate its high accuracy and fast running speed on PASCAL VOC, PASCAL Context and Cityscapes datasets. Our main contributions are listed as follows:
1. We propose a multi-path convolutional neural network (ShelfNet, Fig. 2), which has a structure in the shape of a shelf with multiple columns. Unlike the standard encoder-decoder structure, ShelfNet has multiple encoder-decoder pairs, with skip connections at each spatial resolution level. The unique structure of ShelfNet greatly increases the number of paths from input to output, and improves information flow in the network.
2. We propose an efficient modification of the residual block. Inspired by the success of the recurrent convolutional neural network, we propose to share weights between the two convolutional layers in a residual block. The shared-weights design enables more efficient feature extraction and reduces the size of the model. Furthermore, we add a drop-out layer between the two convolutional layers in a residual block to avoid overfitting.
3. We validate the performance of ShelfNet on various benchmark datasets, including PASCAL VOC, PASCAL Context and Cityscapes. During the test phase on a GTX 1080Ti GPU with a 512×512 input image, ShelfNet achieves 59 FPS and 42 FPS with ResNet50 and ResNet101 backbones respectively. ShelfNet achieves both high running speed and high accuracy: on the PASCAL VOC 2012 test set, ShelfNet achieves a mean intersection over union (mIoU) of 84.2% with a ResNet101 backbone and 82.8% mIoU with a ResNet50 backbone; ShelfNet with a ResNet50 backbone achieves 75.8% mIoU on the Cityscapes test set, and 45.6% mIoU on the PASCAL Context test set.
2 Related Work
2.1 Semantic segmentation
Semantic segmentation has been a hot topic for many years. Before the recent rise of deep learning, early approaches mainly relied on handcrafted features such as HOG and SIFT. The features are then fed into classifiers such as SVMs and random forests. These methods cannot be trained end-to-end, and their performance relies heavily on the design of the handcrafted features. Since the resurgence of deep learning, and especially since the introduction of fully convolutional neural networks (FCN), deep learning models have been widely used for semantic segmentation. End-to-end training enables the neural network to learn complicated features automatically without handcrafted features, and to achieve much higher accuracy than traditional semantic segmentation algorithms.
FCN has an encoder-decoder structure, where the image is gradually down-sampled spatially and then up-sampled to generate a segmentation map. Since the success of FCN, many convolutional neural networks with an encoder-decoder structure have emerged. U-Net is a widely used model for medical image segmentation. It has an encoder-decoder structure with skip connections from the down-sample branch (encoder branch) to the up-sample branch (decoder branch) at different spatial resolutions; this design enables fusion of low-level and high-level features and helps to achieve high accuracy. RefineNet also has an encoder-decoder structure, where the decoder has a special module called “Chained Residual Pooling” to enable multi-path information flow.
Both U-Net and RefineNet use a “convolution-pooling” strategy for the encoder. The pooling layer reduces spatial resolution and harms prediction accuracy. To overcome this problem, Chen et al. proposed DeepLab based on dilated convolution, where the receptive field of the kernel is dilated as shown in Fig. 3. For example, a standard convolutional layer with a kernel size of 3 takes a 3×3 square (spatial) as input to compute a feature, while a dilated convolutional layer with a dilation rate of 2 takes a 5×5 square (spatial) as input, but only uses the pixels at the corners, edge centers, and center of the square (9 pixels in total). Instead of pooling, a dilated CNN gradually increases the dilation rate to enlarge the receptive field without shrinking the output tensor. Therefore, a dilated CNN retains better spatial resolution than the “conv-pool” strategy. For example, a standard ResNet shrinks the image to 1/32 of the input size, while a dilated ResNet shrinks it to 1/8 of the input size. Other networks, such as PSPNet and EncNet, are based on dilated CNNs but additionally consider scene parsing or context information for more accurate segmentation.
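The receptive-field arithmetic of a dilated kernel can be sketched in a few lines of plain Python. This is an illustration only: the helper names `sampled_offsets` and `effective_span` are hypothetical and do not come from the ShelfNet or DeepLab code, and an odd kernel size is assumed.

```python
from typing import List


def sampled_offsets(kernel_size: int, dilation: int) -> List[int]:
    """Offsets (relative to the kernel centre) read along one axis.

    Assumes an odd kernel size, as in the kernel-size-3 example in the text.
    """
    half = kernel_size // 2
    return [dilation * i for i in range(-half, half + 1)]


def effective_span(kernel_size: int, dilation: int) -> int:
    """Side length of the input square spanned by the dilated kernel."""
    return dilation * (kernel_size - 1) + 1


if __name__ == "__main__":
    for d in (1, 2):
        offsets = sampled_offsets(3, d)
        span = effective_span(3, d)
        # The number of pixels actually read stays at 3 * 3 = 9 for any dilation.
        print(f"dilation={d}: offsets={offsets}, span={span}x{span}, "
              f"pixels used={len(offsets) ** 2}")
```

With a dilation rate of 2 the kernel still reads 9 pixels, but they are spread over a wider square, which is how the receptive field grows without pooling.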
2.2 Real-time semantic segmentation
State-of-the-art semantic segmentation models suffer from a long running time. The success of DeepLab v3, PSPNet and EncNet, which are based on dilated CNNs, comes at the price of a heavy computational burden. Compared with the “conv-pool” strategy, a dilated CNN outputs a tensor with the same channel number but a much larger spatial size, so the running speed is significantly reduced. Networks based on dilated CNNs are therefore not suitable for real-time semantic segmentation. Other strategies refine the prediction with traditional models such as the conditional random field (CRF), which are also computationally expensive and hard to deploy on a GPU, and hence unsuitable for real-time applications.
There have been several approaches to real-time semantic segmentation that modify a large network into a light-weight version. For example, ICNet is a modification of PSPNet that deals with multiple image scales. ICNet achieves 30 FPS on a image with a 70.6% mIoU on the Cityscapes test set, but its robustness to low resolution is not extensively validated. Light-Weight RefineNet is a modification of RefineNet, where the kernel sizes of some convolutional layers are reduced from to . Similar to our strategy, Light-Weight RefineNet also reduces the channel number of the outputs from ResNet for faster running speed. On the PASCAL VOC 2012 test set, Light-Weight RefineNet achieves 81.1% mIoU and 55 FPS with a ResNet50 backbone, and 82.7% mIoU and 32 FPS with a ResNet152 backbone. Other real-time segmentation models achieve high running speed at the cost of accuracy: SegNet achieves 40 FPS on a image with 57.0% mIoU on Cityscapes, and ENet achieves 20 FPS on a image but only 58.3% mIoU on Cityscapes. All models mentioned in this paragraph can be viewed as modifications of the encoder-decoder structure. In this project, we propose a network with multiple encoder-decoder pairs and skip connections at different spatial levels (e.g. A-D in Fig. 2), and demonstrate its superior performance over previous methods in both inference speed and segmentation accuracy.
3.1 Structure of ShelfNet
We propose ShelfNet, a multi-branch convolutional neural network for semantic segmentation, as shown in Fig. 2. Features at different spatial scales are named with letters A to D, and columns are named with numbers 1 to 4. We name columns 1 and 3 encoder branches (down-sample branches), and columns 2 and 4 decoder branches (up-sample branches). We use convolution with a stride of 2 in encoder branches, and transposed convolution with a stride of 2 in decoder branches. The number of channels is doubled and the spatial size is halved from a low level to a higher level (e.g., A to B). We use ResNet as the backbone for ShelfNet in this project. To reduce the number of channels for faster inference speed, we use a convolutional layer followed by a batch-normalization layer and a ReLU layer to convert the number of channels from 256, 512, 1024, 2048 (outputs from the backbone) into 64, 128, 256, 512 for levels A-D respectively. The output tensor from block goes through a convolutional layer and a softmax operation to generate predictions.
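As a rough illustration of the channel reduction described above, the following sketch tabulates the feature-map shapes at levels A-D. The channel counts are taken from the text; the per-level strides (4, 8, 16, 32) are an assumption based on the standard ResNet stage layout and are not stated in this section, and the function name `level_shapes` is hypothetical.

```python
BACKBONE_CHANNELS = (256, 512, 1024, 2048)  # backbone outputs, from the text
SHELF_CHANNELS = (64, 128, 256, 512)        # reduced channels for levels A-D, from the text
ASSUMED_STRIDES = (4, 8, 16, 32)            # assumption: standard ResNet stage strides


def level_shapes(height: int, width: int) -> dict:
    """Return (channels, height, width) before and after channel reduction."""
    shapes = {}
    for name, c_in, c_out, stride in zip(
        "ABCD", BACKBONE_CHANNELS, SHELF_CHANNELS, ASSUMED_STRIDES
    ):
        h, w = height // stride, width // stride
        shapes[name] = {"backbone": (c_in, h, w), "shelf": (c_out, h, w)}
    return shapes


if __name__ == "__main__":
    for level, s in level_shapes(512, 512).items():
        print(level, "backbone", s["backbone"], "-> shelf", s["shelf"])
```

Under these assumptions each level carries four times fewer channels than the corresponding backbone output, while the channel count doubles and the spatial size halves from one level to the next.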
3.2 ShelfNet as a chain of SegNets
SegNet has a convolutional encoder-decoder structure with skip connections between the down-sample and up-sample branches. Here we show that ShelfNet can be viewed as a chain of modified SegNets. Taken on their own, branches 3 and 4 in Fig. 2 have a structure similar to SegNet, except that the outputs from the down-sample branch and the up-sample branch are summed in ShelfNet but concatenated in SegNet. Ignoring the difference in the backbone, branches 1 and 2 can be viewed as another SegNet. The two sub-SegNets are connected at levels A-C between branches 2 and 3. Following the structure in Fig. 2, we can add another pair of down-sample and up-sample branches (denoted as branches 5 and 6) after branch 4, with skip connections between branches 4 and 5, to generate a more complicated ShelfNet.
3.3 ShelfNet as an ensemble of FCNs
ShelfNet can be viewed as an ensemble of FCNs. Veit et al. argued that ResNet behaves like an ensemble of shallow networks, because the residual connections provide multiple paths for efficient information flow. Similarly, ShelfNet provides multiple paths of information flow. For ease of representation, we denote the backbone as column 0 and list a few example paths, as shown in Fig. 4: (1) (Blue line in Fig. 4), (2) (Green line in Fig. 4), (3) (Red line in Fig. 4), (4) (Orange line in Fig. 4). Each path can be viewed as a variant of FCN (except that there are pooling layers in the ResNet backbone). Therefore, ShelfNet has the potential to capture more complicated features and produce higher accuracy.
The effective number of FCN paths in ShelfNet is much larger than in SegNet. The total number of paths grows exponentially with the number of encoder-decoder pairs (e.g., columns 1 and 2, and columns 3 and 4, are two pairs) and the number of spatial levels (e.g., A to D in Fig. 2). Not considering the effective paths generated from residual connections in ResNet, for a SegNet with 4 spatial levels (A-D), the total number of FCN paths is 4; for a ShelfNet with the same spatial levels, the total number of FCN paths is 29. The special structure of ShelfNet greatly increases the number of effective FCN paths, which we argue leads to higher segmentation accuracy.
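The path counts quoted above can be reproduced by enumerating paths in a small directed graph. The sketch below is one possible reading of Fig. 2, not the authors' code, and its wiring is an assumption: the backbone and column 1 are merged into the first encoder column, each encoder connects sideways to its paired decoder at every level, and each decoder connects to the next encoder at every level except the deepest (matching the statement in Section 3.2 that the sub-networks are connected at levels A-C). The function name `count_paths` is hypothetical.

```python
from functools import lru_cache


def count_paths(num_pairs: int, num_levels: int) -> int:
    """Count input-to-output paths in a shelf-shaped graph.

    Columns alternate encoder (even index, moves down) and decoder
    (odd index, moves up). Level 0 is the shallowest level (A).
    """
    num_cols = 2 * num_pairs
    sink = (num_cols - 1, 0)  # level A of the last decoder

    def successors(col, level):
        if col % 2 == 0:  # encoder column
            if level + 1 < num_levels:
                yield col, level + 1      # go one level deeper
            yield col + 1, level          # skip connection to the paired decoder
        else:             # decoder column
            if level > 0:
                yield col, level - 1      # go one level shallower
            if col + 1 < num_cols and level < num_levels - 1:
                yield col + 1, level      # connection to the next encoder

    @lru_cache(maxsize=None)
    def paths_from(col, level):
        if (col, level) == sink:
            return 1
        return sum(paths_from(c, l) for c, l in successors(col, level))

    return paths_from(0, 0)


if __name__ == "__main__":
    print("one encoder-decoder pair, 4 levels:", count_paths(1, 4))
    print("two encoder-decoder pairs, 4 levels:", count_paths(2, 4))
```

Under these wiring assumptions, one pair with four levels yields 4 paths and two pairs yield 29, in agreement with the numbers stated in the text.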
3.4 Shared-weights residual block
Compared with SegNet, the larger effective number of FCN paths comes at the price of extra blocks. To reduce the model size and extract features more efficiently, we propose a modified residual block, as shown in Fig. 2 (b). The two convolutional layers in the same block share the same weights, but the two batch normalization layers are separate. The shared-weights design reuses the convolution weights, and has features similar to those of the recurrent convolutional neural network (RCNN). A drop-out layer is added between the two convolutional layers to avoid overfitting. The shared-weights residual block combines the strengths of skip connections, recurrent convolution and drop-out regularization, and has far fewer parameters than a standard residual block.
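The parameter saving from weight sharing can be estimated with simple arithmetic. The sketch below makes several assumptions that are not stated in this section: 3×3 kernels, bias-free convolutions, equal input and output channel counts, and two learnable parameters (scale and shift) per channel in each batch normalization layer. The function names are hypothetical.

```python
def conv_params(channels: int, kernel_size: int = 3) -> int:
    """Weights of a bias-free conv with equal input and output channels."""
    return channels * channels * kernel_size * kernel_size


def bn_params(channels: int) -> int:
    """Learnable scale and shift of one batch normalization layer."""
    return 2 * channels


def residual_block_params(channels: int, shared_weights: bool,
                          kernel_size: int = 3) -> int:
    """Parameters of a two-conv residual block with two separate BN layers.

    With shared weights, the single set of conv weights is applied twice,
    so it is counted once; the two BN layers are always counted separately.
    """
    num_conv_weight_sets = 1 if shared_weights else 2
    return (num_conv_weight_sets * conv_params(channels, kernel_size)
            + 2 * bn_params(channels))


if __name__ == "__main__":
    for c in (64, 128, 256, 512):
        standard = residual_block_params(c, shared_weights=False)
        shared = residual_block_params(c, shared_weights=True)
        print(f"channels={c}: standard={standard}, shared={shared}, "
              f"ratio={shared / standard:.3f}")
```

Because the convolution weights dominate the count, sharing them roughly halves the parameters of each block under these assumptions; drop-out adds no parameters.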
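| Model | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ShelfNet50 + coarse (ss) | 98.4 | 84.5 | 92.2 | 46.4 | 58.4 | 62.4 | 71.9 | 76.3 | 93.2 | 71.1 | 95.2 | 84.2 | 66.1 | 95.3 | 57.4 | 71.6 | 57.7 | 65.1 | 74.4 | 74.8 |
| ShelfNet50 + coarse | 98.5 | 85.2 | 92.5 | 46.9 | 59.9 | 63.9 | 73.6 | 77.8 | 93.4 | 72.4 | 95.4 | 85.4 | 67.9 | 95.6 | 58.5 | 72.0 | 58.9 | 67.2 | 75.9 | 75.8 |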
4 Experiments and results
We carried out extensive experiments to validate the fast inference speed and high accuracy of ShelfNet on several different datasets, including PASCAL VOC 2012, PASCAL Context and Cityscapes. Performance is measured by mean intersection over union (mIoU). The training strategies are slightly different for different tasks; we report them in detail in the following section.
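For reference, mIoU can be computed as in the following minimal sketch, which operates on flattened lists of integer class labels. It illustrates the metric only and is not the evaluation code of the benchmark servers: here classes that appear in neither the prediction nor the ground truth are skipped, and the `ignore_index` convention is an assumption. The function name `mean_iou` is hypothetical.

```python
from typing import Optional, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
             ignore_index: Optional[int] = None) -> float:
    """Mean intersection over union across classes present in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel, excluded from the metric
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    if not ious:
        raise ValueError("no labeled pixels to evaluate")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    prediction = [0, 0, 1, 1, 2, 2]
    ground_truth = [0, 1, 1, 1, 2, 0]
    print(f"mIoU = {mean_iou(prediction, ground_truth, num_classes=3):.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.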
4.1 PASCAL VOC 2012
4.1.1 Implementation details
PASCAL VOC 2012 contains 20 object classes and one background class. The PASCAL VOC dataset is split into a training set, a validation set and a test set, with 1464, 1449 and 1456 images respectively. We use the augmented PASCAL VOC dataset containing 10582, 1449 and 1456 images for the training, validation and test sets. MS COCO is also used as extra training data to achieve higher accuracy.
All models are implemented with PyTorch 0.4.1. We use a ResNet pretrained on ImageNet as the backbone. We use ResNet with the BottleNeck block instead of the BasicBlock to capture deeper features and generate more accurate results. Learning rate is scheduled in the form , and cross-entropy loss is used. The weight decay is set to 1e-4. The model is first trained with the Stochastic Gradient Descent (SGD) optimizer on the MS COCO dataset for 30 epochs with a base learning rate of 0.01, then trained on the augmented PASCAL dataset for 50 epochs with a base learning rate of 0.01, and finally fine-tuned on the original PASCAL VOC dataset for 50 epochs with a base learning rate of 0.001. For data augmentation, the image is randomly flipped, scaled by a factor between 0.5 and 2, and rotated by an angle between -10 and 10 degrees. The image is cropped into size for training and the batch size is set to 12. The results are evaluated on the PASCAL evaluation server. We provide anonymous links to our results in the footnotes. Footnotes: (1) http://host.robots.ox.ac.uk:8080/anonymous/5NMB0K.html (2) http://host.robots.ox.ac.uk:8080/anonymous/LTORA3.html (3) http://host.robots.ox.ac.uk:8080/anonymous/J5UUKK.html (4) http://host.robots.ox.ac.uk:8080/anonymous/GBKDHP.html
4.1.2 Results and analysis
Segmentation results are evaluated on the PASCAL evaluation server. Example results are shown in Fig. 1, where results of ShelfNet with ResNet50 and ResNet101 as the backbone are presented. The detailed results are summarized in Table 1 and Table 2. For a fair comparison, we implemented ShelfNet and several state-of-the-art segmentation models with PyTorch and measured their inference speed on a single GTX 1080Ti GPU. ShelfNet with a ResNet50 backbone and with a ResNet101 backbone are named ShelfNet50 and ShelfNet101 respectively for short. When trained only on the augmented PASCAL training set and fine-tuned on the original PASCAL VOC dataset, ShelfNet50 achieves an mIoU of 79.0% and ShelfNet101 achieves an mIoU of 81.1%. When trained on both the MS COCO and PASCAL datasets, ShelfNet50 and ShelfNet101 achieve 82.8% mIoU and 84.2% mIoU respectively. Compared to state-of-the-art semantic segmentation models such as PSPNet and EncNet, ShelfNet achieves a comparable mIoU with a 4 to 5 times speed-up during inference (59 FPS for ShelfNet50 and 42 FPS for ShelfNet101, versus 11 FPS for PSPNet and 12 FPS for EncNet).
Lightweight-RefineNet is based on RefineNet and, among models with fast inference speed, achieves the highest accuracy in the literature. Comparisons between our ShelfNet and Lightweight-RefineNet are summarized in Table 3. ShelfNet with a ResNet50 backbone achieves higher accuracy (82.8%) than Lightweight-RefineNet with a ResNet152 backbone (82.7%) and RefineNet with a ResNet101 backbone (82.4%). That ShelfNet performs better with a much smaller backbone than RefineNet and Lightweight-RefineNet validates the efficiency of the proposed shelf-like structure in feature extraction. Our ShelfNet with a ResNet101 backbone achieves the highest accuracy (84.2%) among all RefineNet and Lightweight-RefineNet models. Besides the higher accuracy, ShelfNet achieves faster inference speed than Lightweight-RefineNet with the same backbone.
The PASCAL-Context dataset provides dense labels for the whole image with 59 classes and a background class. There are 4,998 training images and 5,105 test images. The model is trained with the SGD optimizer for 80 epochs with cross-entropy loss. The base learning rate is set to 0.001. No extra training data is used in this experiment, and other hyper-parameters are the same as in section 4.1.
Examples of ShelfNet on the PASCAL-Context test set are shown in Fig. 5. The detailed results are summarized in Table 4. DeepLab-v2 achieves 45.7% mIoU with MS COCO as extra training data, while our ShelfNet achieves 45.6% and 48.4% with ResNet50 and ResNet101 respectively without extra training data. RefineNet achieves 47.3% mIoU at a speed of 29 FPS, while our ShelfNet achieves 45.6% mIoU at 59 FPS with a ResNet50 backbone, and 48.4% mIoU at 42 FPS with a ResNet101 backbone. With the ResNet101 backbone, ShelfNet is thus both more accurate and faster than RefineNet; with the ResNet50 backbone, it is about twice as fast at a slightly lower mIoU. EncNet achieves a higher mIoU of 51.7%; this is because EncNet uses dilated convolution and sacrifices inference speed. The inference speed of ShelfNet is 4 to 5 times faster than that of EncNet, as shown in Table 2. Overall, our ShelfNet achieves high mIoU with fast inference speed.
Cityscapes consists of images from 50 cities in different seasons and is annotated with 19 categories. It contains 2975, 500 and 1525 fine-labeled images for training, validation and test respectively. More than 20,000 images with coarse annotations are also provided. In the training phase, images are augmented in the same way as in section 4.1 and cropped into size . The model is trained on the coarse-labeled dataset for 30 epochs with a base learning rate of 0.01 and a batch size of 6, then trained on the fine-labeled dataset for 500 epochs with a base learning rate of 0.005. Other training parameters are the same as in section 4.1. All results are evaluated on the Cityscapes evaluation server.
The results are summarized in Fig. 6 and Table 5. ShelfNet with a ResNet50 backbone achieves 72.4% mIoU when trained with the fine-labeled dataset only, and 75.8% mIoU when trained with both the fine-labeled and coarse-labeled datasets. ShelfNet achieves 74.4% mIoU with ResNet101 when trained with the fine-labeled dataset only. ShelfNet achieves the second highest mIoU. PSPNet achieves a higher mIoU than our ShelfNet; however, ShelfNet is 4 to 5 times faster than PSPNet. Therefore, PSPNet is not suitable for real-time semantic segmentation, while ShelfNet achieves high mIoU at high speed. Compared with other real-time semantic segmentation models such as ENet and ICNet, ShelfNet achieves an mIoU that is more than 6% higher. Therefore, considering the balance between inference speed and segmentation accuracy, ShelfNet performs the best for real-time semantic segmentation.
We proposed ShelfNet for real-time semantic segmentation, which has multiple pairs of encoder-decoder branches with skip connections between adjacent branches. The special structure of ShelfNet provides a much larger number of paths for information flow. We validated the high segmentation accuracy and fast running speed on three benchmark datasets. ShelfNet achieves comparable segmentation accuracy to state-of-the-art off-line models, and a 4 to 5 times faster inference speed. We will publish the implementation after the decision of acceptance for blind review. We hope that ShelfNet will provide a new insight into the shelf-shaped structure, and our implementation will benefit works on semantic segmentation.
- [1] M. Z. Alom et al. Inception recurrent convolutional neural network for object recognition. 2017.
- [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
- [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
- [4] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
- [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
- [6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
- [7] J. Deng et al. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [8] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
- [9] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, pages 519–534. Springer, 2016.
- [10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
- [12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. 2011.
- [13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015.
- [14] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems, pages 164–171, 1993.
- [15] K. He et al. Deep residual learning for image recognition. In CVPR, 2016.
- [16] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
- [17] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [18] Y. LeCun et al. Deep learning. Nature, 2015.
- [19] G. Lin, A. Milan, C. Shen, and I. D. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, volume 1, page 5, 2017.
- [20] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194–3203, 2016.
- [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [22] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
- [23] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1385, 2015.
- [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [25] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
- [26] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891–898, 2014.
- [27] V. Nekrasov, C. Shen, and I. Reid. Light-weight refinenet for real-time semantic segmentation. arXiv preprint arXiv:1810.03272, 2018.
- [28] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
- [29] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
- [30] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
- [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
- [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
- [33] O. Ronneberger et al. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
- [34] D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annual review of biomedical engineering, 19:221–248, 2017.
- [35] A. Veit et al. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, 2016.
- [36] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellapa. Gaussian conditional random field network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3233, 2016.
- [37] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in neural information processing systems, pages 809–817, 2013.
- [38] B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1887–1893. IEEE, 2018.
- [39] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
- [40] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
- [41] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [42] L. Zhang, L. Zhang, and B. Du. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, 2016.
- [43] H. Zhao et al. Pyramid scene parsing network. In CVPR, 2017.
- [44] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.
- [45] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015.
- [46] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 4. IEEE, 2017.