ScarfNet: Multi-scale Features with Deeply Fused and Redistributed Semantics for Enhanced Object Detection

  • 2019-08-01 11:07:17
  • Jin Hyeok Yoo, Seong Hyeon Park, Jun Won Choi

Abstract

Convolutional neural networks (CNNs) have led to significant progress in object detection. To detect objects of various sizes, object detectors often exploit the hierarchy of multi-scale feature maps called the feature pyramid, which is readily obtained from the CNN architecture. However, the performance of these object detectors is limited because the bottom-level feature maps, which pass through fewer convolutional layers, lack the semantic information needed to capture the characteristics of small objects. To address this problem, various methods have been proposed to increase the depth of the bottom-level features used for object detection. While most approaches generate additional features through a top-down pathway with lateral connections, our approach directly fuses multi-scale feature maps using bidirectional long short-term memory (biLSTM) in an effort to generate deeply fused semantics. The resulting semantic information is then redistributed to the individual pyramidal feature at each scale through a channel-wise attention model. We integrate our semantic combining and attentive redistribution feature network (ScarfNet) with baseline object detectors, i.e., Faster R-CNN, single-shot multibox detector (SSD), and RetinaNet. Our experiments show that our method outperforms the existing feature pyramid methods as well as the baseline detectors and achieves state-of-the-art performance on the PASCAL VOC and COCO detection benchmarks.

 


Jin Hyeok Yoo, Seong Hyeon Park and Jun Won Choi
Department of Electrical Engineering, Hanyang University

1 Introduction

Object detection refers to the task of deciding whether there are any instances of objects in an image and returning the location and category of each object [17], [15]. Historically, object detection has been one of the most challenging computer vision problems. Recently, deep learning has led to an unprecedented advance in object detection techniques [15]. A convolutional neural network (CNN) can produce a hierarchy of abstract feature maps through a cascade of convolution operations, each followed by a nonlinear function. Using the CNN as a backbone network, object detectors can effectively infer the location of the bounding box and the category of the instances based on the abstract feature maps. Thus far, various object detection network structures have been proposed in the literature. The CNN-based object detectors are roughly categorized into two groups: two-stage detectors and single-stage detectors. The two-stage detectors detect objects using two separate networks: 1) the region proposal network for finding the bounding boxes containing an object and 2) the object classifier network for identifying the class of the objects. The well-known two-stage detectors include R-CNN [6], Fast R-CNN [7], Faster R-CNN [20], and Mask R-CNN [8]. On the other hand, the single-stage detectors directly estimate the bounding boxes and the object classes from the feature maps in one shot. The single-stage detectors include SSD [16], YOLO [18], YOLOv2 [19], and RetinaNet [13].

The key ingredient of the recent advances in object detection is the CNN’s capability to produce abstract features containing strong semantic cues. The deeper the convolutional layers are, the higher the level of abstraction achieved by the resulting feature maps. As a result, the features produced at the end of the CNN pipeline (called top-level features) contain rich semantics but lack spatial resolution, while the features produced near the input layers (called bottom-level features) lack semantic information but have detailed spatial information. The hierarchy of such multi-scale features constitutes the so-called feature pyramid, which is used to detect objects of different scales in many object detectors (e.g., SSD [16], MS-CNN [2], and RetinaNet [13]). The structure for using such a feature pyramid for object detection is described in Fig. 1 (a). Note that the attributes of large objects tend to be captured by the top-level features, while those of small objects are represented by the shallow bottom-level features.

One limitation of the aforementioned feature pyramid method is the disparity in semantic information between the multi-scale feature maps used for object detection. The bottom-level features are not deep enough to exhibit the high-level semantics underlying the objects and their surroundings. This results in a loss of accuracy in detecting small objects. To address this problem, several approaches have been proposed that attempt to reduce the semantic gap between the different scales. One notable direction is to provide contextual information to the bottom-level features by generating highly semantic features in a top-down pathway with lateral connections. As illustrated in Fig. 1 (c), based on the top-level pyramidal feature obtained from the bottom-up network, additional features are generated with increased depth and resolution. To avoid losing spatial information, lateral connections are used to bring in the low-level bottom-layer features and combine them with the high-level semantic features. Various object detectors including DSSD [5], FPN [12], and StairNet [22] follow this principle, and significant improvement in detection accuracy has been reported.

Our work is motivated by the observation that the current architectures for generating top-down features might not be flexible enough to generate strong semantics for all scales. Thus, we propose a new framework for generating deeply fused semantics for the multi-scale features for enhanced object detection. The proposed feature pyramid method, referred to as semantic combining and attentive redistribution feature network (ScarfNet), fuses the multi-scale feature maps using a recurrent neural network and produces new multi-scale feature maps by redistributing the learned semantics to each level. The structure of our ScarfNet is depicted in Fig. 1 (d). First, we fuse the multi-scale pyramidal features using the bidirectional long short-term memory (biLSTM) [24]. The biLSTM has two advantages in fusing the multi-scale features: the number of required weights is significantly reduced by parameter sharing, and only the relevant semantic information is selectively aggregated through the gating function of the biLSTM. The outputs of the biLSTM are concatenated and distributed through the channel-wise attention model to generate highly semantic features tailored for each pyramid scale. The final multi-scale feature maps for object detection are obtained by concatenating the output of the ScarfNet with the original pyramidal features. Note that our framework can be readily applied to various CNN architectures that require a feature pyramid with strong semantics.

In our experiments, we integrate our ScarfNet into the baseline detectors Faster R-CNN [20], SSD [16], and RetinaNet [13]. Our evaluation conducted on the PASCAL VOC [4] and MS COCO [14] datasets shows that our method offers significant improvement in detection accuracy over the baseline detectors as well as other competitive detectors. Furthermore, the proposed ScarfNet-based RetinaNet achieves state-of-the-art performance on the PASCAL VOC [4] and COCO [14] detection benchmarks. Our code will be publicly available. The contributions of our paper are summarized as follows:

  • We introduce a new deep architecture for closing the semantic gaps between the multi-scale feature maps. The proposed ScarfNet generates the new multi-scale feature maps with the deeply fused and redistributed semantics. This is achieved by using the combination of the biLSTM and channel-wise attention model.

  • We are the first to use the biLSTM to combine the multi-scale features in order to incorporate strong semantics into the feature pyramid. The biLSTM can produce deeply fused semantic information using the recurrent connection over the different pyramid scales. Furthermore, our ScarfNet benefits from the selective information gating mechanism inherent in the biLSTM. Owing to parameter sharing, the overhead introduced by the ScarfNet is small. In addition, our ScarfNet is end-to-end trainable and easy to train.

Figure 1: Structure of several feature pyramid methods: In (a), the feature pyramid obtained from convolutional layers is used in the baseline detectors (e.g., SSD [16]). In (b), the multi-scale features are fused and converted into a single semantic feature map with the highest resolution. (c) shows the structure generating additional features in a unidirectional way through the top-down structure with lateral connections. (d) shows the structure of the proposed ScarfNet, where the multi-scale features are fused in a bidirectional fashion and the learned semantics are propagated back to each scale.

2 Related Work

In this section, we review the basic object detectors and several existing feature pyramid methods for decreasing the semantic gap between the scales.

2.1 CNN-based Object Detectors

Recently, CNNs have brought an order-of-magnitude performance improvement in object detection. Thus far, various CNN-based object detectors have been proposed. The current object detectors can be categorized into two groups: two-stage detectors and single-stage detectors. The two-stage detectors detect objects in two steps: finding the region proposals based on the objectness of the regions, and conducting classification and bounding box regression for the detected region proposals. R-CNN [6] is the first CNN-based detector, where the traditional selective search is employed to find the region proposals and the CNN is applied to the image patch in each region proposal. Fast R-CNN [7] and Faster R-CNN [20] reduced the computation time of R-CNN by using region of interest (ROI) pooling on full-image feature maps and by replacing the selective search with the region proposal network (RPN). The single-stage detectors directly perform classification and box regression based on the feature maps. These detectors compute the confidence score for the object category and the regression results for the candidate boxes while sweeping the feature maps spatially. The well-known single-stage detectors include SSD [16], YOLO [18], and YOLOv2 [19]. Recently, RetinaNet [13] has achieved state-of-the-art performance using ResNet [9] as a backbone together with various recent training tricks. Refer to [15] for a comprehensive review of contemporary object detectors.

2.2 Object Detectors Using Multi-scale Features

Several object detectors including SSD [16] and RetinaNet [13] rely on the hierarchical feature pyramid to detect the objects of various sizes (see Fig. 1 (a)). One issue arising in using the multi-scale features directly produced by the CNN is the gap of the semantic information between them caused by the different depths of the layers passed by the input. Due to the relatively low level of abstraction for bottom-level features, detection accuracy for the small objects is often limited. Fig. 1 (b), (c), and (d) describe several strategies that have been proposed to overcome the aforementioned issue. Fig. 1 (b) depicts the strategy of combining the multi-scale features into the single high resolution feature map with strong semantics. HyperNet [11] and ION [1] improved the performance of the RPN by aggregating the hierarchical features with the appropriate resizing of the feature maps. Fig. 1 (c) shows the strategy of generating highly semantic features through the top-down pathway with lateral connections. Note that the semantic information is brought through top-down connections while the detailed spatial information is delivered through the lateral connections. Several detectors based on this structure include DSSD [5], StairNet [22], TDM [21], FPN [12], and RefineDet [25]. DSSD [5] and StairNet [22] use the deconvolutional layer-based top-down connections for the SSD baseline [16]. TDM [21] employs the top-down structure specified for the RPN of the Faster R-CNN [20]. FPN [12] uses the simplified structure using the 2x upsampling and 1x1 convolution for top-down and lateral connections, respectively. RefineDet [25] employs two-step cascade regression for the top-down connection.

Figure 2: The overall architecture of the proposed ScarfNet: The ScarfNet consists of two modules: ScNet and ArNet. The ScNet aggregates the pyramidal features obtained from the bottom-up CNN pipeline. Then, the ArNet distributes the fused semantics to each pyramid level. The final high-level semantic features are generated by channel-wise concatenation between the output of the ScarfNet and the original pyramidal features. The detailed structures of the matching block and attention block are depicted in the yellow boxes.

3 Proposed Object Detector

In this section, we introduce the details on the proposed ScarfNet architecture.

3.1 Existing Feature Pyramid Methods

The feature pyramid-based object detectors base the decision on the $k$ feature maps across the different pyramid levels in order to detect the various sizes of objects. As shown in Fig. 1 (a), the baseline detectors use the feature map $X_l$ at the $l$th pyramidal level

$$X_l = B_l(X_{l-1}) \tag{1}$$
$$\text{Detection Outputs} = D_l(X_l), \tag{2}$$

where $l = n-k+1, \ldots, n$. Note that $X_{1:n-k}$ ($= [X_1, X_2, \ldots, X_{n-k}]$) is the feature maps produced by the backbone network and $X_{n-k+1:n}$ is the bottom-up features from the subsequent convolutional layers. $B_l(\cdot)$ denotes the operation performed by the $l$th convolutional layer and $D_l(\cdot)$ denotes the detection sub-network that often applies a single 3x3 convolutional layer to produce the output of classification and box regression. Due to the different depths from the input to each pyramidal feature, the shallow bottom-level features suffer from the lack of semantic information.

In order to reduce the semantic gap between different pyramid levels, several works proposed the top-down structure using lateral connections illustrated in Fig. 1 (c). This structure propagates the high-level semantics from the top to the bottom layers with increased resolution, while keeping the high spatial resolution through the lateral connections. The $l$th feature map $X'_l$ generated by this method is expressed as

$$X'_l = L_l(X_l) \oplus T_l(X'_{l+1}) \tag{3}$$
$$\text{Detection Outputs} = D_l(X'_l), \tag{4}$$

where $l = n-k+1, \ldots, n$. Note that $L_l(\cdot)$ is the operation for the $l$th lateral connection and $T_l(\cdot)$ is the operation for the $l$th top-down connection. The operator $\oplus$ represents the combining operation for two feature maps, e.g., channel-wise concatenation or addition. Different methods (e.g., DSSD [5], StairNet [22], TDM [21], FPN [12], and RefineDet [25]) employ slightly different structures for $L_l(\cdot)$ and $T_l(\cdot)$. While these methods raise the abstraction level of the pyramidal features, they still have some limitations. First, since the top-down connection propagates the semantic information in a unidirectional way, the semantics are not evenly distributed to all pyramid levels. As a result, the semantic gap between the pyramidal features still remains. Second, such unidirectional processing of the features has limited capacity to produce rich contextual information for increasing the semantic levels at all scales. In order to address these problems, we develop a new architecture that uses the biLSTM to generate deeply fused semantics through bidirectional connections between all pyramid scales. In the following subsections, we present the details of our design.
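As a rough illustration of the top-down structure in Eq. (3), the following Python sketch merges a three-level pyramid of single-channel maps. It assumes, for illustration only, that the lateral operation is the identity, that the top-down operation is nearest-neighbour 2x upsampling applied to the already merged map of the level above (as in FPN-style designs), and that the combining operator is element-wise addition. The function names `upsample2x`, `add`, and `top_down` are hypothetical and are not taken from any of the cited detectors.

```python
# Toy version of the top-down pathway with lateral connections.
# Assumptions: identity lateral op, nearest-neighbour 2x upsampling as the
# top-down op, element-wise addition as the combining operator.

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add(a, b):
    """Element-wise addition of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(pyramid):
    """pyramid[0] is the largest (bottom) map, pyramid[-1] the smallest (top)."""
    merged = [None] * len(pyramid)
    merged[-1] = pyramid[-1]                       # nothing above the top level
    for l in range(len(pyramid) - 2, -1, -1):      # walk from top to bottom
        merged[l] = add(pyramid[l], upsample2x(merged[l + 1]))
    return merged


pyramid = [
    [[1.0] * 4 for _ in range(4)],   # 4x4 bottom-level map
    [[10.0] * 2 for _ in range(2)],  # 2x2 middle map
    [[100.0]],                       # 1x1 top-level map
]
merged = top_down(pyramid)
```

In this toy example the top-level map is left unchanged while every lower level accumulates information from the levels above it, which mirrors the one-way flow of semantics discussed in the text.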

3.2 ScarfNet: Overall Architecture

Our ScarfNet attempts to resolve the discrepancy in semantic information in two steps: 1) combining the scattered semantic information using the biLSTM and 2) redistributing the fused semantics back to each pyramid level using the channel-wise attention model. The overall architecture of the ScarfNet is depicted in Fig. 2. Taking the $k$ pyramidal features $X_{n-k+1:n}$ as input, the ScarfNet produces the new $l$th pyramidal feature map $X'_l$ as

$$X'_l = \mathrm{ScarfNet}_l(X_{n-k+1:n}) \tag{5}$$
$$X'_l = X_l \oplus \mathrm{ArNet}_l(\mathrm{ScNet}(X_{n-k+1:n})) \tag{6}$$
$$\text{Detection Outputs} = D_l(X'_l), \tag{7}$$

where $l = n-k+1, \ldots, n$. As seen in (6), the ScarfNet consists of two sub-networks: the semantic combining network (ScNet) and the attentive redistribution network (ArNet). First, the ScNet merges the pyramidal features $X_{n-k+1:n}$ through the biLSTM and produces the output features with the fused semantics. Second, the ArNet collects the output features from the biLSTM and applies the channel-wise attention model to produce highly semantic multi-scale features, which are concatenated with the original pyramidal features. Finally, the resulting feature maps are individually processed by the detection sub-network $D_l(\cdot)$ to produce the object detection results.

3.3 Semantic Combining Network (ScNet)

The feature maps $X^f_{n-k+1:n}$ produced by the ScNet are obtained as

$$X^f_{n-k+1:n} = \mathrm{ScNet}(X_{n-k+1:n}), \tag{8}$$

where $X^f_l$ is the output feature map for the $l$th layer. Fig. 3 depicts the detailed structure of the ScNet. The ScNet uniformly fuses the semantics scattered across the different pyramid levels using the biLSTM. The biLSTM can selectively fuse the contextual information underlying the multi-scale features through its gating function.

Figure 3: The structure of the ScNet: The matching block and biLSTM are applied to generate the fused feature map $X^f_l$. Note that the matching block applies bi-linear interpolation and 1x1 convolution to make the spatial and channel dimensions equal for the inputs to biLSTM.

As shown in Fig. 3, the ScNet consists of the matching block and the biLSTM block. The matching block first resizes the pyramidal features $X_{n-k+1:n}$ such that they have the same size as the largest pyramidal feature. Then, it adjusts the channel dimension of the input using a 1x1 convolutional layer. As a result, the matching block produces feature maps of the same spatial and channel dimensions for the biLSTM. Note that the resizing operation is performed by bi-linear interpolation. The biLSTM used in the ScNet follows the structure of [23], which significantly reduces computation by using convolutional layers for the input connection and computing the gating parameters from the result of global average pooling. Specifically, the operations performed by the biLSTM in [23] are summarized as

$$\bar{X}_l = \mathrm{GlobalAveragePooling}(X_l) \tag{9}$$
$$\bar{X}^f_{l-1} = \mathrm{GlobalAveragePooling}(X^f_{l-1}) \tag{10}$$
$$i_l = \sigma(W_{xi} \bar{X}_l + W_{x^f i} \bar{X}^f_{l-1} + b_i) \tag{11}$$
$$f_l = \sigma(W_{xf} \bar{X}_l + W_{x^f f} \bar{X}^f_{l-1} + b_f) \tag{12}$$
$$o_l = \sigma(W_{xo} \bar{X}_l + W_{x^f o} \bar{X}^f_{l-1} + b_o) \tag{13}$$
$$G_l = \tanh(W_{xc} * X_l + W_{x^f c} * X^f_{l-1} + b_c) \tag{14}$$
$$C_l = f_l \circ C_{l-1} + i_l \circ G_l \tag{15}$$
$$X^f_l = o_l \circ \tanh(C_l), \tag{16}$$

where $\circ$ denotes the Hadamard product and $*$ denotes convolution. The state update of the biLSTM is conducted in both the forward and backward directions. Note that we only provide the forward update; the equations for the backward update are similar.
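To make the recurrence over pyramid levels concrete, the sketch below runs Eqs. (9)-(16) on tiny single-channel maps in plain Python. It is a simplified illustration, not the configuration of [23]: every weight is a scalar, the convolutions in Eq. (14) are reduced to 1x1 scalar multiplications, the cell update in Eq. (15) is assumed to use the forget gate of Eq. (12) as in a standard LSTM, and the initial state is assumed to be zero. The names `lstm_pass`, `bilstm`, and the weight keys are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gap(fmap):
    """Global average pooling of a 2-D list down to one scalar."""
    return sum(map(sum, fmap)) / (len(fmap) * len(fmap[0]))


def lstm_pass(levels, w):
    """One directional pass over equally sized single-channel maps."""
    rows, cols = len(levels[0]), len(levels[0][0])
    prev = [[0.0] * cols for _ in range(rows)]  # fused map of the previous level
    cell = [[0.0] * cols for _ in range(rows)]  # cell state of the previous level
    outputs = []
    for x in levels:
        x_bar, p_bar = gap(x), gap(prev)                          # Eqs. (9), (10)
        i = sigmoid(w["xi"] * x_bar + w["pi"] * p_bar + w["bi"])  # Eq. (11)
        f = sigmoid(w["xf"] * x_bar + w["pf"] * p_bar + w["bf"])  # Eq. (12)
        o = sigmoid(w["xo"] * x_bar + w["po"] * p_bar + w["bo"])  # Eq. (13)
        # Eq. (14), with the convolutions reduced to scalar (1x1) weights.
        g = [[math.tanh(w["xc"] * x[r][c] + w["pc"] * prev[r][c] + w["bc"])
              for c in range(cols)] for r in range(rows)]
        # Eq. (15): the forget gate scales the old state, the input gate the candidate.
        cell = [[f * cell[r][c] + i * g[r][c] for c in range(cols)]
                for r in range(rows)]
        # Eq. (16): the gated output is the fused map for this level.
        prev = [[o * math.tanh(cell[r][c]) for c in range(cols)]
                for r in range(rows)]
        outputs.append(prev)
    return outputs


def bilstm(levels, w_fwd, w_bwd):
    """Run forward and backward passes and pair their outputs per level."""
    fwd = lstm_pass(levels, w_fwd)
    bwd = lstm_pass(levels[::-1], w_bwd)[::-1]
    return list(zip(fwd, bwd))


keys = ["xi", "pi", "bi", "xf", "pf", "bf", "xo", "po", "bo", "xc", "pc", "bc"]
weights = {k: 0.5 for k in keys}
levels = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, 0.0], [0.0, 1.0]],
]
fused = bilstm(levels, weights, weights)
```

Because the gates are computed from pooled scalars, each gate here is a single number per level (per channel in a multi-channel setting), while the candidate and cell state keep the full spatial layout. The forward and backward outputs are simply paired per level in this sketch.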

Figure 4: The structure of the ArNet: The ArNet concatenates the fused feature maps $X^f_{n-k+1:n}$ and applies the channel-wise attention. Then, the spatial and channel dimensions of the resulting feature maps are adjusted by the matching block.

| Method | Backbone | Input size | mAP (%) VOC 2007 | mAP (%) VOC 2012 |
|---|---|---|---|---|
| SSD300* [16] (baseline) | VGG-16 | 300×300 | 77.5 | 75.8 |
| SSD512* [16] (baseline) | VGG-16 | 512×512 | 79.8 | 78.5 |
| StairNet [22] | VGG-16 | 300×300 | 78.8 | 76.4 |
| Faster R-CNN [20] | VGG-16 | 1000×600 | 73.2 | 70.4 |
| ION [1] | VGG-16 | 1000×600 | 76.5 | 76.4 |
| SSD321 [5] | ResNet-101 | 321×321 | 77.1 | 75.4 |
| SSD513 [5] | ResNet-101 | 513×513 | 80.6 | 79.4 |
| DSSD321 [5] | ResNet-101 | 321×321 | 78.6 | 76.3 |
| DSSD513 [5] | ResNet-101 | 513×513 | 81.5 | 80.0 |
| R-FCN [3] | ResNet-101 | 1000×600 | 80.5 | 77.6 |
| RetinaNet500 [13] (baseline) | ResNet-101 | 833×500 | 83.0 | - |
| Proposed with SSD300 | VGG-16 | 300×300 | 79.4 | 77.2 |
| Proposed with SSD512 | VGG-16 | 512×512 | 81.6 | 79.8 |
| Proposed with RetinaNet500 | ResNet-101 | 833×500 | 83.5 | - |

Table 1: PASCAL VOC 07/12 detection results: The detection results for VOC 2007 are evaluated on the VOC 2007 test set after training on VOC 2007 trainval and VOC 2012 trainval. Those for VOC 2012 are evaluated on the VOC 2012 test set after training on the VOC 2007 test, VOC 2007 trainval, and VOC 2012 trainval sets.

3.4 Attentive Redistribution Network (ArNet)

The ArNet aims to produce the high-level semantic feature map, which is concatenated with the original pyramidal feature map $X_l$ as

$$X'_l = X_l \oplus \mathrm{ArNet}_l(X^f_{n-k+1:n}), \tag{17}$$

where the operator $\oplus$ denotes channel-wise concatenation. The detailed structure of the ArNet is depicted in Fig. 4. The ArNet concatenates the outputs $X^f_{n-k+1:n}$ of the biLSTM and applies the channel-wise attention to them. The attention weights are obtained by constructing the 1x1 vector using global average pooling [10] and passing it through two fully connected layers followed by the sigmoid function. Note that this channel-wise attention model allows for selective propagation of the semantics to each pyramid level. Once the attention weights are applied, the matching block downsamples the resulting feature maps to the original size of the pyramidal features and applies 1x1 convolution to match the channel dimensions with those of the original pyramidal features. Finally, the output of the matching block is concatenated with the original feature $X_l$ to produce the highly semantic feature $X'_l$.
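The channel-wise attention step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the ArNet itself: the text only specifies global average pooling, two fully connected layers, and a sigmoid, so the ReLU between the two layers, the hidden size of one, and all weight values are assumptions made for the example, and the function name `channel_attention` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(channels, w1, b1, w2, b2):
    """Scale each channel by a learned weight in (0, 1).

    channels: list of C two-dimensional maps.
    w1, b1:   first fully connected layer (hidden x C weights, hidden biases).
    w2, b2:   second fully connected layer (C x hidden weights, C biases).
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]
    # First fully connected layer; the ReLU is an assumption of this sketch.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    # Second fully connected layer followed by the sigmoid.
    att = [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
           for row, b in zip(w2, b2)]
    # Apply one attention weight to every spatial position of its channel.
    scaled = [[[a * v for v in row] for row in ch] for ch, a in zip(channels, att)]
    return att, scaled


channels = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
att, scaled = channel_attention(
    channels,
    w1=[[1.0, 1.0]], b1=[0.0],          # two channels -> one hidden unit
    w2=[[1.0], [-1.0]], b2=[0.0, 0.0],  # one hidden unit -> two channel weights
)
```

With these toy weights the first channel receives a weight close to one and the second a weight close to zero, which is the kind of selective propagation the attention model is meant to provide.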

4 Experiments

In this section, we evaluate the performance of the proposed ScarfNet. We compare our detector with the other methods and conduct the extensive performance analysis to understand the behavior of our architecture.

4.1 Experiment Setup

Our ScarfNet is applied to the baseline object detectors Faster R-CNN [20], SSD [16], and RetinaNet [13]. In the case of Faster R-CNN and RetinaNet, we replace the original FPN part with the feature generation by our ScarfNet. We compare our method with the baseline detectors Faster R-CNN [20], SSD [16], and RetinaNet [13] as well as other competitive algorithms including ION [1], R-FCN [3], DSSD [5], and StairNet [22]. We measure mean average precision (mAP) in % on three widely used object detection benchmark datasets: PASCAL VOC 2007, PASCAL VOC 2012 [4], and MS COCO [14].

| Method | Network | Backbone | Module | Input size | fps | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| two-stage | Faster R-CNN* [20] | ResNeXt-101 | FPN | 833×500 | 15.3 | 37.6 | 59.1 | 40.7 | 19.2 | 41.8 | 52.3 |
| two-stage | Faster R-CNN* [20] | ResNeXt-101 | FPN | 1333×800 | 10.3 | 41.9 | 63.9 | 45.9 | 25.0 | 45.3 | 52.3 |
| two-stage | Scarf R-CNN (ours) | ResNeXt-101 | SCARF | 833×500 | 13.8 | 38.5 | 59.9 | 41.5 | 19.1 | 42.9 | 54.1 |
| two-stage | Scarf R-CNN (ours) | ResNeXt-101 | SCARF | 1333×800 | 8.9 | 42.8 | 64.3 | 47.1 | 26.0 | 45.7 | 52.9 |
| one-stage | SSD513 [5] | ResNet-101 | - | 513×513 | 12.5 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| one-stage | DSSD513 [5] | ResNet-101 | DSSD | 513×513 | 10.0 | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1 |
| one-stage | Scarf SSD513 (ours) | ResNet-101 | SCARF | 513×513 | 11.5 | 34.5 | 54.1 | 36.3 | 15.1 | 36.1 | 51.6 |
| one-stage | RetinaNet [13] | ResNet-101 | FPN | 833×500 | 15.4 | 34.4 | 53.1 | 36.8 | 14.7 | 38.5 | 49.1 |
| one-stage | RetinaNet [13] | ResNeXt-101 | FPN | 1333×800 | 9.3 | 40.8 | 61.1 | 44.1 | 24.1 | 44.2 | 51.2 |
| one-stage | Scarf RetinaNet (ours) | ResNet-101 | SCARF | 833×500 | 13.6 | 35.1 | 53.8 | 37.7 | 15.8 | 38.7 | 49.0 |
| one-stage | Scarf RetinaNet (ours) | ResNeXt-101 | SCARF | 1333×800 | 8.4 | 41.6 | 62.0 | 44.6 | 24.5 | 45.5 | 52.3 |

Table 2: Detection results on MS COCO test-dev dataset: The symbol “*” indicates our re-implemented results. The expression “x×y” means re-scaling of the input image introduced in the original RetinaNet paper.

4.2 Network Configuration

An advantage of our ScarfNet is that it has few hyper-parameters to determine. The spatial dimensions of the feature maps are readily determined from those of the baseline detectors. The channel dimension of the intermediate feature maps is fixed over the pipeline between the two matching blocks in the ScNet and ArNet. Thus, we only need to choose this channel dimension. Based on the empirical results, we set the channel dimension to 256.

4.3 Performance Evaluation

4.3.1 PASCAL VOC Results

Training on PASCAL VOC 2007 Dataset: The object detectors under consideration are trained with the VOC 2007 trainval and the VOC 2012 trainval sets and evaluated with the VOC 2007 test set. When the ScarfNet is combined with the SSD baseline, we train our model over 120k iterations (around 240 epochs). We set the learning rate to $10^{-3}$ for the first 80k iterations, decay the learning rate to $10^{-4}$ for the next 20k iterations, and use the learning rate of $10^{-5}$ for the last 20k iterations. The mini-batch size is set to 32, the momentum for the stochastic gradient descent (SGD) update is set to 0.9, and the weight decay is set to 0.0005. When our method is combined with the RetinaNet baseline, we set the learning rate to $5 \times 10^{-3}$ for the first 60k iterations, decay the learning rate to $5 \times 10^{-4}$ for the next 20k iterations, and use the learning rate of $5 \times 10^{-5}$ for the last 10k iterations. The other parameters are set to the same values except for the weight decay, which is set to 0.0001.

Training on PASCAL VOC 2012 Dataset: The object detectors are trained with the VOC 2007 trainval, the VOC 2007 test, and the VOC 2012 trainval sets and evaluated with the VOC 2012 test set. When our model is combined with the SSD baseline, a total of 200k iterations are run with the same training parameters as in the VOC 2007 case. Note that we use the learning rate of $10^{-3}$ for the first 120k iterations, $10^{-4}$ for the next 40k iterations, and $10^{-5}$ for the rest.

Performance Comparison: Table 1 shows the mAP performance of the object detectors under comparison evaluated on the PASCAL VOC 2007 and 2012 test sets. For both the PASCAL VOC 2007 and 2012 cases, we observe that the semantic features generated by our ScarfNet offer a significant performance gain over the baseline detectors. In the case of PASCAL VOC 2007, the proposed method achieves 1.9% and 1.8% mAP gains over the SSD300 and SSD512 baselines, respectively. The proposed method also outperforms the RetinaNet baseline by 0.5%. Since the RetinaNet baseline employs the top-down structure based on FPN [12], this suggests that the features generated by our method are superior to those generated by the FPN. Our object detector also achieves better performance than the other competing algorithms including DSSD [5], ION [1], and R-FCN [3]. As shown in Table 1, our ScarfNet detector achieves state-of-the-art performance on the PASCAL VOC 2007 dataset. Though the detection accuracy on the PASCAL VOC 2012 dataset is slightly lower than that on PASCAL VOC 2007, the tendency observed for PASCAL VOC 2007 remains. Note that the proposed detector maintains performance gains of 1.4% and 1.3% mAP over the SSD300 and SSD512 baselines, respectively.
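The piecewise-constant learning rate schedules described above can be written down compactly. The following Python sketch encodes the three PASCAL VOC schedules as cumulative iteration boundaries with their rates; the helper name `step_lr` and the convention that a boundary iteration already uses the next rate are assumptions of this sketch, since the text does not state how the boundary is handled.

```python
def step_lr(iteration, boundaries, rates):
    """Return the learning rate for a 0-based iteration index.

    boundaries: cumulative iteration counts at which each phase ends.
    rates:      learning rate used within each phase.
    """
    for bound, rate in zip(boundaries, rates):
        if iteration < bound:
            return rate
    return rates[-1]  # past the last boundary, keep the final rate


# ScarfNet with SSD on VOC 2007: 80k + 20k + 20k = 120k iterations.
SSD_VOC07 = ([80_000, 100_000, 120_000], [1e-3, 1e-4, 1e-5])
# ScarfNet with RetinaNet on VOC 2007: 60k + 20k + 10k = 90k iterations.
RETINANET_VOC07 = ([60_000, 80_000, 90_000], [5e-3, 5e-4, 5e-5])
# ScarfNet with SSD on VOC 2012: 120k + 40k + 40k = 200k iterations.
SSD_VOC12 = ([120_000, 160_000, 200_000], [1e-3, 1e-4, 1e-5])

for it in (0, 80_000, 119_999):
    print(it, step_lr(it, *SSD_VOC07))
```

The same helper covers all three cases because only the phase lengths and base rates differ between them.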

| Group | Method | mAP |
|---|---|---|
| Ablation study | Baseline (SSD) | 77.5 |
| Ablation study | biLSTM | 79.1 |
| Ablation study | biLSTM + channel-wise attention | 79.4 |
| Other fusion strategy (used with channel-wise attention) | 1x1 conv.-based fusion | 78.9 |
| Other fusion strategy (used with channel-wise attention) | uniLSTM | 78.7 |
| Other fusion strategy (used with channel-wise attention) | Top-down structure with lateral connections | 78.6 |

Table 3: Results of ablation study on VOC 2007 test dataset.

| Channel dimension | Addition | Concat. |
|---|---|---|
| 64 | 78.3 | 78.8 |
| 128 | 78.6 | 79.1 |
| 256 | 79.1 | 79.4 |
| 512 | 79.5 | 79.2 |
| 1024 | 79.4 | 79.2 |

Table 4: mAP (%) performance for various combinations of channel dimension and semantic feature generation strategy when evaluated on VOC 2007 test set

4.3.2 COCO Results

Training: The object detectors under comparison are trained with the MS COCO trainval35k split [1] (the union of the 80k images from train and a random 35k subset of the 40k-image val split) and evaluated on MS COCO test-dev. For the training of the proposed structure based on RetinaNet [13], we set the learning rate to $10^{-2}$ for the first 60k iterations, decay the learning rate to $10^{-3}$ for the next 20k iterations, and use the learning rate of $10^{-5}$ for the last 20k iterations. The mini-batch size is set to 16, the momentum is set to 0.9, and the weight decay is set to 0.0001.

Performance comparison: Table 2 provides the detection accuracy of the algorithms tested on the MS COCO dataset. The experiment is conducted on various baseline detectors and feature pyramid modules. We consider the performance comparison for both two-stage and one-stage detectors, and use the FPN [12] as the competing feature pyramid method. With the 1333×800 input, the proposed object detector achieves performance gains over the Faster R-CNN [20] baseline of 0.9%, 0.4%, and 1.2% for AP, AP50, and AP75, respectively. Also, our ScarfNet achieves 34.5% AP with SSD513 and 41.6% AP with RetinaNet (1333×800), which are 1.3% and 0.8% higher than DSSD513 and the RetinaNet baseline, respectively.
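The gains quoted above can be checked directly against Table 2. The short Python sketch below copies the relevant (AP, AP50, AP75) entries from the table and computes the differences; the dictionary keys and the helper name `gain` are illustrative labels, not names used by the authors.

```python
# (AP, AP50, AP75) entries copied from Table 2 (MS COCO test-dev).
TABLE2 = {
    "Faster R-CNN + FPN, 1333x800": (41.9, 63.9, 45.9),
    "Scarf R-CNN, 1333x800": (42.8, 64.3, 47.1),
    "DSSD513": (33.2, 53.3, 35.2),
    "Scarf SSD513": (34.5, 54.1, 36.3),
    "RetinaNet + FPN, 1333x800": (40.8, 61.1, 44.1),
    "Scarf RetinaNet, 1333x800": (41.6, 62.0, 44.6),
}


def gain(ours, base):
    """Per-metric difference in percentage points, rounded to one decimal."""
    return tuple(round(o - b, 1) for o, b in zip(TABLE2[ours], TABLE2[base]))


faster_gain = gain("Scarf R-CNN, 1333x800", "Faster R-CNN + FPN, 1333x800")
ssd_gain = gain("Scarf SSD513", "DSSD513")
retina_gain = gain("Scarf RetinaNet, 1333x800", "RetinaNet + FPN, 1333x800")

print("vs Faster R-CNN:", faster_gain)
print("vs DSSD513:     ", ssd_gain)
print("vs RetinaNet:   ", retina_gain)
```

The Faster R-CNN gains of 0.9, 0.4, and 1.2 points match the 1333×800 rows of the table; the 833×500 rows give a different set of differences.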

Figure 5: Visualization of the feature map: (top row) input image, (middle row) conv4_3 layer feature ($X_1$) from the feature pyramid in the SSD300, (bottom row) conv4_3 layer feature ($X'_1$) generated by the ScarfNet. Since the conv4_3 layer feature map $X_1$ is shallow, it fails to place strong activation properly on the objects. In contrast, the semantic feature generated by our ScarfNet appears to capture the characteristics of the objects well.

4.4 Performance Analysis

4.4.1 Ablation Study

Benefits of biLSTM: It is worth investigating the effectiveness of the biLSTM for fusing the multi-scale features. Table 3 compares our method with different fusion strategies including the 1x1 convolutional layer, the top-down structure, and the unidirectional LSTM. Our biLSTM achieves better performance than the others. This seems to be because the parameter sharing, gating units, and bidirectional processing of the biLSTM effectively control the high-level information to reduce the subtle semantic gap between the hierarchical features.

Network Parameter Search: As mentioned, we need to determine the channel dimension of the intermediate feature maps. We also examine which strategy is better for combining the output of the ScarfNet with the original feature pyramid: element-wise addition or channel-wise concatenation. In Table 4, we evaluate the performance of our detector for various combinations of the channel dimension (64, 128, 256, 512, and 1024) and the feature combining strategy (element-wise addition versus channel-wise concatenation). According to Table 4, the combination of the 512 channel dimension with element-wise addition leads to the best detection accuracy. However, since using 512 channels significantly increases the computational complexity of the entire network, we choose the 256 channel dimension with channel-wise concatenation.
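The selection described above can be reproduced from the numbers in Table 4. The following Python sketch ranks every combination of channel dimension and combining strategy by mAP and reports how far the chosen configuration is from the best one; the variable names are illustrative only.

```python
# mAP (%) on the VOC 2007 test set, copied from Table 4.
TABLE4 = {
    64: {"addition": 78.3, "concat": 78.8},
    128: {"addition": 78.6, "concat": 79.1},
    256: {"addition": 79.1, "concat": 79.4},
    512: {"addition": 79.5, "concat": 79.2},
    1024: {"addition": 79.4, "concat": 79.2},
}

# Rank every (score, channel dimension, strategy) triple, best first.
ranked = sorted(
    ((score, dim, strategy)
     for dim, row in TABLE4.items()
     for strategy, score in row.items()),
    reverse=True,
)
best = ranked[0]

# The configuration adopted in the paper: 256 channels with concatenation.
chosen = (TABLE4[256]["concat"], 256, "concat")
gap_to_best = round(best[0] - chosen[0], 1)

print("best:  ", best)
print("chosen:", chosen, "gap:", gap_to_best)
```

The chosen setting trails the best one by 0.1 mAP while using half the channel dimension, which is the trade-off the text describes.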

4.4.2 Feature Visualization

We investigate the effectiveness of the ScarfNet via feature visualization. Fig. 5 compares the original pyramidal feature map X1 of the largest size (middle row) with the semantic feature map X1 from our ScarfNet (bottom row). To obtain the heat map, we take the channel with the highest average activation in the spatial domain. Because the original feature map X1 lacks semantic cues, it often fails to activate properly on the objects. In contrast, we observe that the feature map X1 generated by the ScarfNet has strong activation over the whole region occupied by the objects, which would lead to an improvement in the overall detection performance.
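The channel selection rule used for the heat maps can be sketched as follows. This is a minimal illustration assuming a channel-first feature map stored as nested lists; the function name most_active_channel is hypothetical and the toy values are not taken from the paper.

```python
def most_active_channel(feature_map):
    """Return (index, channel) of the channel with the highest spatial mean."""
    def spatial_mean(channel):
        values = [v for row in channel for v in row]
        return sum(values) / len(values)

    means = [spatial_mean(ch) for ch in feature_map]
    best = max(range(len(means)), key=means.__getitem__)
    return best, feature_map[best]

# Toy feature map with 3 channels of size 2x2.
feature_map = [
    [[0.1, 0.0], [0.2, 0.1]],  # mean 0.10
    [[0.9, 0.8], [0.7, 1.0]],  # mean 0.85
    [[0.3, 0.4], [0.2, 0.3]],  # mean 0.30
]

best_index, heat_map = most_active_channel(feature_map)
print(best_index)  # index of the channel that would be rendered as the heat map
```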

5 Conclusions

In this paper, we proposed a deep architecture that generates multi-scale features with strong semantics to reliably detect objects of various sizes. Our ScarfNet method transforms the pyramidal features produced by the baseline detector into evenly abstract features. To achieve this goal, the proposed ScarfNet fuses the pyramidal features using the biLSTM and distributes the semantics back to each multi-scale feature. We verified through experiments conducted on the PASCAL VOC and MS COCO datasets that the proposed ScarfNet offers a significant gain in detection performance over the baseline detectors. We also showed that our object detector achieves state-of-the-art performance on the PASCAL VOC and COCO benchmarks.

References

  • [1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2874–2883.
  • [2] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. European Conference on Computer Vision (ECCV), pp. 354–370.
  • [3] J. Dai, Y. Li, K. He, and J. Sun (2016) R-FCN: object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems (NIPS), pp. 379–387.
  • [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (IJCV) 88 (2), pp. 303–338.
  • [5] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] R. Girshick (2015) Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. pp. 2961–2969.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [10] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. pp. 7132–7141.
  • [11] T. Kong, A. Yao, Y. Chen, and F. Sun (2016) HyperNet: towards accurate region proposal generation and joint object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 845–853.
  • [12] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. European Conference on Computer Vision (ECCV), pp. 740–755.
  • [15] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2018) Deep learning for generic object detection: a survey. arXiv preprint arXiv:1809.02165.
  • [16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. European Conference on Computer Vision (ECCV), pp. 21–37.
  • [17] C. P. Papageorgiou, M. Oren, and T. Poggio (1998) A general framework for object detection. pp. 555–562.
  • [18] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
  • [19] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525.
  • [20] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (NIPS), pp. 91–99.
  • [21] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta (2016) Beyond skip connections: top-down modulation for object detection. arXiv preprint arXiv:1612.06851.
  • [22] S. Woo, S. Hwang, and I. S. Kweon (2018) StairNet: top-down semantic aggregation for accurate one shot detection. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1093–1102.
  • [23] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems (NIPS), pp. 802–810.
  • [24] L. Zhang, G. Zhu, L. Mei, P. Shen, S. A. A. Shah, and M. Bennamoun (2018) Attention in convolutional LSTM for gesture recognition. Advances in Neural Information Processing Systems (NIPS), pp. 1957–1966.
  • [25] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Single-shot refinement neural network for object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4203–4212.