Towards Accurate One-Stage Object Detection with AP-Loss

  • 2020-02-13 18:23:50
  • Kean Chen, Jianguo Li, Weiyao Lin, John See, Ji Wang, Lingyu Duan, Zhibo Chen, Changwei He, Junni Zou
  • 0

Abstract

One-stage object detectors are trained by optimizing classification-loss andlocalization-loss simultaneously, with the former suffering much from extremeforeground-background class imbalance issue due to the large number of anchors.This paper alleviates this issue by proposing a novel framework to replace theclassification task in one-stage detectors with a ranking task, and adoptingthe Average-Precision loss (AP-loss) for the ranking problem. Due to itsnon-differentiability and non-convexity, the AP-loss cannot be optimizeddirectly. For this purpose, we develop a novel optimization algorithm, whichseamlessly combines the error-driven update scheme in perceptron learning andbackpropagation algorithm in deep networks. We verify good convergence propertyof the proposed algorithm theoretically and empirically. Experimental resultsdemonstrate notable performance improvement in state-of-the-art one-stagedetectors based on AP-loss over different kinds of classification-losses onvarious benchmarks, without changing the network architectures. Code isavailable at https://github.com/cccorn/AP-loss.

 

Quick Read (beta)

Towards Accurate One-Stage Object Detection with AP-Loss

Kean Chen1, Jianguo Li2, Weiyao Lin1  , John See3, Ji Wang4, Lingyu Duan5, Zhibo Chen4, Changwei He4, Junni Zou1
1Shanghai Jiao Tong University, China, 2 Intel Labs, China,
3 Multimedia University, Malaysia, 4 Tencent YouTu Lab, China, 5 Peking University, China
Corresponding Author, Email: [email protected]
Abstract

One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering much from extreme foreground-background class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the error-driven update scheme in perceptron learning and backpropagation algorithm in deep networks. We verify good convergence property of the proposed algorithm theoretically and empirically. Experimental results demonstrate notable performance improvement in state-of-the-art one-stage detectors based on AP-loss over different kinds of classification-losses on various benchmarks, without changing the network architectures. Code is available at https://github.com/cccorn/AP-loss.

1 Introduction

(a) Acc =0.88
(b) AP =0.33
Figure 1: Dashed red boxes are the ground truth object boxes. The orange filled boxes and other blank boxes are anchors with positive and negative ground truth labels, repectively. (a) shows that the detection performance is poor but the classification accuracy is still high due to large number of true negatives. (b) shows the ranking metric AP can better reflect the actual condition as it does not suffer from the large number of true negatives.
Figure 2: Overall framework of the proposed approach. We replace the classification-task in one-stage detectors with a ranking task, where the ranking procedure produces the primary terms of AP-loss and the corresponding label vector. The optimization algorithm is based on an error-driven learning scheme combined with backpropagation. The localization-task branch is not shown here due to no modification.

Object detection needs to localize and recognize the objects simultaneously from the large backgrounds, which remains challenging due to the imbalance between foreground and background. Deep learning based detection solutions usually adopt a multi-task architecture, which handles classification task and localization task with different loss functions. The classification task aims to recognize the object in a given box, while the localization task aims to predict the precise bounding box of the object. Two-stage detectors [25, 7, 2, 14] first generates a limited number of object box proposals, so that the detection problem can be solved by adopting classification task on those proposals. However, the circumstance is different for one-stage detectors, which need to predict the object class directly from the densely pre-designed candidate boxes. The large number of boxes yield the imbalance between foreground and background which makes the optimization of classification task easily biased and thus impacts the detection performance. It is observed that the classification metric could be very high for a trivial solution which predicts negative label for almost all candidate boxes, while the detection performance is poor. (a) illustrates one such example.

To tackle this issue in one-stage object detectors, some works introduce new classification losses such as balanced loss [23], Focal Loss [15], as well as tailored training method such as Online Hard Example Mining (OHEM) [18, 31]. These losses model each sample (anchor box) independently, and attempt to re-weight the foreground and background samples in classification losses to cater for the imbalance condition; this is done without considering the relationship among different samples. The designed balance weights are hand-crafted hyper-parameters, which do not generalize well across datasets. We argue that the gap between classification task and detection task hinder the performance of one-stage detectors. In this paper, instead of modifying the classification loss, we propose to replace classification task with ranking task in one-stage detectors, where the associated ranking loss explicitly models sample relationship, and is invariant to the ratio of positive and negative samples. As shown in (b), we adopt Average Precision (AP) as our target loss which is inherently more consistent with the evaluation metric for object detection.

However, it is non-trivial to directly optimize the AP-loss due to the non-differentiability and non-decomposability, so that standard gradient descent methods are not amenable for this case. There are three aspects of studies for this issue. First, AP based loss is studied within structured SVM models [37, 19], which restricts in linear SVM model so that the performance is limited. Second, a structured hinge loss [20] is proposed to optimize the upper bound of AP-loss instead of the loss itself. Third, approximate gradient methods [34, 9] are proposed for optimizing the AP-loss, which are less efficient and easy to fall into local optimum even for the case of linear models due to the non-convexity and non-quasiconvexity of the AP-loss. Therefore, it is still an open problem for the optimization of the AP-loss.

In this paper, we address this challenge by replacing the classification task in one-stage detectors with a ranking task, so that we handle the class imbalance problem with a ranking based loss named AP-loss. Furthermore, we propose a novel error-driven learning algorithm to effectively optimize the non-differentiable AP based objective function. More specifically, some extra transformations are added to the score output of one-stage detector to obtain the AP-loss, which includes a linear transformation that transforms the scores to pairwise differences, and a non-linear and non-differentiable “activation function” that transform the pairwise differences to the primary terms of the AP-loss. Then the AP-loss can be obtained by the dot product between the primary terms and the label vector. It is worth noting that the difficulty for using gradient method on the AP-loss lies in passing gradients through the non-differentiable activation function. Inspired by the perceptron learning algorithm [26], we adopt an error-driven learning scheme to directly pass the update signal through the non-differentiable activation function. Different from gradient method, our learning scheme gives each variable an update signal proportional to the error it makes. Then, we adopt the backpropagation algorithm to transfer the update signal to the weights of neural network. We theoretically and experimentally prove that the proposed optimization algorithm does not suffer from the non-differentiability and non-convexity of the objective function. The main contributions of this paper are summarized as below:

  • We propose a novel framework in one-stage object detectors which adopts the ranking loss to handle the class imbalance issue.

  • We propose an error-driven learning algorithm that can efficiently optimize the non-differentiable and non-convex AP-based objective function with both theoretical and experimental verifications.

  • We show notable performance improvement with the proposed method on state-of-the-art one-stage detectors over different kinds of classification-losses without changing the model architecture.

2 Related Work

One-stage detectors: In object detection, the one-stage approaches have relatively simpler architecture and higher efficiency than two-stage approaches. OverFeat [28] is one of the first CNN-based one-stage detectors. Thereafter, different designs of one-stage detectors are proposed, including SSD [18], YOLO [23], DSSD [6] and DSOD [30, 13]. These methods demonstrate good processing efficiency as one-stage detectors, but generally yield lower accuracy than two-stage detectors. Recently, RetinaNet [15] and RefineDet [38] narrow down the performance gap (especially on the challenging COCO benchmark [16]) between one-stage approaches and two-stage approaches with some innovative designs. As commonly known, the performance of one-stage detectors benefits much from densely designed anchors, which introduce extreme imbalance between foreground and background samples. To address this challenge, methods like OHEM [18, 31] and Focal Loss [15] have been proposed to reduce the loss weight for easy samples. However, there are two hurdles that are still open to discussion. Firstly, hand-crafted hyper-parameters for weight balance do not generalize well across datasets. Secondly, the relationship among sample anchors is far from well modeled.

AP as a loss for Object Detection: Average Precision (AP) is widely used as the evaluation metric in many tasks such as object detection [5] and information retrieval [27]. However, AP is far from a good and common choice as an optimization goal in object detection due to its non-differentiability and non-convexity. Some methods have been proposed to optimize the AP-loss in object detection, such as AP-loss in the linear structured SVM model [37, 19], structured hinge loss as upper bound of the AP-loss [20], approximate gradient methods [34, 9], reinforcement learning to fine-tune a pre-trained object detector with AP based metric [22]. Although these methods give valuable results in optimizing the AP-loss, their performances are still limited due to the intrinsic limitations. In details, the proposed approach differs from them in 4 aspects. (1) Our approach can be used for any differentiable linear or non-linear models such as neural networks, while [37, 19] only work for linear SVM model. (2) Our approach directly optimizes the AP-loss, while [20] introduces notable loss gap after relaxation. (3) Our approach dose not approximate the gradient and dose not suffer from the non-convexity of objective function as in [34, 9]. (4) Our approach can train the detectors in an end-to-end way, while [22] cannot.

Perceptron Learning Algorithm: The core of our optimization algorithm is the “error-driven update” which is generalized from the perceptron learning algorithm [26], and helps overcome the difficulty of the non-differentiable objective functions. The perceptron is a simple artificial neuron using the Heaviside step function as the activation function. The learning algorithm was first invented by Frank Rosenblatt [26]. As the Heaviside step function in perceptron is non-differentiable, it is not amenable for gradient method. Instead of using a surrogate loss like cross-entropy, the perceptron learning algorithm employs an error-driven update scheme directly on the weights of neurons. This algorithm is guaranteed to converge in finite steps if the training data is linearly separable. Further works like [11, 1, 35] have studied and improved the stability and robustness of the perceptron learning algorithm.

3 Method

We aim to replace the classification task with AP-loss based ranking task in one-stage detectors such as RetinaNet [15]. Figure 2 shows the two key components of our approach, i.e., the ranking procedure and the error-driven optimization algorithm. Below, we will first present how AP-loss is derived from traditional score output. Then, we will introduce the error-driven optimization algorithm. Finally, we also present the theoretical analyses of the proposed optimization algorithm and outline the training details. Note that all changes are made on the loss part of the classification branch without changing the backbone model and localization branch.

3.1 Ranking Task and AP-Loss

3.1.1 Ranking Task

(a)
(b)
Figure 3: Comparison of label assignments. The dashed red box is the ground truth box with class k. (a) In traditional classification task of one-stage detectors, the anchor is assigned a foreground label k. (b) In our ranking task framework, the anchor replicates K times, and we assign the k-th anchor to label 1, others 0.

In traditional one-stage detectors, given input image I, suppose the pre-defined boxes (also called anchors) set is B, each box biB will be assigned a label ti{-1,0,1,,K} based on ground truth and the IoU strategy [7, 25], where label 1K means the object class ID, label “0” means background and label “-1” means ignored boxes. During training and testing phase, the detector outputs a score-vector (si0,,siK) for each box bi.

In our framework, instead of one box with K+1 dimensional score predictions, we replicate each box bi for K times to obtain bik where k=1,,K, and the k-th box is responsible for the k-th class. Each box bik will be assigned a label tik{-1,0,1} through the same IoU strategy (label -1 for not counted into the ranking loss). Therefore, in the training and testing phase, the detector will predict only one scalar score sik for each box bik. Figure 3 illustrates our label formulation and the difference to traditional case.

The ranking task dictates that every positive boxes should be ranked higher than all the negative boxes w.r.t their scores. Note that AP of our ranking result is computed over the scores from all classes. This is slightly different from the evaluation metric meanAP for object detection systems, which computes AP for each class and obtains the average value. We compute AP this way because the score distribution should be unified for all classes while ranking each class separately cannot achieve this goal.

3.1.2 AP-Loss

For simplicity, we still use B to denote the anchor box set after replication, and bi to denote the i-th anchor box without the replication subscript. Each box bi thus corresponds to one scalar score si and one binary label ti. Some transformations are required to formulate a ranking loss as illustrated in Figure 2. First, the difference transformation transfers the score si to the difference form

i,j,xij=-(s(bi;𝜽)-s(bj;𝜽))=-(si-sj) (1)

where s(bi;𝜽) is a CNN based score function with weights 𝜽 for box bi. The ranking label transformation transfers labels ti to the corresponding pairwise ordering form

i,j,yij=𝟏ti=1,tj=0 (2)

where 𝟏 is a indicator function which equals to 1 only if the subscript condition holds (i.e., ti=1,tj=0), otherwise 0. Then, we define an vector-valued activation function 𝑳() to produce the primary terms of the AP-loss as

Lij(𝒙)=H(xij)1+k𝒫𝒩,kiH(xik)=Lij (3)

where H() is the Heaviside step function:

H(x)={0x<01x0 (4)

A ranking is denoted as proper ranking when there are no two samples scored equally (i.e., ij,sisj). Without loss of generality, we will treat all rankings as a proper ranking by breaking ties arbitrarily. Now, we can formulate the AP-loss AP as

AP=1-AP=1-1|𝒫|i𝒫rank+(i)rank(i)=1-1|𝒫|i𝒫1+j𝒫,jiH(xij)1+j𝒫,jiH(xij)+j𝒩H(xij)=1|𝒫|i𝒫j𝒩Lij=1|𝒫|i,jLijyij=1|𝒫|𝑳(𝒙),𝒚 (5)

where rank(i) and rank+(i) denote the ranking position of score si among all valid samples and positive samples respectively, 𝒫={i|ti=1}, 𝒩={i|ti=0}, |𝒫| is the size of set 𝒫, 𝑳 and 𝒚 are vector form for all Lij and yij respectively, , means dot-product of two input vectors. Note that 𝒙,𝒚,𝑳d, where d=(|𝒫|+|𝒩|)2.

Finally, the optimization problem can be written as:

min𝜽AP(𝜽)=1-AP(𝜽)=1|𝒫|𝑳(𝒙(𝜽)),𝒚 (6)

where 𝜽 denotes the weights of detector model. As the activation function 𝑳() is non-differentiable, a novel optimization/learning scheme is required instead of the standard gradient descent method.

Besides the AP metric, other ranking based metric can also be used to design the ranking loss for our framework. One example is the AUC-loss [12] which measures the area under ROC curve for ranking purpose, and has a slightly different “activation function” as

Lij(𝒙)=H(xij)|𝒩| (7)

As AP is consistent with the evaluation metric of the object detection task, we argue that AP-loss is intuitively more suitable than AUC-loss for this task, and will provide empirical study in our experiments.

3.2 Optimization Algorithm

3.2.1 Error-Driven Update

Recalling the perceptron learning algorithm, the update for input variable is “error-driven”, which means the update is directly derived from the difference between desired output and current output. We adopt this idea and further generalize it to accommodate the case of activation function with vector-valued input and output. Suppose xij is the input and Lij is the current output, the update for xij is thus

Δxij=Lij*-Lij (8)

where Lij* is the desired output. Note that the AP-loss achieves its minimum possible value 0 when each term Lijyij=0. There are two cases. If yij=1, we should set the desired output Lij*=0. If yij=0, we do not care the update and set it to 0, since it does not contribute to the AP-loss. Consequently, the update can be simplified as

Δxij=-Lijyij (9)

3.2.2 Backpropagation

We now have the desired vector-form update Δ𝒙, and then will find an update for model weights Δ𝜽 which will produce most appropriate movement for 𝒙. We use dot-product to measure the similarity of successive movements, and regularize the change of weights (i.e. Δ𝜽) with L2-norm based penalty term. The optimization problem can be written as:

argminΔ𝜽{-Δ𝒙,𝒙(𝜽(n)+Δ𝜽)-𝒙(𝜽(n))+λΔ𝜽22} (10)

where 𝜽(n) denotes the model weights at the n-th step. With that, the first-order expansion of 𝒙(𝜽) is given by:

𝒙(𝜽)=𝒙(𝜽(n))+𝒙(𝜽(n))𝜽(𝜽-𝜽(n))+o(𝜽-𝜽(n)) (11)

where 𝒙(𝜽(n))/𝜽 is the Jacobian matrix of vector-valued function 𝒙(𝜽) at 𝜽(n). Ignoring the high-order infinitesimal, we obtain the step-wise minimization process:

𝜽(n+1)-𝜽(n)=argminΔ𝜽{-Δ𝒙,𝒙(𝜽(n))𝜽Δ𝜽+λΔ𝜽22} (12)

The optimal solution can be obtained by finding the stationary point. Then, the form of optimal Δ𝜽 is consistent with the chain rule of derivative, which means, it can be directly implemented by setting the gradient of xij to -Δxij (c.f. Equation 9) and proceeding with backpropagation. Hence the gradient for score si can be obtained by backward propagating the gradient through the difference transformation:

gi=-j,kΔxjkxjksi=jΔxij-jΔxji=jLjiyji-jLijyij (13)

3.3 Analyses

Convergence: To better understand the characteristics of the AP-loss, we first provide a theoretical analysis on the convergence of the optimization algorithm, which is generalized from the convergence property of the original perceptron learning algorithm.

Proposition 1

The AP-loss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;

(2) the training data is linearly separable.

The proof of this proposition is provided in Appendix-1 of supplementary. Although convergence is somewhat weak due to the need of strong conditions, it is non-trivial since the AP-loss function is not convex or quasiconvex even for the case of linear model and linearly separable data, so that gradient descent based algorithm may still fail to converge on a smoothed AP-loss function even under such strong conditions. One such example is presented in Appendix-2 of supplementary. It means that, under such conditions, our algorithm still optimizes better than the approximate gradient descent algorithm for AP-loss. Furthermore, with some mild modifications, even though the training data is not separable, the accumulated AP-loss can also be bounded proportionally by the best performance of the learning model. More details are presented in Appendix-3 of supplementary.

Consistency: Besides convergence, We observed that the proposed optimization algorithm is inherently consistent with widely used classification-loss functions.

Observation 1

When the activation function L() takes the form of softmax function and loss-augmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on cross-entropy loss and hinge loss respectively.

The detailed analysis of this observation is presented in Appendix-4 of supplementary. We argue that the observed consistency is on the basis of the “error-driven” property. As is known, the gradients of those widely used loss functions are proportional to their prediction errors, where the prediction here refers to the output of activation function. In other words, their activation functions have a nice property: the vector field of prediction errors is conservative, allowing it being the gradient of some surrogate loss function. However, our activation function does not have this property, which makes our optimization not able to express as gradient descent with any surrogate loss function.

3.4 Details of Training Algorithm

{algorithm}

[t] Minibatch training for Interpolated AP \[email protected]@algorithmic[1] \RequireAll scores {si} and corresponding labels {ti} in a minibatch \EnsureGradient of input {gi} \Statei,gi0 \StateMaxPrec0 \State𝒫{iti=1},𝒩{iti=0} \StateOargsort({sii𝒫}) \CommentIndexes of scores sorted in ascending order \ForiO \StateCompute xij=sj-si for all j𝒫𝒩 and Lij for all j𝒩 \CommentAccording to Equation 3 and Equation 14 \StatePrec1-j𝒩Lij \IfPrecMaxPrec \StateMaxPrecPrec \Else\CommentInterpolation \Statej𝒩,LijLij(1-MaxPrec)/(1-Prec) \EndIf\Stategi-j𝒩Lij \CommentAccording to Equation 13 \Statej𝒩,gjgj+Lij \CommentAccording to Equation 13 \EndFor\Statei,gigi/|𝒫| \CommentNormalization

Batch Size AP AP50 AP75
1 52.4 80.2 56.7
2 53.0 81.7 57.8
4 52.8 82.2 58.0
8 53.1 82.3 58.1
(a) Varying batch size
δ AP AP50 AP75
0.25 50.2 80.7 53.6
0.5 51.3 81.6 55.4
1 53.1 82.3 58.1
2 52.8 82.6 57.2
(b) Varying δ for piecewise step function
Interpolated AP AP50 AP75
No 52.6 82.2 57.1
Yes 53.1 82.3 58.1
(c) Interpolated vs. not interpolated
Table 1: Ablation experiments. Models are tested on VOC2007 test set.

Minibatch Training The minibatch training strategy is widely used in deep learning frameworks [8, 18, 15] as it accounts for more stability than the case with batch size equal to 1. The mini-batch training helps our optimization algorithm quite a lot for escaping the so-called “score-shift” scenario. The AP-loss can be computed both from a batch of images and from a single image with multiple anchor boxes. Consider an extreme case: our detector can predict perfect ranking in both image I1 and image I2, but the lowest score in image I1 is even greater than the highest score in image I2. There are “score-shift” between two images so that the detection performance is poor when computing AP-loss per-image. Aggregating scores over images in a mini-batch can avoid such problem, so that the minibatch training is crucial for good convergence and good performance.

Piecewise Step function During early stage of training, scores si are very close to each other (i.e. almost all inputs to Heaviside step function H(x) are near zero), so that a small change of input will cause a big output difference, which destabilizes the updating process. To tackle this issue, we replace H(x) with a piecewise step function:

f(x)={0,x<-δx2δ+0.5,-δxδ1,δ<x (14)

The piecewise step functions with different δ are shown in Figure 4. When δ approaches +0, the piecewise step function approaches the original step function. Note that f() is only different from H() near zero point. We argue that the precise form of the piecewise step function is not crucial. Other monotonic and symmetric smooth functions that only differs from H() near zero point could be equally effective. The choice of δ relates closely to the weight decay hyper-parameter in CNN optimization. Intuitively, parameter δ controls the width of decision boundary between positive and negative samples. Smaller δ enforces a narrower decision boundary, which causes the weights to shrink correspondingly (similar effect to that caused by the weight decay). Further details are presented in the experiments.

Figure 4: Heaviside step function and piecewise step function. (Best viewed in color)

Interpolated AP The interpolated AP [27] is widely adopted by many object detection benchmarks like PASCAL VOC [5] and MS COCO [16]. The common justification for interpolating the precision-recall curve [5] is “to reduce the impact of ’wiggles’ in the precision-recall curve, caused by small variations in the ranking of examples”. Under the same consideration, we adopt the interpolated AP instead of the original version. Specifically, the interpolation is applied on Lij to make the precision at the k-th smallest positive sample monotonically increasing with k where the precision is (1-j𝒩Lij) in which i is the index of the k-th smallest positive sample. It is worth noting that the interpolated AP is a smooth approximation of the actual AP so that it is a practical choice to help to stabilize the gradient and to reduce the impact of ’wiggles’ in the update signals. The details of the interpolated AP based algorithm is summarized in Algorithm 3.4.

4 Experiments

4.1 Experimental Settings

We evaluate the proposed method on the state-of-the-art one-stage detector RetinaNet [15]. The implementation details are the same as in [15] unless explicitly stated. Our experiments are performed on two benchmark datasets: PASCAL VOC [5] and MS COCO [16]. The PASCAL VOC dataset has 20 classes, with VOC2007 containing 9,963 images for train/val/test and VOC2012 containing 11,530 for train/val. The MS COCO dataset has 80 classes, containing 123,287 images for train/val. We implement our codes with the MXNET framework, and conduct experiments on a workstation with two NVidia TitanX GPUs.

PASCAL VOC: When evaluated on the VOC2007 test set, models are trained on the VOC2007 and VOC2012 trainval sets. When evaluated on the VOC2012 test set, models are trained on the VOC2007 and VOC2012 trainval sets plus the VOC2007 test set. Similar to the evaluation metrics used in the MS COCO benchmark, we also report the AP averaged over multiple IoU thresholds of 0.50:0.05:0.95. We set δ=1 in Equation 14. We use ResNet [8] as the backbone model which is pre-trained on the ImageNet-1k classification dataset [4]. At each level of FPN [14], the anchors have 2 sub-octave scales (2k/2, for k1) and 3 aspect ratios [0.5, 1, 2]. We fix the batch normalization layers to be frozen in training phase. We adopt the minibatch training on 2 GPUs with 8 images per GPU. All evaluated models are trained for 160 epochs with an initial learning rate of 0.001 which is then divided by 10 at 110 epochs and again at 140 epochs. Weight decay of 0.0001 and momentum of 0.9 are used. We adopt the same data augmentation strategies as [18], while do not use any data augmentation during testing phase. In training phase, the input image is fixed to 512×512, while in testing phase, we maintain the original aspect ratio and resize the image to ensure the shorter side with 600 pixels. We apply the non-maximum suppression with IoU of 0.5 for each class.

MS COCO: All models are trained on the widely used trainval35k set (80k train images and 35k subset of val images), and tested on minival set (5k subset of val images) or test-dev set. We train the networks for 100 epochs with an initial learning rate of 0.001 which is then divided by 10 at 60 epochs and again at 80 epochs. Other details are similar to that for PASCAL VOC.

4.2 Ablation Study

We first investigate the impact of our design settings of the proposed framework. We fix the ResNet-50 as backbone and conduct several controlled experiments on PASCAL VOC2007 test set (and COCO minival if stated) for this ablation study.

4.2.1 Comparison on Different Parameter Settings

Training Loss PASCAL VOC COCO
AP AP50 AP75 AP AP50 AP75
CE-Loss + OHEM 49.1 81.5 51.5 30.8 50.9 32.6
Focal Loss 51.3 80.9 55.3 33.9 55.0 35.7
AUC-Loss 49.3 79.7 51.8 25.5 44.9 26.0
AP-Loss 53.1 82.3 58.1 35.0 57.2 36.6
Table 2: Comparison through different training losses. Models are tested on VOC2007 test and COCO minival sets. The metric AP is averaged over multiple IoU thresholds of 0.50:0.05:0.95.

Here we study the impact of the practical modifications introduced in Section 3.4. All results are shown in Table 1.

Minibatch Training: First, we study the mini-batch training, and report detector results at different batch-size in (a). It shows that larger batch-size (i.e. 8) outperforms all the other smaller batch-size. This verifies our previous hypothesis that large minibatch training helps to eliminate the “score-shift” from different images, and thus stabilizes the AP-loss through robust gradient calculation. Hence, batch-size=8 is used in our further studies.

Piecewise Step Function: Second, we study the piecewise step function, and report detector performance on the piecewise step function with different δ in (b). As mentioned before, we argue that the choice of δ is trivial and is dependent on other network hyper-parameters such as weight decay. Smaller δ makes the function sharper, which yields unstable training at initial phase. Larger δ makes the function deviate from the properties of the original AP-loss, which also worsens the performance. δ=1 is a good choice we used in our further studies.

Interpolated AP: Third, we study the impact of interpolated AP in our optimization algorithm, and list the results in (c). Marginal benefits are observed for interpolated AP over standard AP, so we use interpolated AP in all the following studies.

4.2.2 Comparison on Different Losses

We evaluate with different losses on RetinaNet [15]. Results are shown in Table 2. We compare traditional classification based losses like focal loss [15] and cross entropy loss (CE-loss) with OHEM [18] to the ranking based losses like AUC-loss and AP-loss. Although focal loss is significantly better than CE-loss with OHEM on COCO dataset, it is interesting that focal-loss does not perform better than CE-loss at AP50 on PASCAL VOC. This is likely because the hyper-parameters of focal loss are designed to suit the imbalance condition on COCO dataset which is not suitable for PASCAL VOC, so that focal loss cannot generalize well to PASCAL VOC without tuning its hyper-parameters. The proposed AP-loss performs much better than all the other losses on both two datasets, which demonstrates its effectiveness and stronger generalization ability on handling the imbalance issue. It is worth noting that AUC-loss performs much worse than AP-loss, which may be due to the fact that AUC has equal penalty for each misordered pair while AP imposes greater penalty for the misordering at higher positions in the predicted ranking. It is obvious that object detection evaluation concerns more on objects with higher confidence, which is why AP provides a better loss measure. Furthermore, an assessment of the detection performance at different training iterations, as shown in (a), outlines the superiority of the AP-loss for snapshot time points.

(a)
(b)
Figure 5: (a) Detection accuracy (mAP) on VOC2007 test set. (b) Convergence curves of different AP-loss optimizations on VOC2007 trainval set. (Best viewed in color)
Figure 6: Some detection examples. Top: Baseline results by RetinaNet with focal loss. Bottom: Our results with AP-loss.
Method Backbone Multi-Scale VOC07 VOC12 COCO
AP50 AP50 AP AP50 AP75 APS APM APL
YOLOv2 [24] DarkNet-19 78.6 73.4 21.6 44.0 19.2 5.0 22.4 35.5
DSOD300 [30] DS/64-192-48-1 77.7 76.3 29.3 47.3 30.6 9.4 31.5 47.0
SSD512 [18] VGG-16 79.8 78.5 28.8 48.5 30.3 - - -
SSD513 [6] ResNet-101 80.6 79.4 31.2 50.4 33.3 10.2 34.5 49.8
DSSD513 [6] ResNet-101 81.5 80.0 33.2 53.3 35.2 13.0 35.4 51.1
DES512 [39] VGG-16 81.7 80.3 32.8 53.2 34.6 13.9 36.0 47.6
RFBNet512 [17] VGG-16 82.2 - 33.8 54.2 35.9 16.2 37.1 47.4
PFPNet-R512 [10] VGG-16 82.3 80.3 35.2 57.6 37.9 18.7 38.6 45.9
RefineDet512 [38] VGG-16 81.8 80.1 33.0 54.5 35.5 16.3 36.3 44.3
RefineDet512 [38] ResNet-101 - - 36.4 57.5 39.5 16.6 39.9 51.4
RetinaNet500 [15] ResNet-101 - - 34.4 53.1 36.8 14.7 38.5 49.1
RetinaNet500+AP-Loss (ours) ResNet-101 83.9 83.1 37.4 58.6 40.5 17.3 40.8 51.9
PFPNet-R512 [10] VGG-16 84.1 83.7 39.4 61.5 42.6 25.3 42.3 48.8
RefineDet512 [38] VGG-16 83.8 83.5 37.6 58.7 40.8 22.7 40.3 48.3
RefineDet512 [38] ResNet-101 - - 41.8 62.9 45.7 25.6 45.1 54.1
RetinaNet500+AP-Loss (ours) ResNet-101 84.9 84.5 42.1 63.5 46.4 25.6 45.0 53.9
Table 3: Detection results on VOC2007 test, VOC 2012 test and COCO test-dev sets.

4.2.3 Comparison on Different Optimization Methods

We also compare our optimization method with the approximate gradient method [34, 9] and structured hinge loss method [20]. Both [34, 9] approximate the AP-loss with a smooth expectation and envelope function, respectively. Following their guidance, we replace the step function in AP-loss with a sigmoid function to constrain the gradient to neither zero nor undefined, while still keep the shape similar to the original function. Same as [9], we adopt the log space objective function, i.e. log(AP+ϵ), to allow the model to quickly escape from the initial state. We train the detector on VOC2007 trainval set and turn off the bounding box regression task. The convergence curves shown in (b) reveal some essential observations. It can be seen that AP-loss optimized by approximate gradient method does not even converge, likely because its non-convexity and non-quasiconvexity fail on a direct gradient descent method. Meanwhile, AP-loss optimized by the structured hinge loss method [20] converges slowly and stabilizes near 0.8, which is significantly worse than the asymptotic limit of AP-loss optimized by our error-driven update scheme. We believe that this method does not optimize the AP-loss directly but rather an upper bound of it, which is controlled by a discriminant function [20]. In ranking task, this discriminant function is hand-picked and has an AUC-like form, which may cause variability in optimization.

4.3 Benchmark Results

With the settings selected in ablation study, we conduct experiments to compare the proposed detector to state-of-the-art one-stage detectors on three widely used benchmark, i.e. VOC2007 test, VOC2012 test and COCO test-dev sets. We use ResNet-101 as backbone networks instead of ResNet-50 in ablation study. We use an image scale of 500 pixels for testing. Table 3 lists the benchmark results comparing to recent state-of-the-art one-stage detectors such as SSD [18], YOLOv2 [24], DSSD [6], DSOD [30], DES [39], RetinaNet [15], RefineDet [38], PFPNet [10], RFBNet [17]. Compared to the baseline model RetinaNet500 [15], our detector achieves a 3.0% improvement (37.4% vs. 34.4%) on COCO dataset. Figure 6 illustrates some detection results by the RetinaNet with focal loss and our AP-loss. Besides, our detector outperforms all the other methods for both single-scale and multi-scale tests in all the three benchmarks. We should emphasize that this verifies the great effectiveness of our AP-loss since our detector achieves such a great performance gain just by replacing the focal-loss with our AP-loss in RetinaNet without whistle and bells, without using advanced techniques like deformable convolution [3], SNIP [33], group normalization [36], etc. The performance could be further improved with these kinds of techniques and other possible tricks. Our detector has the same detection speed (i.e., 11fps on one NVidia TitanX GPU) as RetinaNet500 [15] since it does not change the network architecture for inference.

5 Conclusion

In this paper, we address the class imbalance issue in one-stage object detectors by replacing the classification sub-task with a ranking sub-task, and proposing to solve the ranking task with AP-Loss. Due to non-differentiability and non-convexity of the AP-loss, we propose a novel algorithm to optimize it based on error-driven update scheme from perceptron learning. We provide a grounded theoretical analysis of the proposed optimization algorithm. Experimental results show that our approach can significantly improve the state-of-the-art one-stage detectors.

Acknowledgements. This paper is supported in part by: National Natural Science Foundation of China (61471235), Shanghai ’The Belt and Road’ Young Scholar Exchange Grant (17510740100), CREST Malaysia (No. T03C1-17), and the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. We gratefully acknowledge the support from Tencent YouTu Lab.

References

  • [1] JK Anlauf and M Biehl. The adatron: an adaptive perceptron algorithm. EPL (Europhysics Letters), 10(7):687, 1989.
  • [2] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, pages 379–387, 2016.
  • [3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, et al. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
  • [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [5] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
  • [6] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
  • [7] Ross Girshick. Fast r-cnn. In ICCV, pages 1440–1448, 2015.
  • [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [9] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In ACCV, pages 198–213, 2016.
  • [10] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In ECCV, pages 234–250, 2018.
  • [11] Werner Krauth and Marc Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A: Mathematical and General, 20(11):L745, 1987.
  • [12] Jianguo Li and Y Zhang. Learning surf cascade for fast and accurate object detection. In CVPR, 2013.
  • [13] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. Tiny-DSOD: Lightweight object detection for resource-restricted usages. In BMVC, 2018.
  • [14] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 3, 2017.
  • [15] Tsung-Yi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans on PAMI, 2018.
  • [16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [17] Songtao Liu, Di Huang, and andYunhong Wang. Receptive field block net for accurate and fast object detection. In The European Conference on Computer Vision (ECCV), September 2018.
  • [18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37, 2016.
  • [19] Pritish Mohapatra, CV Jawahar, and M Pawan Kumar. Efficient optimization for average precision svm. In NIPS, pages 2312–2320, 2014.
  • [20] Pritish Mohapatra, Michal Rolinek, C.V. Jawahar, Vladimir Kolmogorov, and M. Pawan Kumar. Efficient optimization for rank-based loss functions. In CVPR, 2018.
  • [21] A Novikoff. On convergence proofs for perceptrons. Proc.sympos.math.theory of Automata, pages 615–622, 1963.
  • [22] Yongming Rao, Dahua Lin, Jiwen Lu, and Jie Zhou. Learning globally optimized object detector via policy gradient. In CVPR, pages 6190–6198, 2018.
  • [23] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [24] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
  • [25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [26] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
  • [27] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1986.
  • [28] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
  • [29] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • [30] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
  • [31] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [33] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection–snip. In CVPR, 2018.
  • [34] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In ICML, pages 2169–2177, 2016.
  • [35] A Wendemuth. Learning the unlearnable. Journal of Physics A: Mathematical and General, 28(18):5423, 1995.
  • [36] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
  • [37] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271–278. ACM, 2007.
  • [38] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
  • [39] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L. Yuille. Single-shot object detection with enriched semantics. In CVPR, 2018.

A1. Convergence

We provide proof for the proposition mentioned in Section 3.3.1 of the paper. The proof is generalized from the original convergence proof [21] for perceptron learning algorithm.

Proposition 2

The AP-loss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;

(2) the training data is linearly separable.

Proof. Let 𝜽 denote the weights of the linear model. Let 𝒇k(n) denote the feature vector of k-th box in n-th training sample. Assume the number of training samples is finite and each training sample contains at most M boxes. Hence the score of k-th box is sk(n)=𝒇k(n),𝜽. Define xij(n)=-(si(n)-sj(n)). Note that the training data is separable, which means there are ϵ>0 and 𝜽* that satisfy:

n,i𝒫(n),j𝒩(n),𝒇i(n),𝜽*𝒇j(n),𝜽*+ϵ (15)

In the t-th step, a training sample which makes an error (if there is no such training sample, the model is already optimal and algorithm will stop) is randomly chosen. Then the update of 𝜽 is:

𝜽(t+1)=𝜽(t)+i𝒫j𝒩Lij(𝒙)(𝒇i-𝒇j) (16)

where

Lij(𝒙)=H(xij)1+kiH(xik) (17)

Here, since the discussion centers on the current training sample, we omit the superscript hereon.

From (16), we have

𝜽(t+1),𝜽*=𝜽(t),𝜽*+i𝒫j𝒩Lij(𝒇i-𝒇j),𝜽*𝜽(t),𝜽*+i𝒫j𝒩Lijϵ𝜽(t),𝜽*+maxi𝒫,j𝒩{Lij}ϵ𝜽(t),𝜽*+1|𝒫|+|𝒩|ϵ𝜽(t),𝜽*+1Mϵ (18)

For convenience, let 𝜽(0)=0 (if 𝜽(0)0, we can still find a c>0 that satisfies (20) for sufficiently large t), we have

𝜽(t),𝜽*1Mϵt (19)

Then

𝜽(t)𝜽(t),𝜽*𝜽*1M𝜽*ϵtct (20)

Here, c is a positive constant.

From (16), we also have

𝜽(t+1)2=𝜽(t)2+i𝒫j𝒩Lij(𝒇i-𝒇j)2+2i𝒫j𝒩Lij(𝒇i-𝒇j),𝜽(t)=𝜽(t)2+i𝒫j𝒩Lij(𝒇i-𝒇j)2+2i𝒫j𝒩Lijxji𝜽(t)2+i𝒫j𝒩Lij(𝒇i-𝒇j)2𝜽(t)2+|𝒫||𝒩|maxi𝒫,j𝒩{𝒇i-𝒇j2}𝜽(t)2+M2maxn,i,j{𝒇i(n)-𝒇j(n)2}𝜽(t)2+C (21)

Here, C is a positive constant. Let 𝜽(0)=0 (again, if 𝜽(0)0, we can still find a C>0 that satisfies (22) for sufficiently large t), we arrive at:

𝜽(t)2Ct (22)

Then, combining (20) and (22), we have

c2t2𝜽(t)2Ct (23)

which means

tCc2 (24)

It shows that the algorithm will stop at most after C/c2 steps, which means that the training model will achieve the optimal solution in finite steps.

A2. An Example of Gradient Descent Failing on Smoothed AP-loss

We approximate the step function in AP-loss by sigmoid function to make it amenable to gradient descent. Specifically, the smoothed AP-loss function is given by:

F=1|𝒫|i𝒫j𝒩S(xij)1+kiS(xik) (25)

where

S(x)=ex1+ex (26)

Consider a linear model s=f1θ1+f2θ2 and three training samples (0,0),(1,0),(-3,1) (the first one is negative sample, others are positive samples). Then we have

s(1)=0θ1+0θ2s(2)=1θ1+0θ2s(3)=-3θ1+1θ2 (27)

Note that the training data is separable since we have s(2)>s(1) and s(3)>s(1) when 0<θ1<13θ2.

Under this setting, the smoothed AP-loss become

F(θ1,θ2)=12(S(-θ1)1+S(-θ1)+S(θ2-4θ1)+S(3θ1-θ2)1+S(4θ1-θ2)+S(3θ1-θ2)) (28)

If θ1 is sufficiently large and θ1>θ2>0, then the partial derivatives satisfy the following condition:

Fθ1<Fθ2<0 (29)

which means θ1 and θ2 will keep increasing with the inequality θ1>θ2 according to the gradient descent algorithm. Hence the objective function F will approach 1/6 here. However, the objective function F approaches the global minimum value 0 if and only if θ1+ and θ2-3θ1+. This shows that the gradient descent fails to converge to global minimum in this case.

A3. Inseparable Case

In this section, we will provide analysis for our algorithm with inseparable training data. We demonstrate that the bound of accumulated AP-loss depends on the best performance of learning model. The analysis is based on online learning bounds [29].

A3.1. Preliminary

To handle the inseparable case, a mild modification on the proposed algorithm is needed, i.e. in the error-driven update scheme, Lij is modified to

L~ij=H~(xij)1+k𝒫𝒩,kiH(xik) (30)

where H~() is defined in Section 3.4.2 (Piecewise Step Function) of the paper. The purpose is to introduce a non-zero decision margin for the pairwise score xij which makes the algorithm more robust in the inseparable case. In contrast to the case in Section 3.4.2, here we only change H() to H~() in the numerator for the convenience of theoretical anaysis. However, such algorithm still suffers from the discontinuity of H() in the denominator. Hence the strategy in Section 3.4.2 is also practical consideration, necessary for good performance. Then, consider the AP-loss:

AP(𝒙;𝒫,𝒩)=1|𝒫|i𝒫j𝒩H(xij)1+j𝒫𝒩,jiH(xij) (31)

and define a surrogate loss function:

l(𝒙,𝒙^;𝒫,𝒩)=1|𝒫|i𝒫j𝒩Q(xij)1+j𝒫𝒩,jiH(x^ij) (32)

where Q(x)=-xH~(υ)𝑑υ. Note that the AP-loss is upper bounded by the surrogate loss:

l(𝒙,𝒙;𝒫,𝒩)δ4AP(𝒙;𝒫,𝒩) (33)

The learning model can be written as 𝒙=𝑿d(𝜽), where d𝒟 denotes the training data for one iteration and D is the whole training set. Then, the modified error-driven algorithm is equivalent to gradient descent on surrogate loss l(𝑿d(t)(𝜽),𝑿d(t)(𝜽(t));𝒫d(t),𝒩d(t)) at each step t. We further suppose below conditions are satisfied:

(1) For all 𝜽^ and d𝒟, l(𝑿d(𝜽),𝑿d(𝜽^);𝒫d,𝒩d) is convex w.r.t 𝜽.

(2) For all d𝒟, 𝑿d(𝜽)/𝜽 is upper bounded by a constant R. Here is the matrix norm induced by the 2-norm for vectors.

Remark 1. Note that these two conditions are satisfied if the learning model is linear.

A3.2. Bound of Accumulated Loss

By the convexity, we have:

l(t)(𝜽)l(t)(𝒖)+𝜽-𝒖,l(t)(𝜽)𝜽 (34)

where l(t)(𝜽) denotes l(𝑿d(t)(𝜽),𝑿d(t)(𝜽(t));𝒫d(t),𝒩d(t)) and 𝒖 can be any vector of model weights. Then, let 𝜽=𝜽(t) and compute the sum over t=1T, we have:

t=1Tl(t)(𝜽(t))-t=1Tl(t)(𝒖)t=1T𝜽(t)-𝒖,l(t)(𝜽(t))𝜽=t=1T𝜽(t)-𝒖,1η(𝜽(t)-𝜽(t+1))12η𝒖-𝜽(1)2+12ηt=1T𝜽(t)-𝜽(t+1)2=12η𝒖-𝜽(1)2+η2t=1Tl(t)(𝜽(t))𝜽2 (35)

where η is the step size of gradient descent. Note that

l(t)(𝜽(t))𝜽=𝑿(𝜽)𝜽|𝜽=𝜽(t)l(𝒙,𝒙(t))𝒙|𝒙=𝑿(𝜽(t)) (36)

and

l(𝒙,𝒙(t))𝒙2=1|𝒫|2i𝒫j𝒩H~2(xij)(1+j𝒫𝒩,jiH(xij(t)))21|𝒫|2i𝒫1δj𝒩Q(xij)1+j𝒫𝒩,jiH(xij(t))1δl(𝒙,𝒙(t)) (37)

Note that both 𝒫d and 𝒩d depend on d. However, we omit the subscript d here since the discussion only centers on the current training sample d(t).

Hence we have:

t=1Tl(t)(𝜽(t))-t=1Tl(t)(𝒖)12η𝒖-𝜽(1)2+ηR22δt=1Tl(t)(𝜽(t)). (38)

Let η=δ/R2, rearrange and get the expression:

12t=1Tl(t)(𝜽(t))R22δ𝒖-𝜽(1)2+t=1Tl(t)(𝒖) (39)

This entails the bound of surrogate loss l:

t=1Tl(t)(𝜽(t))2t=1Tl(t)(𝒖)+R2δ𝒖-𝜽(1)2 (40)

which implies the bound of AP-loss AP:

t=1TAP(𝑿(𝜽(t)))8δt=1Tl(t)(𝒖)+4R2δ2𝒖-𝜽(1)2 (41)

As a special case, if there exists a 𝒖 such that l(t)(𝒖)=0 for all t, then the accumulated AP-loss is bounded by a constant, which implies that convergence can be achieved with finite steps (similar to that of the separable case). Otherwise, with sufficiently large T, the average AP-loss mainly depends on 1T8δt=1Tl(t)(𝒖). This implies that the bound is meaningful if there still exists a sufficiently good solution 𝒖 in such inseparable case.

A3.3. Offline Setting

With the offline setting (d(t)=d for all t), a bound with simpler form can be revealed. For simplicity, we will omit the subscript d of 𝑿d(𝒖),𝒫d,𝒩d and define Ai(𝒖)=j𝒩Q(Xij(𝒖)), Z(𝒖)=maxi𝒫{Ai(𝒖)}. Then,

l(t)(𝒖)=1|𝒫|i𝒫j𝒩Q(Xij(𝒖))1+j𝒫𝒩,jiH(Xij(𝜽(t)))=1|𝒫|i𝒫Ai(𝒖)1+j𝒫𝒩,jiH(Xij(𝜽(t)))Z(𝒖)|𝒫|i=1|𝒫|1iln|𝒫|+1|𝒫|Z(𝒖) (42)

The second last inequality is based on the fact that (1+j𝒫𝒩,jiH(Xij(𝜽(t)))) are picked from 1(|𝒫|+|𝒩|) without replacement (assume no ties; if ties exist, this inequality still holds). Combining the results from Equation 42 and Equation 41, we have:

1Tt=1TAP(𝑿(𝜽(t)))ln|𝒫|+1|𝒫|8δZ(𝒖)+1T4R2𝒖-𝜽(1)2δ2 (43)

Next,

l(t)(𝒖)=1|𝒫|i𝒫Ai(𝒖)1+j𝒫𝒩,jiH(Xij(𝜽(t)))=1|𝒫|i𝒫1+j𝒫,jiH(Xij(𝜽(t)))1+j𝒫𝒩,jiH(Xij(𝜽(t)))Ai(𝒖)1+j𝒫,jiH(Xij(𝜽(t)))1|𝒫|i𝒫1+j𝒫,jiH(Xij(𝜽(t)))1+j𝒫𝒩,jiH(Xij(𝜽(t)))Z(𝒖)=(1-AP(𝑿(𝜽(t))))Z(𝒖) (44)

Combining the results from Equation 44 and Equation 41, we have:

1Tt=1TAP(𝑿(𝜽(t)))8δZ(𝒖)1+8δZ(𝒖)+1T4R2𝒖-𝜽(1)2δ2 (45)

If Z(𝒖) is small, the bound in Equation 43 is active, otherwise the bound in Equation 45 is active. Consequently, we have:

AP¯min{ln|𝒫|+1|𝒫|8δZ(𝒖),8δZ(𝒖)1+8δZ(𝒖)}+ϵ (46)

where AP¯ denotes the average AP-loss, ϵ0 as T increases.

A4. Consistency

Observation 2

When the activation function L() takes the form of softmax function and loss-augmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on cross-entropy loss and hinge loss respectively.

Cross Entropy Loss: Consider the multi-class classification task. The outputs of neural network are (x1,,xK) where K is the number of classes, and the ground truth label is y{1,,K}. Using softmax as the activation function, we have:

(L1,,LK)=softmax(𝒙)=(ex1iexi,,exKiexi) (47)

The cross entropy loss is:

ce=-i𝟏y=ilog(Li) (48)

Hence the gradient of xi is

gi=Li-𝟏y=i (49)

Note that gi is “error-driven” with the desired output 𝟏y=i and current output Li. This form is consistent with our error-driven update scheme (c.f. Section 3.2.1 of the paper).

Hinge Loss: Consider the binary classification task. The output of neural network is x, and the ground truth label is y{1,2}. Define (x1,x2)=(-x,x). Using loss-augmented step function as the activation function, we have:

(L1,L2)=(H(x1-1),H(x2-1)) (50)

where H() is the Heaviside step function. The hinge loss is:

hinge=𝟏y=1max{1-x1,0}+𝟏y=2max{1-x2,0} (51)

Hence the gradient of xi is

gi=𝟏y=i(Li-1) (52)

There are two cases. If y=i, the gradient gi is “error-driven” with the desired output 1 and current output Li. If yi, the gradient gi equals zero, since xi does not contribute to the loss. This form is consistent with our error-driven update scheme (c.f. Section 3.2.1 of the paper).

A5. Additional Experiments on SSD

A5.1. Experimental Settings

We also evaluate the proposed AP-loss on another one-stage detector SSD [18]. The models are trained on VOC2007 and VOC2012 trainval sets, and tested on VOC2007 test set. We use VGG-16 [32] as the backbone model which is pre-trained on the ImageNet-1k classification dataset [4]. We use conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2, conv12_2 to predict both location and their corresponding confidences. An additional convolution layer is added after conv4_3 to scale the feature. The associated anchors are the same as that designed in [18]. In testing phase, the input image is fixed to 512×512. For focal loss with SSD, we observe that the hyper-parameters γ=1,α=0.25 lead to a much better performance than the original settings in [15] which are γ=2,α=0.25. Hence we evaluate the focal loss with new γ and α in our experiments on SSD. Other details are similar to that in Section 4.1 of the paper.

A5.2. Results

Training Loss PASCAL VOC
AP AP50 AP75
CE-Loss + OHEM 43.6 76.0 44.7
Focal Loss 39.3 69.9 38.0
AUC-Loss 33.8 63.7 31.5
AP-Loss 45.2 77.3 47.3
Table 4: Comparison through different training losses. Models are tested on VOC2007 test set. The metric AP is averaged over multiple IoU thresholds of 0.50:0.05:0.95.
Figure 7: Detection accuracy (mAP) on VOC2007 test set.(Best viewed in color)

The results are shown in Table 4 and Figure 7. Note that the AP-loss outperforms all the other losses at both the final state and various snapshot time points. Together with the results on RetinaNet [15] in Section 4.2.2 of the paper, we observe the robustness of the proposed AP-loss, which performs much better than the other competing losses on different datasets (i.e. PASCAL VOC [5], MS COCO [16]) and different detectors (i.e. RetinaNet [15], SSD [18]). This demonstrates the effectiveness and strong generalization ability of our proposed approach.