Abstract
Onestage object detectors are trained by optimizing classificationloss andlocalizationloss simultaneously, with the former suffering much from extremeforegroundbackground class imbalance issue due to the large number of anchors.This paper alleviates this issue by proposing a novel framework to replace theclassification task in onestage detectors with a ranking task, and adoptingthe AveragePrecision loss (APloss) for the ranking problem. Due to itsnondifferentiability and nonconvexity, the APloss cannot be optimizeddirectly. For this purpose, we develop a novel optimization algorithm, whichseamlessly combines the errordriven update scheme in perceptron learning andbackpropagation algorithm in deep networks. We verify good convergence propertyof the proposed algorithm theoretically and empirically. Experimental resultsdemonstrate notable performance improvement in stateoftheart onestagedetectors based on APloss over different kinds of classificationlosses onvarious benchmarks, without changing the network architectures. Code isavailable at https://github.com/cccorn/APloss.
Quick Read (beta)
Towards Accurate OneStage Object Detection with APLoss
Abstract
Onestage object detectors are trained by optimizing classificationloss and localizationloss simultaneously, with the former suffering much from extreme foregroundbackground class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in onestage detectors with a ranking task, and adopting the AveragePrecision loss (APloss) for the ranking problem. Due to its nondifferentiability and nonconvexity, the APloss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm, which seamlessly combines the errordriven update scheme in perceptron learning and backpropagation algorithm in deep networks. We verify good convergence property of the proposed algorithm theoretically and empirically. Experimental results demonstrate notable performance improvement in stateoftheart onestage detectors based on APloss over different kinds of classificationlosses on various benchmarks, without changing the network architectures. Code is available at https://github.com/cccorn/APloss.
1 Introduction
Object detection needs to localize and recognize the objects simultaneously from the large backgrounds, which remains challenging due to the imbalance between foreground and background. Deep learning based detection solutions usually adopt a multitask architecture, which handles classification task and localization task with different loss functions. The classification task aims to recognize the object in a given box, while the localization task aims to predict the precise bounding box of the object. Twostage detectors [25, 7, 2, 14] first generates a limited number of object box proposals, so that the detection problem can be solved by adopting classification task on those proposals. However, the circumstance is different for onestage detectors, which need to predict the object class directly from the densely predesigned candidate boxes. The large number of boxes yield the imbalance between foreground and background which makes the optimization of classification task easily biased and thus impacts the detection performance. It is observed that the classification metric could be very high for a trivial solution which predicts negative label for almost all candidate boxes, while the detection performance is poor. (a) illustrates one such example.
To tackle this issue in onestage object detectors, some works introduce new classification losses such as balanced loss [23], Focal Loss [15], as well as tailored training method such as Online Hard Example Mining (OHEM) [18, 31]. These losses model each sample (anchor box) independently, and attempt to reweight the foreground and background samples in classification losses to cater for the imbalance condition; this is done without considering the relationship among different samples. The designed balance weights are handcrafted hyperparameters, which do not generalize well across datasets. We argue that the gap between classification task and detection task hinder the performance of onestage detectors. In this paper, instead of modifying the classification loss, we propose to replace classification task with ranking task in onestage detectors, where the associated ranking loss explicitly models sample relationship, and is invariant to the ratio of positive and negative samples. As shown in (b), we adopt Average Precision (AP) as our target loss which is inherently more consistent with the evaluation metric for object detection.
However, it is nontrivial to directly optimize the APloss due to the nondifferentiability and nondecomposability, so that standard gradient descent methods are not amenable for this case. There are three aspects of studies for this issue. First, AP based loss is studied within structured SVM models [37, 19], which restricts in linear SVM model so that the performance is limited. Second, a structured hinge loss [20] is proposed to optimize the upper bound of APloss instead of the loss itself. Third, approximate gradient methods [34, 9] are proposed for optimizing the APloss, which are less efficient and easy to fall into local optimum even for the case of linear models due to the nonconvexity and nonquasiconvexity of the APloss. Therefore, it is still an open problem for the optimization of the APloss.
In this paper, we address this challenge by replacing the classification task in onestage detectors with a ranking task, so that we handle the class imbalance problem with a ranking based loss named APloss. Furthermore, we propose a novel errordriven learning algorithm to effectively optimize the nondifferentiable AP based objective function. More specifically, some extra transformations are added to the score output of onestage detector to obtain the APloss, which includes a linear transformation that transforms the scores to pairwise differences, and a nonlinear and nondifferentiable “activation function” that transform the pairwise differences to the primary terms of the APloss. Then the APloss can be obtained by the dot product between the primary terms and the label vector. It is worth noting that the difficulty for using gradient method on the APloss lies in passing gradients through the nondifferentiable activation function. Inspired by the perceptron learning algorithm [26], we adopt an errordriven learning scheme to directly pass the update signal through the nondifferentiable activation function. Different from gradient method, our learning scheme gives each variable an update signal proportional to the error it makes. Then, we adopt the backpropagation algorithm to transfer the update signal to the weights of neural network. We theoretically and experimentally prove that the proposed optimization algorithm does not suffer from the nondifferentiability and nonconvexity of the objective function. The main contributions of this paper are summarized as below:

•
We propose a novel framework in onestage object detectors which adopts the ranking loss to handle the class imbalance issue.

•
We propose an errordriven learning algorithm that can efficiently optimize the nondifferentiable and nonconvex APbased objective function with both theoretical and experimental verifications.

•
We show notable performance improvement with the proposed method on stateoftheart onestage detectors over different kinds of classificationlosses without changing the model architecture.
2 Related Work
Onestage detectors: In object detection, the onestage approaches have relatively simpler architecture and higher efficiency than twostage approaches. OverFeat [28] is one of the first CNNbased onestage detectors. Thereafter, different designs of onestage detectors are proposed, including SSD [18], YOLO [23], DSSD [6] and DSOD [30, 13]. These methods demonstrate good processing efficiency as onestage detectors, but generally yield lower accuracy than twostage detectors. Recently, RetinaNet [15] and RefineDet [38] narrow down the performance gap (especially on the challenging COCO benchmark [16]) between onestage approaches and twostage approaches with some innovative designs. As commonly known, the performance of onestage detectors benefits much from densely designed anchors, which introduce extreme imbalance between foreground and background samples. To address this challenge, methods like OHEM [18, 31] and Focal Loss [15] have been proposed to reduce the loss weight for easy samples. However, there are two hurdles that are still open to discussion. Firstly, handcrafted hyperparameters for weight balance do not generalize well across datasets. Secondly, the relationship among sample anchors is far from well modeled.
AP as a loss for Object Detection: Average Precision (AP) is widely used as the evaluation metric in many tasks such as object detection [5] and information retrieval [27]. However, AP is far from a good and common choice as an optimization goal in object detection due to its nondifferentiability and nonconvexity. Some methods have been proposed to optimize the APloss in object detection, such as APloss in the linear structured SVM model [37, 19], structured hinge loss as upper bound of the APloss [20], approximate gradient methods [34, 9], reinforcement learning to finetune a pretrained object detector with AP based metric [22]. Although these methods give valuable results in optimizing the APloss, their performances are still limited due to the intrinsic limitations. In details, the proposed approach differs from them in 4 aspects. (1) Our approach can be used for any differentiable linear or nonlinear models such as neural networks, while [37, 19] only work for linear SVM model. (2) Our approach directly optimizes the APloss, while [20] introduces notable loss gap after relaxation. (3) Our approach dose not approximate the gradient and dose not suffer from the nonconvexity of objective function as in [34, 9]. (4) Our approach can train the detectors in an endtoend way, while [22] cannot.
Perceptron Learning Algorithm: The core of our optimization algorithm is the “errordriven update” which is generalized from the perceptron learning algorithm [26], and helps overcome the difficulty of the nondifferentiable objective functions. The perceptron is a simple artificial neuron using the Heaviside step function as the activation function. The learning algorithm was first invented by Frank Rosenblatt [26]. As the Heaviside step function in perceptron is nondifferentiable, it is not amenable for gradient method. Instead of using a surrogate loss like crossentropy, the perceptron learning algorithm employs an errordriven update scheme directly on the weights of neurons. This algorithm is guaranteed to converge in finite steps if the training data is linearly separable. Further works like [11, 1, 35] have studied and improved the stability and robustness of the perceptron learning algorithm.
3 Method
We aim to replace the classification task with APloss based ranking task in onestage detectors such as RetinaNet [15]. Figure 2 shows the two key components of our approach, i.e., the ranking procedure and the errordriven optimization algorithm. Below, we will first present how APloss is derived from traditional score output. Then, we will introduce the errordriven optimization algorithm. Finally, we also present the theoretical analyses of the proposed optimization algorithm and outline the training details. Note that all changes are made on the loss part of the classification branch without changing the backbone model and localization branch.
3.1 Ranking Task and APLoss
3.1.1 Ranking Task
In traditional onestage detectors, given input image $I$, suppose the predefined boxes (also called anchors) set is $B$, each box ${b}_{i}\in B$ will be assigned a label ${t}_{i}\in \{1,0,1,\mathrm{\dots},K\}$ based on ground truth and the IoU strategy [7, 25], where label $1\sim K$ means the object class ID, label “0” means background and label “$1$” means ignored boxes. During training and testing phase, the detector outputs a scorevector $({s}_{i}^{0},\mathrm{\cdots},{s}_{i}^{K})$ for each box ${b}_{i}$.
In our framework, instead of one box with $K+1$ dimensional score predictions, we replicate each box ${b}_{i}$ for $K$ times to obtain ${b}_{ik}$ where $k=1,\mathrm{\cdots},K$, and the $k$th box is responsible for the $k$th class. Each box ${b}_{ik}$ will be assigned a label ${t}_{ik}\in \{1,0,1\}$ through the same IoU strategy (label $1$ for not counted into the ranking loss). Therefore, in the training and testing phase, the detector will predict only one scalar score ${s}_{ik}$ for each box ${b}_{ik}$. Figure 3 illustrates our label formulation and the difference to traditional case.
The ranking task dictates that every positive boxes should be ranked higher than all the negative boxes w.r.t their scores. Note that AP of our ranking result is computed over the scores from all classes. This is slightly different from the evaluation metric meanAP for object detection systems, which computes AP for each class and obtains the average value. We compute AP this way because the score distribution should be unified for all classes while ranking each class separately cannot achieve this goal.
3.1.2 APLoss
For simplicity, we still use $B$ to denote the anchor box set after replication, and ${b}_{i}$ to denote the $i$th anchor box without the replication subscript. Each box ${b}_{i}$ thus corresponds to one scalar score ${s}_{i}$ and one binary label ${t}_{i}$. Some transformations are required to formulate a ranking loss as illustrated in Figure 2. First, the difference transformation transfers the score ${s}_{i}$ to the difference form
$$\forall i,j,{x}_{ij}=\left(s({b}_{i};\bm{\theta})s({b}_{j};\bm{\theta})\right)=\left({s}_{i}{s}_{j}\right)$$  (1) 
where $s({b}_{i};\bm{\theta})$ is a CNN based score function with weights $\bm{\theta}$ for box ${b}_{i}$. The ranking label transformation transfers labels ${t}_{i}$ to the corresponding pairwise ordering form
$$\forall i,j,{y}_{ij}={\mathrm{\U0001d7cf}}_{{t}_{i}=1,{t}_{j}=0}$$  (2) 
where $\mathrm{\U0001d7cf}$ is a indicator function which equals to 1 only if the subscript condition holds (i.e., ${t}_{i}=1,{t}_{j}=0$), otherwise $0$. Then, we define an vectorvalued activation function $\bm{L}(\cdot )$ to produce the primary terms of the APloss as
$${L}_{ij}\left(\bm{x}\right)=\frac{H\left({x}_{ij}\right)}{1+{\sum}_{k\in \mathcal{P}\cup \mathcal{N},k\ne i}H\left({x}_{ik}\right)}={L}_{ij}$$  (3) 
where $H(\cdot )$ is the Heaviside step function:
$$  (4) 
A ranking is denoted as proper ranking when there are no two samples scored equally (i.e., $\forall i\ne j,{s}_{i}\ne {s}_{j}$). Without loss of generality, we will treat all rankings as a proper ranking by breaking ties arbitrarily. Now, we can formulate the APloss ${\mathcal{L}}_{AP}$ as
$$\begin{array}{cc}& {\mathcal{L}}_{AP}=1\text{AP}=1\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{ran{k}^{+}\left(i\right)}{rank\left(i\right)}\hfill \\ & =1\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{1+{\sum}_{j\in \mathcal{P},j\ne i}H\left({x}_{ij}\right)}{1+{\sum}_{j\in \mathcal{P},j\ne i}H\left({x}_{ij}\right)+{\sum}_{j\in \mathcal{N}}H\left({x}_{ij}\right)}\hfill \\ & =\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}=\frac{1}{\left\mathcal{P}\right}\sum _{i,j}{L}_{ij}\cdot {y}_{ij}=\frac{1}{\left\mathcal{P}\right}\u27e8\bm{L}\left(\bm{x}\right),\bm{y}\u27e9\hfill \end{array}$$  (5) 
where $rank(i)$ and $ran{k}^{+}(i)$ denote the ranking position of score ${s}_{i}$ among all valid samples and positive samples respectively, $\mathcal{P}=\{i{t}_{i}=1\}$, $\mathcal{N}=\{i{t}_{i}=0\}$, $\mathcal{P}$ is the size of set $\mathcal{P}$, $\bm{L}$ and $\bm{y}$ are vector form for all ${L}_{ij}$ and ${y}_{ij}$ respectively, $\u27e8,\u27e9$ means dotproduct of two input vectors. Note that $\bm{x},\bm{y},\bm{L}\in {\mathbb{R}}^{d}$, where $d={(\mathcal{P}+\mathcal{N})}^{2}$.
Finally, the optimization problem can be written as:
$$\underset{\bm{\theta}}{\mathrm{min}}{\mathcal{L}}_{AP}\left(\bm{\theta}\right)=1\text{AP}\left(\bm{\theta}\right)=\frac{1}{\left\mathcal{P}\right}\u27e8\bm{L}\left(\bm{x}\left(\bm{\theta}\right)\right),\bm{y}\u27e9$$  (6) 
where $\bm{\theta}$ denotes the weights of detector model. As the activation function $\bm{L}(\cdot )$ is nondifferentiable, a novel optimization/learning scheme is required instead of the standard gradient descent method.
Besides the AP metric, other ranking based metric can also be used to design the ranking loss for our framework. One example is the AUCloss [12] which measures the area under ROC curve for ranking purpose, and has a slightly different “activation function” as
$${L}_{ij}^{\prime}\left(\bm{x}\right)=\frac{H\left({x}_{ij}\right)}{\left\mathcal{N}\right}$$  (7) 
As AP is consistent with the evaluation metric of the object detection task, we argue that APloss is intuitively more suitable than AUCloss for this task, and will provide empirical study in our experiments.
3.2 Optimization Algorithm
3.2.1 ErrorDriven Update
Recalling the perceptron learning algorithm, the update for input variable is “errordriven”, which means the update is directly derived from the difference between desired output and current output. We adopt this idea and further generalize it to accommodate the case of activation function with vectorvalued input and output. Suppose ${x}_{ij}$ is the input and ${L}_{ij}$ is the current output, the update for ${x}_{ij}$ is thus
$$\mathrm{\Delta}{x}_{ij}={L}_{ij}^{*}{L}_{ij}$$  (8) 
where ${L}_{ij}^{*}$ is the desired output. Note that the APloss achieves its minimum possible value $0$ when each term ${L}_{ij}\cdot {y}_{ij}=0$. There are two cases. If ${y}_{ij}=1$, we should set the desired output ${L}_{ij}^{*}=0$. If ${y}_{ij}=0$, we do not care the update and set it to $0$, since it does not contribute to the APloss. Consequently, the update can be simplified as
$$\mathrm{\Delta}{x}_{ij}={L}_{ij}\cdot {y}_{ij}$$  (9) 
3.2.2 Backpropagation
We now have the desired vectorform update $\mathrm{\Delta}\bm{x}$, and then will find an update for model weights $\mathrm{\Delta}\bm{\theta}$ which will produce most appropriate movement for $\bm{x}$. We use dotproduct to measure the similarity of successive movements, and regularize the change of weights (i.e. $\mathrm{\Delta}\bm{\theta}$) with ${L}_{2}$norm based penalty term. The optimization problem can be written as:
$$\mathrm{arg}\underset{\mathrm{\Delta}\bm{\theta}}{\mathrm{min}}\left\{\u27e8\mathrm{\Delta}\bm{x},\bm{x}\left({\bm{\theta}}^{\left(n\right)}+\mathrm{\Delta}\bm{\theta}\right)\bm{x}\left({\bm{\theta}}^{\left(n\right)}\right)\u27e9+\lambda {\parallel \mathrm{\Delta}\bm{\theta}\parallel}_{2}^{2}\right\}$$  (10) 
where ${\bm{\theta}}^{(n)}$ denotes the model weights at the $n$th step. With that, the firstorder expansion of $\bm{x}(\bm{\theta})$ is given by:
$$\bm{x}\left(\bm{\theta}\right)=\bm{x}\left({\bm{\theta}}^{\left(n\right)}\right)+\frac{\partial \bm{x}\left({\bm{\theta}}^{\left(n\right)}\right)}{\partial \bm{\theta}}\cdot \left(\bm{\theta}{\bm{\theta}}^{\left(n\right)}\right)+o\left(\parallel \bm{\theta}{\bm{\theta}}^{\left(n\right)}\parallel \right)$$  (11) 
where $\partial \bm{x}({\bm{\theta}}^{(n)})/\partial \bm{\theta}$ is the Jacobian matrix of vectorvalued function $\bm{x}(\bm{\theta})$ at ${\bm{\theta}}^{(n)}$. Ignoring the highorder infinitesimal, we obtain the stepwise minimization process:
$${\bm{\theta}}^{\left(n+1\right)}{\bm{\theta}}^{\left(n\right)}=\mathrm{arg}\underset{\mathrm{\Delta}\bm{\theta}}{\mathrm{min}}\left\{\u27e8\mathrm{\Delta}\bm{x},\frac{\partial \bm{x}\left({\bm{\theta}}^{\left(n\right)}\right)}{\partial \bm{\theta}}\mathrm{\Delta}\bm{\theta}\u27e9+\lambda {\parallel \mathrm{\Delta}\bm{\theta}\parallel}_{2}^{2}\right\}$$  (12) 
The optimal solution can be obtained by finding the stationary point. Then, the form of optimal $\mathrm{\Delta}\bm{\theta}$ is consistent with the chain rule of derivative, which means, it can be directly implemented by setting the gradient of ${x}_{ij}$ to $\mathrm{\Delta}{x}_{ij}$ (c.f. Equation 9) and proceeding with backpropagation. Hence the gradient for score ${s}_{i}$ can be obtained by backward propagating the gradient through the difference transformation:
$$\begin{array}{cc}\hfill {g}_{i}=\sum _{j,k}& \mathrm{\Delta}{x}_{jk}\cdot \frac{\partial {x}_{jk}}{\partial {s}_{i}}=\sum _{j}\mathrm{\Delta}{x}_{ij}\sum _{j}\mathrm{\Delta}{x}_{ji}\hfill \\ & =\sum _{j}{L}_{ji}\cdot {y}_{ji}\sum _{j}{L}_{ij}\cdot {y}_{ij}\hfill \end{array}$$  (13) 
3.3 Analyses
Convergence: To better understand the characteristics of the APloss, we first provide a theoretical analysis on the convergence of the optimization algorithm, which is generalized from the convergence property of the original perceptron learning algorithm.
Proposition 1
The APloss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;
(2) the training data is linearly separable.
The proof of this proposition is provided in Appendix1 of supplementary. Although convergence is somewhat weak due to the need of strong conditions, it is nontrivial since the APloss function is not convex or quasiconvex even for the case of linear model and linearly separable data, so that gradient descent based algorithm may still fail to converge on a smoothed APloss function even under such strong conditions. One such example is presented in Appendix2 of supplementary. It means that, under such conditions, our algorithm still optimizes better than the approximate gradient descent algorithm for APloss. Furthermore, with some mild modifications, even though the training data is not separable, the accumulated APloss can also be bounded proportionally by the best performance of the learning model. More details are presented in Appendix3 of supplementary.
Consistency: Besides convergence, We observed that the proposed optimization algorithm is inherently consistent with widely used classificationloss functions.
Observation 1
When the activation function $L\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ takes the form of softmax function and lossaugmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on crossentropy loss and hinge loss respectively.
The detailed analysis of this observation is presented in Appendix4 of supplementary. We argue that the observed consistency is on the basis of the “errordriven” property. As is known, the gradients of those widely used loss functions are proportional to their prediction errors, where the prediction here refers to the output of activation function. In other words, their activation functions have a nice property: the vector field of prediction errors is conservative, allowing it being the gradient of some surrogate loss function. However, our activation function does not have this property, which makes our optimization not able to express as gradient descent with any surrogate loss function.
3.4 Details of Training Algorithm
{algorithm}[t] \[email protected]@algorithmic[1] \RequireAll scores $\{{s}_{i}\}$ and corresponding labels $\{{t}_{i}\}$ in a minibatch \EnsureGradient of input $\{{g}_{i}\}$ \State$\forall i,{g}_{i}\leftarrow 0$ \State$\text{MaxPrec}\leftarrow 0$ \State$\mathcal{P}\leftarrow \{i\mid {t}_{i}=1\},\mathcal{N}\leftarrow \{i\mid {t}_{i}=0\}$ \State$O\leftarrow argsort(\{{s}_{i}\mid i\in \mathcal{P}\})$ \CommentIndexes of scores sorted in ascending order \For$i\in O$ \StateCompute ${x}_{ij}={s}_{j}{s}_{i}$ for all $j\in \mathcal{P}\cup \mathcal{N}$ and ${L}_{ij}$ for all $j\in \mathcal{N}$ \CommentAccording to Equation 3 and Equation 14 \State$\text{Prec}\leftarrow 1{\sum}_{j\in \mathcal{N}}{L}_{ij}$ \If$\text{Prec}\ge \text{MaxPrec}$ \State$\text{MaxPrec}\leftarrow \text{Prec}$ \Else\CommentInterpolation \State$\forall j\in \mathcal{N},{L}_{ij}\leftarrow {L}_{ij}\cdot (1\text{MaxPrec})/(1\text{Prec})$ \EndIf\State${g}_{i}\leftarrow {\sum}_{j\in \mathcal{N}}{L}_{ij}$ \CommentAccording to Equation 13 \State$\forall j\in \mathcal{N},{g}_{j}\leftarrow {g}_{j}+{L}_{ij}$ \CommentAccording to Equation 13 \EndFor\State$\forall i,{g}_{i}\leftarrow {g}_{i}/\mathcal{P}$ \CommentNormalization



Minibatch Training The minibatch training strategy is widely used in deep learning frameworks [8, 18, 15] as it accounts for more stability than the case with batch size equal to 1. The minibatch training helps our optimization algorithm quite a lot for escaping the socalled “scoreshift” scenario. The APloss can be computed both from a batch of images and from a single image with multiple anchor boxes. Consider an extreme case: our detector can predict perfect ranking in both image ${I}_{1}$ and image ${I}_{2}$, but the lowest score in image ${I}_{1}$ is even greater than the highest score in image ${I}_{2}$. There are “scoreshift” between two images so that the detection performance is poor when computing APloss perimage. Aggregating scores over images in a minibatch can avoid such problem, so that the minibatch training is crucial for good convergence and good performance.
Piecewise Step function During early stage of training, scores ${s}_{i}$ are very close to each other (i.e. almost all inputs to Heaviside step function $H(x)$ are near zero), so that a small change of input will cause a big output difference, which destabilizes the updating process. To tackle this issue, we replace $H(x)$ with a piecewise step function:
$$  (14) 
The piecewise step functions with different $\delta $ are shown in Figure 4. When $\delta $ approaches $+0$, the piecewise step function approaches the original step function. Note that $f(\cdot )$ is only different from $H(\cdot )$ near zero point. We argue that the precise form of the piecewise step function is not crucial. Other monotonic and symmetric smooth functions that only differs from $H(\cdot )$ near zero point could be equally effective. The choice of $\delta $ relates closely to the weight decay hyperparameter in CNN optimization. Intuitively, parameter $\delta $ controls the width of decision boundary between positive and negative samples. Smaller $\delta $ enforces a narrower decision boundary, which causes the weights to shrink correspondingly (similar effect to that caused by the weight decay). Further details are presented in the experiments.
Interpolated AP The interpolated AP [27] is widely adopted by many object detection benchmarks like PASCAL VOC [5] and MS COCO [16]. The common justification for interpolating the precisionrecall curve [5] is “to reduce the impact of ’wiggles’ in the precisionrecall curve, caused by small variations in the ranking of examples”. Under the same consideration, we adopt the interpolated AP instead of the original version. Specifically, the interpolation is applied on ${L}_{ij}$ to make the precision at the $k$th smallest positive sample monotonically increasing with $k$ where the precision is $(1{\sum}_{j\in \mathcal{N}}{L}_{ij})$ in which $i$ is the index of the $k$th smallest positive sample. It is worth noting that the interpolated AP is a smooth approximation of the actual AP so that it is a practical choice to help to stabilize the gradient and to reduce the impact of ’wiggles’ in the update signals. The details of the interpolated AP based algorithm is summarized in Algorithm 3.4.
4 Experiments
4.1 Experimental Settings
We evaluate the proposed method on the stateoftheart onestage detector RetinaNet [15]. The implementation details are the same as in [15] unless explicitly stated. Our experiments are performed on two benchmark datasets: PASCAL VOC [5] and MS COCO [16]. The PASCAL VOC dataset has 20 classes, with VOC2007 containing 9,963 images for train/val/test and VOC2012 containing 11,530 for train/val. The MS COCO dataset has 80 classes, containing 123,287 images for train/val. We implement our codes with the MXNET framework, and conduct experiments on a workstation with two NVidia TitanX GPUs.
PASCAL VOC: When evaluated on the VOC2007 test set, models are trained on the VOC2007 and VOC2012 trainval sets. When evaluated on the VOC2012 test set, models are trained on the VOC2007 and VOC2012 trainval sets plus the VOC2007 test set. Similar to the evaluation metrics used in the MS COCO benchmark, we also report the AP averaged over multiple IoU thresholds of $0.50:0.05:0.95$. We set $\delta =1$ in Equation 14. We use ResNet [8] as the backbone model which is pretrained on the ImageNet1k classification dataset [4]. At each level of FPN [14], the anchors have 2 suboctave scales (${2}^{k/2}$, for $k\le 1$) and 3 aspect ratios [0.5, 1, 2]. We fix the batch normalization layers to be frozen in training phase. We adopt the minibatch training on 2 GPUs with 8 images per GPU. All evaluated models are trained for 160 epochs with an initial learning rate of 0.001 which is then divided by 10 at 110 epochs and again at 140 epochs. Weight decay of 0.0001 and momentum of 0.9 are used. We adopt the same data augmentation strategies as [18], while do not use any data augmentation during testing phase. In training phase, the input image is fixed to 512$\times $512, while in testing phase, we maintain the original aspect ratio and resize the image to ensure the shorter side with 600 pixels. We apply the nonmaximum suppression with IoU of 0.5 for each class.
MS COCO: All models are trained on the widely used trainval35k set (80k train images and 35k subset of val images), and tested on minival set (5k subset of val images) or testdev set. We train the networks for 100 epochs with an initial learning rate of 0.001 which is then divided by 10 at 60 epochs and again at 80 epochs. Other details are similar to that for PASCAL VOC.
4.2 Ablation Study
We first investigate the impact of our design settings of the proposed framework. We fix the ResNet50 as backbone and conduct several controlled experiments on PASCAL VOC2007 test set (and COCO minival if stated) for this ablation study.
4.2.1 Comparison on Different Parameter Settings
Training Loss  PASCAL VOC  COCO  

AP  AP${}_{50}$  AP${}_{75}$  AP  AP${}_{50}$  AP${}_{75}$  
CELoss + OHEM  49.1  81.5  51.5  30.8  50.9  32.6 
Focal Loss  51.3  80.9  55.3  33.9  55.0  35.7 
AUCLoss  49.3  79.7  51.8  25.5  44.9  26.0 
APLoss  53.1  82.3  58.1  35.0  57.2  36.6 
Here we study the impact of the practical modifications introduced in Section 3.4. All results are shown in Table 1.
Minibatch Training: First, we study the minibatch training, and report detector results at different batchsize in (a). It shows that larger batchsize (i.e. 8) outperforms all the other smaller batchsize. This verifies our previous hypothesis that large minibatch training helps to eliminate the “scoreshift” from different images, and thus stabilizes the APloss through robust gradient calculation. Hence, $\text{batchsize}=8$ is used in our further studies.
Piecewise Step Function: Second, we study the piecewise step function, and report detector performance on the piecewise step function with different $\delta $ in (b). As mentioned before, we argue that the choice of $\delta $ is trivial and is dependent on other network hyperparameters such as weight decay. Smaller $\delta $ makes the function sharper, which yields unstable training at initial phase. Larger $\delta $ makes the function deviate from the properties of the original APloss, which also worsens the performance. $\delta =1$ is a good choice we used in our further studies.
Interpolated AP: Third, we study the impact of interpolated AP in our optimization algorithm, and list the results in (c). Marginal benefits are observed for interpolated AP over standard AP, so we use interpolated AP in all the following studies.
4.2.2 Comparison on Different Losses
We evaluate with different losses on RetinaNet [15]. Results are shown in Table 2. We compare traditional classification based losses like focal loss [15] and cross entropy loss (CEloss) with OHEM [18] to the ranking based losses like AUCloss and APloss. Although focal loss is significantly better than CEloss with OHEM on COCO dataset, it is interesting that focalloss does not perform better than CEloss at AP${}_{50}$ on PASCAL VOC. This is likely because the hyperparameters of focal loss are designed to suit the imbalance condition on COCO dataset which is not suitable for PASCAL VOC, so that focal loss cannot generalize well to PASCAL VOC without tuning its hyperparameters. The proposed APloss performs much better than all the other losses on both two datasets, which demonstrates its effectiveness and stronger generalization ability on handling the imbalance issue. It is worth noting that AUCloss performs much worse than APloss, which may be due to the fact that AUC has equal penalty for each misordered pair while AP imposes greater penalty for the misordering at higher positions in the predicted ranking. It is obvious that object detection evaluation concerns more on objects with higher confidence, which is why AP provides a better loss measure. Furthermore, an assessment of the detection performance at different training iterations, as shown in (a), outlines the superiority of the APloss for snapshot time points.
Method  Backbone  MultiScale  VOC07  VOC12  COCO  
AP${}_{50}$  AP${}_{50}$  AP  AP${}_{50}$  AP${}_{75}$  AP${}_{S}$  AP${}_{M}$  AP${}_{L}$  
YOLOv2 [24]  DarkNet19  ✗  78.6  73.4  21.6  44.0  19.2  5.0  22.4  35.5 
DSOD300 [30]  DS/64192481  ✗  77.7  76.3  29.3  47.3  30.6  9.4  31.5  47.0 
SSD512 [18]  VGG16  ✗  79.8  78.5  28.8  48.5  30.3       
SSD513 [6]  ResNet101  ✗  80.6  79.4  31.2  50.4  33.3  10.2  34.5  49.8 
DSSD513 [6]  ResNet101  ✗  81.5  80.0  33.2  53.3  35.2  13.0  35.4  51.1 
DES512 [39]  VGG16  ✗  81.7  80.3  32.8  53.2  34.6  13.9  36.0  47.6 
RFBNet512 [17]  VGG16  ✗  82.2    33.8  54.2  35.9  16.2  37.1  47.4 
PFPNetR512 [10]  VGG16  ✗  82.3  80.3  35.2  57.6  37.9  18.7  38.6  45.9 
RefineDet512 [38]  VGG16  ✗  81.8  80.1  33.0  54.5  35.5  16.3  36.3  44.3 
RefineDet512 [38]  ResNet101  ✗      36.4  57.5  39.5  16.6  39.9  51.4 
RetinaNet500 [15]  ResNet101  ✗      34.4  53.1  36.8  14.7  38.5  49.1 
RetinaNet500+APLoss (ours)  ResNet101  ✗  83.9  83.1  37.4  58.6  40.5  17.3  40.8  51.9 
PFPNetR512 [10]  VGG16  ✓  84.1  83.7  39.4  61.5  42.6  25.3  42.3  48.8 
RefineDet512 [38]  VGG16  ✓  83.8  83.5  37.6  58.7  40.8  22.7  40.3  48.3 
RefineDet512 [38]  ResNet101  ✓      41.8  62.9  45.7  25.6  45.1  54.1 
RetinaNet500+APLoss (ours)  ResNet101  ✓  84.9  84.5  42.1  63.5  46.4  25.6  45.0  53.9 
4.2.3 Comparison on Different Optimization Methods
We also compare our optimization method with the approximate gradient method [34, 9] and structured hinge loss method [20]. Both [34, 9] approximate the APloss with a smooth expectation and envelope function, respectively. Following their guidance, we replace the step function in APloss with a sigmoid function to constrain the gradient to neither zero nor undefined, while still keep the shape similar to the original function. Same as [9], we adopt the log space objective function, i.e. $log(\text{AP}+\u03f5)$, to allow the model to quickly escape from the initial state. We train the detector on VOC2007 trainval set and turn off the bounding box regression task. The convergence curves shown in (b) reveal some essential observations. It can be seen that APloss optimized by approximate gradient method does not even converge, likely because its nonconvexity and nonquasiconvexity fail on a direct gradient descent method. Meanwhile, APloss optimized by the structured hinge loss method [20] converges slowly and stabilizes near 0.8, which is significantly worse than the asymptotic limit of APloss optimized by our errordriven update scheme. We believe that this method does not optimize the APloss directly but rather an upper bound of it, which is controlled by a discriminant function [20]. In ranking task, this discriminant function is handpicked and has an AUClike form, which may cause variability in optimization.
4.3 Benchmark Results
With the settings selected in ablation study, we conduct experiments to compare the proposed detector to stateoftheart onestage detectors on three widely used benchmark, i.e. VOC2007 test, VOC2012 test and COCO testdev sets. We use ResNet101 as backbone networks instead of ResNet50 in ablation study. We use an image scale of 500 pixels for testing. Table 3 lists the benchmark results comparing to recent stateoftheart onestage detectors such as SSD [18], YOLOv2 [24], DSSD [6], DSOD [30], DES [39], RetinaNet [15], RefineDet [38], PFPNet [10], RFBNet [17]. Compared to the baseline model RetinaNet500 [15], our detector achieves a 3.0% improvement (37.4% vs. 34.4%) on COCO dataset. Figure 6 illustrates some detection results by the RetinaNet with focal loss and our APloss. Besides, our detector outperforms all the other methods for both singlescale and multiscale tests in all the three benchmarks. We should emphasize that this verifies the great effectiveness of our APloss since our detector achieves such a great performance gain just by replacing the focalloss with our APloss in RetinaNet without whistle and bells, without using advanced techniques like deformable convolution [3], SNIP [33], group normalization [36], etc. The performance could be further improved with these kinds of techniques and other possible tricks. Our detector has the same detection speed (i.e., $\sim $$11fps$ on one NVidia TitanX GPU) as RetinaNet500 [15] since it does not change the network architecture for inference.
5 Conclusion
In this paper, we address the class imbalance issue in onestage object detectors by replacing the classification subtask with a ranking subtask, and proposing to solve the ranking task with APLoss. Due to nondifferentiability and nonconvexity of the APloss, we propose a novel algorithm to optimize it based on errordriven update scheme from perceptron learning. We provide a grounded theoretical analysis of the proposed optimization algorithm. Experimental results show that our approach can significantly improve the stateoftheart onestage detectors.
Acknowledgements. This paper is supported in part by: National Natural Science Foundation of China (61471235), Shanghai ’The Belt and Road’ Young Scholar Exchange Grant (17510740100), CREST Malaysia (No. T03C117), and the PKUNTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. We gratefully acknowledge the support from Tencent YouTu Lab.
References
 [1] JK Anlauf and M Biehl. The adatron: an adaptive perceptron algorithm. EPL (Europhysics Letters), 10(7):687, 1989.
 [2] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. Rfcn: Object detection via regionbased fully convolutional networks. In NIPS, pages 379–387, 2016.
 [3] Jifeng Dai, Haozhi Qi, Yuwen Xiong, et al. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
 [4] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, pages 248–255, 2009.
 [5] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
 [6] ChengYang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017.
 [7] Ross Girshick. Fast rcnn. In ICCV, pages 1440–1448, 2015.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [9] Paul Henderson and Vittorio Ferrari. Endtoend training of object class detectors for mean average precision. In ACCV, pages 198–213, 2016.
 [10] SeungWook Kim, HyongKeun Kook, JeeYoung Sun, MunCheon Kang, and SungJea Ko. Parallel feature pyramid network for object detection. In ECCV, pages 234–250, 2018.
 [11] Werner Krauth and Marc Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A: Mathematical and General, 20(11):L745, 1987.
 [12] Jianguo Li and Y Zhang. Learning surf cascade for fast and accurate object detection. In CVPR, 2013.
 [13] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. TinyDSOD: Lightweight object detection for resourcerestricted usages. In BMVC, 2018.
 [14] TsungYi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 3, 2017.
 [15] TsungYi Lin, Priyal Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Trans on PAMI, 2018.
 [16] TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
 [17] Songtao Liu, Di Huang, and andYunhong Wang. Receptive field block net for accurate and fast object detection. In The European Conference on Computer Vision (ECCV), September 2018.
 [18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37, 2016.
 [19] Pritish Mohapatra, CV Jawahar, and M Pawan Kumar. Efficient optimization for average precision svm. In NIPS, pages 2312–2320, 2014.
 [20] Pritish Mohapatra, Michal Rolinek, C.V. Jawahar, Vladimir Kolmogorov, and M. Pawan Kumar. Efficient optimization for rankbased loss functions. In CVPR, 2018.
 [21] A Novikoff. On convergence proofs for perceptrons. Proc.sympos.math.theory of Automata, pages 615–622, 1963.
 [22] Yongming Rao, Dahua Lin, Jiwen Lu, and Jie Zhou. Learning globally optimized object detector via policy gradient. In CVPR, pages 6190–6198, 2018.
 [23] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, realtime object detection. In CVPR, pages 779–788, 2016.
 [24] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
 [25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In NIPS, pages 91–99, 2015.
 [26] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
 [27] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1986.
 [28] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
 [29] Shai ShalevShwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
 [30] Zhiqiang Shen, Zhuang Liu, Jianguo Li, YuGang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In ICCV, 2017.
 [31] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training regionbased object detectors with online hard example mining. In CVPR, 2016.
 [32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [33] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection–snip. In CVPR, 2018.
 [34] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In ICML, pages 2169–2177, 2016.
 [35] A Wendemuth. Learning the unlearnable. Journal of Physics A: Mathematical and General, 28(18):5423, 1995.
 [36] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
 [37] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271–278. ACM, 2007.
 [38] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Singleshot refinement neural network for object detection. In CVPR, 2018.
 [39] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L. Yuille. Singleshot object detection with enriched semantics. In CVPR, 2018.
A1. Convergence
We provide proof for the proposition mentioned in Section 3.3.1 of the paper. The proof is generalized from the original convergence proof [21] for perceptron learning algorithm.
Proposition 2
The APloss optimizing algorithm is guaranteed to converge in finite steps if below conditions hold: (1) the learning model is linear;
(2) the training data is linearly separable.
Proof. Let $\bm{\theta}$ denote the weights of the linear model. Let ${\bm{f}}_{k}^{(n)}$ denote the feature vector of $k$th box in $n$th training sample. Assume the number of training samples is finite and each training sample contains at most $M$ boxes. Hence the score of $k$th box is ${s}_{k}^{(n)}=\u27e8{\bm{f}}_{k}^{(n)},\bm{\theta}\u27e9$. Define ${x}_{ij}^{(n)}=({s}_{i}^{(n)}{s}_{j}^{(n)})$. Note that the training data is separable, which means there are $\u03f5>0$ and ${\bm{\theta}}^{*}$ that satisfy:
$$\forall n,\forall i\in {\mathcal{P}}^{\left(n\right)},\forall j\in {\mathcal{N}}^{\left(n\right)},\u27e8{\bm{f}}_{i}^{\left(n\right)},{\bm{\theta}}^{*}\u27e9\ge \u27e8{\bm{f}}_{j}^{\left(n\right)},{\bm{\theta}}^{*}\u27e9+\u03f5$$  (15) 
In the $t$th step, a training sample which makes an error (if there is no such training sample, the model is already optimal and algorithm will stop) is randomly chosen. Then the update of $\bm{\theta}$ is:
$${\bm{\theta}}^{\left(t+1\right)}={\bm{\theta}}^{\left(t\right)}+\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\left(\bm{x}\right)\cdot \left({\bm{f}}_{i}{\bm{f}}_{j}\right)$$  (16) 
where
$${L}_{ij}\left(\bm{x}\right)=\frac{H\left({x}_{ij}\right)}{1+{\sum}_{k\ne i}H\left({x}_{ik}\right)}$$  (17) 
Here, since the discussion centers on the current training sample, we omit the superscript hereon.
From (16), we have
$$\begin{array}{cc}\hfill \u27e8{\bm{\theta}}^{\left(t+1\right)},{\bm{\theta}}^{*}\u27e9& =\u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9+\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\u27e8\left({\bm{f}}_{i}{\bm{f}}_{j}\right),{\bm{\theta}}^{*}\u27e9\hfill \\ & \ge \u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9+\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\u03f5\hfill \\ & \ge \u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9+\underset{i\in \mathcal{P},j\in \mathcal{N}}{\mathrm{max}}\left\{{L}_{ij}\right\}\u03f5\hfill \\ & \ge \u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9+\frac{1}{\left\mathcal{P}\right+\left\mathcal{N}\right}\u03f5\hfill \\ & \ge \u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9+\frac{1}{M}\u03f5\hfill \end{array}$$  (18) 
For convenience, let ${\bm{\theta}}^{(0)}=0$ (if ${\bm{\theta}}^{(0)}\ne 0$, we can still find a $c>0$ that satisfies (20) for sufficiently large $t$), we have
$$\u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9\ge \frac{1}{M}\u03f5\cdot t$$  (19) 
Then
$$\parallel {\bm{\theta}}^{\left(t\right)}\parallel \ge \frac{\u27e8{\bm{\theta}}^{\left(t\right)},{\bm{\theta}}^{*}\u27e9}{\parallel {\bm{\theta}}^{*}\parallel}\ge \frac{1}{M\cdot \parallel {\bm{\theta}}^{*}\parallel}\u03f5\cdot t\ge c\cdot t$$  (20) 
Here, $c$ is a positive constant.
From (16), we also have
$$\begin{array}{cc}& {\parallel {\bm{\theta}}^{\left(t+1\right)}\parallel}^{2}\hfill \\ \hfill =& {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+{\parallel \sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\left({\bm{f}}_{i}{\bm{f}}_{j}\right)\parallel}^{2}\hfill \\ & +2\u27e8\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\left({\bm{f}}_{i}{\bm{f}}_{j}\right),{\bm{\theta}}^{\left(t\right)}\u27e9\hfill \\ \hfill =& {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+{\parallel \sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\left({\bm{f}}_{i}{\bm{f}}_{j}\right)\parallel}^{2}+2\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}{x}_{ji}\hfill \\ \hfill \le & {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+{\parallel \sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}{L}_{ij}\left({\bm{f}}_{i}{\bm{f}}_{j}\right)\parallel}^{2}\hfill \\ \hfill \le & {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+\left\mathcal{P}\right\cdot \left\mathcal{N}\right\cdot \underset{i\in \mathcal{P},j\in \mathcal{N}}{\mathrm{max}}\left\{{\parallel {\bm{f}}_{i}{\bm{f}}_{j}\parallel}^{2}\right\}\hfill \\ \hfill \le & {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+{M}^{2}\cdot \underset{n,i,j}{\mathrm{max}}\left\{{\parallel {\bm{f}}_{i}^{\left(n\right)}{\bm{f}}_{j}^{\left(n\right)}\parallel}^{2}\right\}\hfill \\ \hfill \le & {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}+C\hfill \end{array}$$  (21) 
Here, $C$ is a positive constant. Let ${\bm{\theta}}^{(0)}=0$ (again, if ${\bm{\theta}}^{(0)}\ne 0$, we can still find a $C>0$ that satisfies (22) for sufficiently large $t$), we arrive at:
$${\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}\le C\cdot t$$  (22) 
Then, combining (20) and (22), we have
$${c}^{2}\cdot {t}^{2}\le {\parallel {\bm{\theta}}^{\left(t\right)}\parallel}^{2}\le C\cdot t$$  (23) 
which means
$$t\le \frac{C}{{c}^{2}}$$  (24) 
It shows that the algorithm will stop at most after $C/{c}^{2}$ steps, which means that the training model will achieve the optimal solution in finite steps.
A2. An Example of Gradient Descent Failing on Smoothed APloss
We approximate the step function in APloss by sigmoid function to make it amenable to gradient descent. Specifically, the smoothed APloss function is given by:
$$F=\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\sum _{j\in \mathcal{N}}\frac{S\left({x}_{ij}\right)}{1+{\sum}_{k\ne i}S\left({x}_{ik}\right)}$$  (25) 
where
$$S\left(x\right)=\frac{{e}^{x}}{1+{e}^{x}}$$  (26) 
Consider a linear model $s={f}_{1}{\theta}_{1}+{f}_{2}{\theta}_{2}$ and three training samples $(0,0),(1,0),(3,1)$ (the first one is negative sample, others are positive samples). Then we have
$$\begin{array}{cc}& {s}^{\left(1\right)}=0\cdot {\theta}_{1}+0\cdot {\theta}_{2}\hfill \\ & {s}^{\left(2\right)}=1\cdot {\theta}_{1}+0\cdot {\theta}_{2}\hfill \\ & {s}^{\left(3\right)}=3\cdot {\theta}_{1}+1\cdot {\theta}_{2}\hfill \end{array}$$  (27) 
Note that the training data is separable since we have ${s}^{(2)}>{s}^{(1)}$ and ${s}^{(3)}>{s}^{(1)}$ when $$.
Under this setting, the smoothed APloss become
$$\begin{array}{cc}& F({\theta}_{1},{\theta}_{2})=\frac{1}{2}(\frac{S\left({\theta}_{1}\right)}{1+S\left({\theta}_{1}\right)+S\left({\theta}_{2}4{\theta}_{1}\right)}\hfill \\ & +\frac{S\left(3{\theta}_{1}{\theta}_{2}\right)}{1+S\left(4{\theta}_{1}{\theta}_{2}\right)+S\left(3{\theta}_{1}{\theta}_{2}\right)})\hfill \end{array}$$  (28) 
If ${\theta}_{1}$ is sufficiently large and ${\theta}_{1}>{\theta}_{2}>0$, then the partial derivatives satisfy the following condition:
$$  (29) 
which means ${\theta}_{1}$ and ${\theta}_{2}$ will keep increasing with the inequality ${\theta}_{1}>{\theta}_{2}$ according to the gradient descent algorithm. Hence the objective function $F$ will approach $1/6$ here. However, the objective function $F$ approaches the global minimum value $0$ if and only if ${\theta}_{1}\to +\mathrm{\infty}$ and ${\theta}_{2}3{\theta}_{1}\to +\mathrm{\infty}$. This shows that the gradient descent fails to converge to global minimum in this case.
A3. Inseparable Case
In this section, we will provide analysis for our algorithm with inseparable training data. We demonstrate that the bound of accumulated APloss depends on the best performance of learning model. The analysis is based on online learning bounds [29].
A3.1. Preliminary
To handle the inseparable case, a mild modification on the proposed algorithm is needed, i.e. in the errordriven update scheme, ${L}_{ij}$ is modified to
$${\stackrel{~}{L}}_{ij}=\frac{\stackrel{~}{H}\left({x}_{ij}\right)}{1+{\sum}_{k\in \mathcal{P}\cup \mathcal{N},k\ne i}H\left({x}_{ik}\right)}$$  (30) 
where $\stackrel{~}{H}(\cdot )$ is defined in Section 3.4.2 (Piecewise Step Function) of the paper. The purpose is to introduce a nonzero decision margin for the pairwise score ${x}_{ij}$ which makes the algorithm more robust in the inseparable case. In contrast to the case in Section 3.4.2, here we only change $H(\cdot )$ to $\stackrel{~}{H}(\cdot )$ in the numerator for the convenience of theoretical anaysis. However, such algorithm still suffers from the discontinuity of $H(\cdot )$ in the denominator. Hence the strategy in Section 3.4.2 is also practical consideration, necessary for good performance. Then, consider the APloss:
$${\mathcal{L}}_{AP}(\bm{x};\mathcal{P},\mathcal{N})=\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{{\sum}_{j\in \mathcal{N}}H\left({x}_{ij}\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({x}_{ij}\right)}$$  (31) 
and define a surrogate loss function:
$$l(\bm{x},\widehat{\bm{x}};\mathcal{P},\mathcal{N})=\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{{\sum}_{j\in \mathcal{N}}Q\left({x}_{ij}\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({\widehat{x}}_{ij}\right)}$$  (32) 
where $Q(x)={\int}_{\mathrm{\infty}}^{x}\stackrel{~}{H}(\upsilon )\mathit{d}\upsilon $. Note that the APloss is upper bounded by the surrogate loss:
$$l(\bm{x},\bm{x};\mathcal{P},\mathcal{N})\ge \frac{\delta}{4}{\mathcal{L}}_{AP}(\bm{x};\mathcal{P},\mathcal{N})$$  (33) 
The learning model can be written as $\bm{x}={\bm{X}}_{d}(\bm{\theta})$, where $d\in \mathcal{D}$ denotes the training data for one iteration and $D$ is the whole training set. Then, the modified errordriven algorithm is equivalent to gradient descent on surrogate loss $l({\bm{X}}_{{d}^{(t)}}(\bm{\theta}),{\bm{X}}_{{d}^{(t)}}({\bm{\theta}}^{(t)});{\mathcal{P}}_{{d}^{(t)}},{\mathcal{N}}_{{d}^{(t)}})$ at each step $t$. We further suppose below conditions are satisfied:
(1) For all $\widehat{\bm{\theta}}$ and $d\in \mathcal{D}$, $l({\bm{X}}_{d}(\bm{\theta}),{\bm{X}}_{d}(\widehat{\bm{\theta}});{\mathcal{P}}_{d},{\mathcal{N}}_{d})$ is convex w.r.t $\bm{\theta}$.
(2) For all $d\in \mathcal{D}$, $\parallel \partial {\bm{X}}_{d}(\bm{\theta})/\partial \bm{\theta}\parallel $ is upper bounded by a constant $R$. Here $\parallel \cdot \parallel $ is the matrix norm induced by the 2norm for vectors.
Remark 1. Note that these two conditions are satisfied if the learning model is linear.
A3.2. Bound of Accumulated Loss
By the convexity, we have:
$${l}^{\left(t\right)}\left(\bm{\theta}\right)\le {l}^{\left(t\right)}\left(\bm{u}\right)+\u27e8\bm{\theta}\bm{u},\frac{\partial {l}^{\left(t\right)}\left(\bm{\theta}\right)}{\partial \bm{\theta}}\u27e9$$  (34) 
where ${l}^{(t)}(\bm{\theta})$ denotes $l({\bm{X}}_{{d}^{(t)}}(\bm{\theta}),{\bm{X}}_{{d}^{(t)}}({\bm{\theta}}^{(t)});{\mathcal{P}}_{{d}^{(t)}},{\mathcal{N}}_{{d}^{(t)}})$ and $\bm{u}$ can be any vector of model weights. Then, let $\bm{\theta}={\bm{\theta}}^{(t)}$ and compute the sum over $t=1\sim T$, we have:
$$\begin{array}{cc}& \sum _{t=1}^{T}{l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)\sum _{t=1}^{T}{l}^{\left(t\right)}\left(\bm{u}\right)\le \sum _{t=1}^{T}\u27e8{\bm{\theta}}^{\left(t\right)}\bm{u},\frac{\partial {l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)}{\partial \bm{\theta}}\u27e9\hfill \\ & =\sum _{t=1}^{T}\u27e8{\bm{\theta}}^{\left(t\right)}\bm{u},\frac{1}{\eta}\left({\bm{\theta}}^{\left(t\right)}{\bm{\theta}}^{\left(t+1\right)}\right)\u27e9\hfill \\ & \le \frac{1}{2\eta}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}+\frac{1}{2\eta}\sum _{t=1}^{T}{\parallel {\bm{\theta}}^{\left(t\right)}{\bm{\theta}}^{\left(t+1\right)}\parallel}^{2}\hfill \\ & =\frac{1}{2\eta}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}+\frac{\eta}{2}\sum _{t=1}^{T}{\parallel \frac{\partial {l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)}{\partial \bm{\theta}}\parallel}^{2}\hfill \end{array}$$  (35) 
where $\eta $ is the step size of gradient descent. Note that
$$\frac{\partial {l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)}{\partial \bm{\theta}}={{\frac{\partial \bm{X}\left(\bm{\theta}\right)}{\partial \bm{\theta}}}_{\bm{\theta}={\bm{\theta}}^{\left(t\right)}}\cdot \frac{\partial l(\bm{x},{\bm{x}}^{\left(t\right)})}{\partial \bm{x}}}_{\bm{x}=\bm{X}\left({\bm{\theta}}^{\left(t\right)}\right)}$$  (36) 
and
$$\begin{array}{cc}& {\parallel \frac{\partial l(\bm{x},{\bm{x}}^{\left(t\right)})}{\partial \bm{x}}\parallel}^{2}=\frac{1}{{\left\mathcal{P}\right}^{2}}\sum _{i\in \mathcal{P}}\frac{{\sum}_{j\in \mathcal{N}}{\stackrel{~}{H}}^{2}\left({x}_{ij}\right)}{{\left(1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({x}_{ij}^{\left(t\right)}\right)\right)}^{2}}\hfill \\ & \le \frac{1}{{\left\mathcal{P}\right}^{2}}\sum _{i\in \mathcal{P}}\frac{\frac{1}{\delta}{\sum}_{j\in \mathcal{N}}Q\left({x}_{ij}\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({x}_{ij}^{\left(t\right)}\right)}\le \frac{1}{\delta}l(\bm{x},{\bm{x}}^{\left(t\right)})\hfill \end{array}$$  (37) 
Note that both ${\mathcal{P}}_{d}$ and ${\mathcal{N}}_{d}$ depend on $d$. However, we omit the subscript $d$ here since the discussion only centers on the current training sample ${d}^{(t)}$.
Hence we have:
$$\begin{array}{cc}& \sum _{t=1}^{T}{l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)\sum _{t=1}^{T}{l}^{\left(t\right)}\left(\bm{u}\right)\hfill \\ & \le \frac{1}{2\eta}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}+\frac{\eta {R}^{2}}{2\delta}\sum _{t=1}^{T}{l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right).\hfill \end{array}$$  (38) 
Let $\eta =\delta /{R}^{2}$, rearrange and get the expression:
$$\frac{1}{2}\sum _{t=1}^{T}{l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)\le \frac{{R}^{2}}{2\delta}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}+\sum _{t=1}^{T}{l}^{\left(t\right)}\left(\bm{u}\right)$$  (39) 
This entails the bound of surrogate loss $l$:
$$\begin{array}{c}\hfill \sum _{t=1}^{T}{l}^{\left(t\right)}\left({\bm{\theta}}^{\left(t\right)}\right)\le 2\sum _{t=1}^{T}{l}^{\left(t\right)}\left(\bm{u}\right)+\frac{{R}^{2}}{\delta}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}\end{array}$$  (40) 
which implies the bound of APloss ${\mathcal{L}}_{AP}$:
$$\sum _{t=1}^{T}{\mathcal{L}}_{AP}\left(\bm{X}\left({\bm{\theta}}^{\left(t\right)}\right)\right)\le \frac{8}{\delta}\sum _{t=1}^{T}{l}^{\left(t\right)}\left(\bm{u}\right)+\frac{4{R}^{2}}{{\delta}^{2}}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}$$  (41) 
As a special case, if there exists a $\bm{u}$ such that ${l}^{(t)}(\bm{u})=0$ for all $t$, then the accumulated APloss is bounded by a constant, which implies that convergence can be achieved with finite steps (similar to that of the separable case). Otherwise, with sufficiently large $T$, the average APloss mainly depends on $\frac{1}{T}\frac{8}{\delta}{\sum}_{t=1}^{T}{l}^{(t)}(\bm{u})$. This implies that the bound is meaningful if there still exists a sufficiently good solution $\bm{u}$ in such inseparable case.
A3.3. Offline Setting
With the offline setting (${d}^{(t)}=d$ for all $t$), a bound with simpler form can be revealed. For simplicity, we will omit the subscript $d$ of ${\bm{X}}_{d}(\bm{u}),{\mathcal{P}}_{d},{\mathcal{N}}_{d}$ and define ${A}_{i}(\bm{u})={\sum}_{j\in \mathcal{N}}Q({X}_{ij}(\bm{u}))$, $Z(\bm{u})={\mathrm{max}}_{i\in \mathcal{P}}\{{A}_{i}(\bm{u})\}$. Then,
$$\begin{array}{cc}& {l}^{\left(t\right)}\left(\bm{u}\right)=\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{{\sum}_{j\in \mathcal{N}}Q\left({X}_{ij}\left(\bm{u}\right)\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\hfill \\ & =\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{{A}_{i}\left(\bm{u}\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\hfill \\ & \le \frac{Z\left(\bm{u}\right)}{\left\mathcal{P}\right}\sum _{i=1}^{\left\mathcal{P}\right}\frac{1}{i}\le \frac{\mathrm{ln}\left\mathcal{P}\right+1}{\left\mathcal{P}\right}Z\left(\bm{u}\right)\hfill \end{array}$$  (42) 
The second last inequality is based on the fact that $(1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H({X}_{ij}({\bm{\theta}}^{(t)})))$ are picked from $1\sim (\mathcal{P}+\mathcal{N})$ without replacement (assume no ties; if ties exist, this inequality still holds). Combining the results from Equation 42 and Equation 41, we have:
$$\frac{1}{T}\sum _{t=1}^{T}{\mathcal{L}}_{AP}\left(\bm{X}\left({\bm{\theta}}^{\left(t\right)}\right)\right)\le \frac{\mathrm{ln}\left\mathcal{P}\right+1}{\left\mathcal{P}\right}\cdot \frac{8}{\delta}Z\left(\bm{u}\right)+\frac{1}{T}\frac{4{R}^{2}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}}{{\delta}^{2}}$$  (43) 
Next,
$$\begin{array}{cc}& {l}^{\left(t\right)}\left(\bm{u}\right)=\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{{A}_{i}\left(\bm{u}\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\hfill \\ & =\frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{1+{\sum}_{j\in \mathcal{P},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\hfill \\ & \cdot \frac{{A}_{i}\left(\bm{u}\right)}{1+{\sum}_{j\in \mathcal{P},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\hfill \\ & \le \frac{1}{\left\mathcal{P}\right}\sum _{i\in \mathcal{P}}\frac{1+{\sum}_{j\in \mathcal{P},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}{1+{\sum}_{j\in \mathcal{P}\cup \mathcal{N},j\ne i}H\left({X}_{ij}\left({\bm{\theta}}^{\left(t\right)}\right)\right)}\cdot Z\left(\bm{u}\right)\hfill \\ & =\left(1{\mathcal{L}}_{AP}\left(\bm{X}\left({\bm{\theta}}^{\left(t\right)}\right)\right)\right)\cdot Z\left(\bm{u}\right)\hfill \end{array}$$  (44) 
Combining the results from Equation 44 and Equation 41, we have:
$$\frac{1}{T}\sum _{t=1}^{T}{\mathcal{L}}_{AP}\left(\bm{X}\left({\bm{\theta}}^{\left(t\right)}\right)\right)\le \frac{\frac{8}{\delta}Z\left(\bm{u}\right)}{1+\frac{8}{\delta}Z\left(\bm{u}\right)}+\frac{1}{T}\frac{4{R}^{2}{\parallel \bm{u}{\bm{\theta}}^{\left(1\right)}\parallel}^{2}}{{\delta}^{2}}$$  (45) 
If $Z(\bm{u})$ is small, the bound in Equation 43 is active, otherwise the bound in Equation 45 is active. Consequently, we have:
$$\overline{{\mathcal{L}}_{AP}}\le \mathrm{min}\{\frac{\mathrm{ln}\left\mathcal{P}\right+1}{\left\mathcal{P}\right}\frac{8}{\delta}Z\left(\bm{u}\right),\frac{\frac{8}{\delta}Z\left(\bm{u}\right)}{1+\frac{8}{\delta}Z\left(\bm{u}\right)}\}+\u03f5$$  (46) 
where $\overline{{\mathcal{L}}_{AP}}$ denotes the average APloss, $\u03f5\to 0$ as $T$ increases.
A4. Consistency
Observation 2
When the activation function $L\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ takes the form of softmax function and lossaugmented step function, our optimization algorithm can be expressed as the gradient descent algorithm on crossentropy loss and hinge loss respectively.
Cross Entropy Loss: Consider the multiclass classification task. The outputs of neural network are $({x}_{1},\mathrm{\dots},{x}_{K})$ where $K$ is the number of classes, and the ground truth label is $y\in \{1,\mathrm{\dots},K\}$. Using softmax as the activation function, we have:
$$\begin{array}{cc}& ({L}_{1},\mathrm{\dots},{L}_{K})\hfill \\ & =softmax\left(\bm{x}\right)=(\frac{{e}^{{x}_{1}}}{{\sum}_{i}{e}^{{x}_{i}}},\mathrm{\dots},\frac{{e}^{{x}_{K}}}{{\sum}_{i}{e}^{{x}_{i}}})\hfill \end{array}$$  (47) 
The cross entropy loss is:
$${\mathcal{L}}_{ce}=\sum _{i}{\mathrm{\U0001d7cf}}_{y=i}\mathrm{log}\left({L}_{i}\right)$$  (48) 
Hence the gradient of ${x}_{i}$ is
$${g}_{i}={L}_{i}{\mathrm{\U0001d7cf}}_{y=i}$$  (49) 
Note that ${g}_{i}$ is “errordriven” with the desired output ${\mathrm{\U0001d7cf}}_{y=i}$ and current output ${L}_{i}$. This form is consistent with our errordriven update scheme (c.f. Section 3.2.1 of the paper).
Hinge Loss: Consider the binary classification task. The output of neural network is $x$, and the ground truth label is $y\in \{1,2\}$. Define $({x}_{1},{x}_{2})=(x,x)$. Using lossaugmented step function as the activation function, we have:
$$({L}_{1},{L}_{2})=(H\left({x}_{1}1\right),H\left({x}_{2}1\right))$$  (50) 
where $H(\cdot )$ is the Heaviside step function. The hinge loss is:
$${\mathcal{L}}_{hinge}={\mathrm{\U0001d7cf}}_{y=1}\mathrm{max}\{1{x}_{1},0\}+{\mathrm{\U0001d7cf}}_{y=2}\mathrm{max}\{1{x}_{2},0\}$$  (51) 
Hence the gradient of ${x}_{i}$ is
$${g}_{i}={\mathrm{\U0001d7cf}}_{y=i}\cdot \left({L}_{i}1\right)$$  (52) 
There are two cases. If $y=i$, the gradient ${g}_{i}$ is “errordriven” with the desired output $1$ and current output ${L}_{i}$. If $y\ne i$, the gradient ${g}_{i}$ equals zero, since ${x}_{i}$ does not contribute to the loss. This form is consistent with our errordriven update scheme (c.f. Section 3.2.1 of the paper).
A5. Additional Experiments on SSD
A5.1. Experimental Settings
We also evaluate the proposed APloss on another onestage detector SSD [18]. The models are trained on VOC2007 and VOC2012 trainval sets, and tested on VOC2007 test set. We use VGG16 [32] as the backbone model which is pretrained on the ImageNet1k classification dataset [4]. We use conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2, conv12_2 to predict both location and their corresponding confidences. An additional convolution layer is added after conv4_3 to scale the feature. The associated anchors are the same as that designed in [18]. In testing phase, the input image is fixed to 512$\times $512. For focal loss with SSD, we observe that the hyperparameters $\gamma =1,\alpha =0.25$ lead to a much better performance than the original settings in [15] which are $\gamma =2,\alpha =0.25$. Hence we evaluate the focal loss with new $\gamma $ and $\alpha $ in our experiments on SSD. Other details are similar to that in Section 4.1 of the paper.
A5.2. Results
Training Loss  PASCAL VOC  

AP  AP${}_{50}$  AP${}_{75}$  
CELoss + OHEM  43.6  76.0  44.7 
Focal Loss  39.3  69.9  38.0 
AUCLoss  33.8  63.7  31.5 
APLoss  45.2  77.3  47.3 
The results are shown in Table 4 and Figure 7. Note that the APloss outperforms all the other losses at both the final state and various snapshot time points. Together with the results on RetinaNet [15] in Section 4.2.2 of the paper, we observe the robustness of the proposed APloss, which performs much better than the other competing losses on different datasets (i.e. PASCAL VOC [5], MS COCO [16]) and different detectors (i.e. RetinaNet [15], SSD [18]). This demonstrates the effectiveness and strong generalization ability of our proposed approach.