SCL: Towards Accurate Domain Adaptive Object Detection via
Gradient Detach Based Stacked Complementary Losses
Unsupervised domain adaptive object detection aims to learn a robust detector under domain shift, where the training (source) domain is label-rich with bounding box annotations, while the testing (target) domain is label-agnostic and the feature distributions of the training and testing domains are dissimilar or even totally different. In this paper, we propose a gradient detach based stacked complementary losses (SCL) method that uses the detection objective (cross entropy and smooth $\ell_1$ regression) as the primary objective, and cuts in several auxiliary losses at different network stages to utilize information from the complement data (target images), which can be effective in adapting model parameters to both source and target domains. A gradient detach operation is applied between the detection and context sub-networks, which have different objectives, to force the networks to learn more discriminative representations. We argue that conventional training with the primary objective mainly leverages information from the source domain for maximizing likelihood and ignores the complement data in the shallow layers of networks, which leads to insufficient integration across domains. Thus, our proposed method is a more syncretic adaptation learning process. We conduct comprehensive experiments on seven datasets, and the results demonstrate that our method outperforms the state-of-the-art methods by a large margin. For instance, from Cityscapes to FoggyCityscapes, we achieve 37.9% mAP, outperforming the previous art Strong-Weak by 3.6%.
1 Introduction
In real-world scenarios, generic object detection faces severe challenges from variations in viewpoint, background, object appearance, illumination, occlusion conditions, scene change, etc. These unavoidable factors have made object detection under domain shift a challenging and rising research topic in recent years. Domain change is also a widely recognized, intractable problem that urgently needs a breakthrough in real detection tasks such as video surveillance and autonomous driving.
Revisiting Domain-Shift Object Detection. Common approaches for tackling domain-shift object detection fall mainly into two directions: (i) training a supervised model and then fine-tuning it on the target domain; or (ii) unsupervised cross-domain representation learning. The former requires additional instance-level annotations on target data, which is laborious, expensive and time-consuming, so most approaches focus on the latter, which still poses some challenges. The first challenge is that the representations of source and target domain data should be embedded into a common space for matching the object, such as the hidden feature space [26, 4], the input space [30, 2], or both. The second is that a feature alignment/matching operation or mechanism for the source/target domains should be further defined, such as subspace alignment, $\mathcal{H}$-divergence and adversarial learning, MRL, Strong-Weak alignment, etc. In general, our SCL is also a learning-based alignment method across domains with an end-to-end framework.
Our Key Ideas. The goal of this paper is to introduce a simple design that is specific to convolutional neural network optimization and improves training on tasks that adapt across discrepant domains. Unsupervised domain adaptation for recognition has been widely studied in a large body of previous literature [9, 20, 31, 23, 13, 22, 34, 32]. Our method draws on their merits, such as aligning source and target distributions with adversarial learning (domain-invariant alignment). However, object detection is a technically different problem from classification, since we would like to focus more on the objects of interest (regions).
Some recent work has proposed to conduct alignment only on local regions so as to improve the efficiency of model learning. However, this operation may cause a deficiency of critical information from context. Inspired by multi-feature/strong-weak alignment [26, 33, 12], which proposed to align corresponding local regions on shallow layers with a small receptive field (RF) and to align image-level features on deep layers with a large RF, we extend this idea by studying diverse complementary objectives and their potential combinations for the domain adaptive setting. Our experiments show that even with the existing objectives, after elaborating the different combinations and the training strategy, our method can obtain competitive results. Furthermore, we remove the context vector forwarded from the local and global branches in strong-weak alignment, and propose a new sub-network that learns the context features independently with a gradient detach updating strategy in a hierarchical manner. We observe that this simple design boosts the performance substantially.
The Relation to Complement Objective Training and Deep Supervision. COT proposed to involve an additional function that complements the primary objective, and updated the parameters alternately with the primary and complement objectives. Specifically, cross entropy is used as the primary objective:
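For reference, with a mini-batch of $N$ samples (the symbol $N$ and the name $H$ are notational assumptions made here), the standard cross-entropy form of this primary objective can be written as:

```latex
H(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{N}\sum_{i=1}^{N} \mathbf{y}_i^{\top} \log(\hat{\mathbf{y}}_i) \tag{1}
```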
where $\mathbf{y}_i$ is the label of the $i$-th sample in one-hot representation and $\hat{\mathbf{y}}_i$ is the predicted probabilities.
The complement entropy is defined in COT as the average of sample-wise entropies over complement classes in a mini-batch:
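Writing $\hat{\mathbf{y}}_{i\bar{c}}$ for the predicted probabilities of sample $i$ restricted to its complement classes, and $N$ for the mini-batch size (both notations are assumed here), this definition reads:

```latex
C(\hat{\mathbf{y}}_{\bar{c}}) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{H}(\hat{\mathbf{y}}_{i\bar{c}}) \tag{2}
```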
where $\mathcal{H}(\cdot)$ is the entropy function and $\hat{\mathbf{y}}_{\bar{c}}$ is the predicted probabilities of the complement classes $\bar{c}$. The training process is as follows: in each iteration, 1) update the parameters by the primary objective first; then 2) update the parameters by the complement objective. In contrast, we do not use the alternating strategy but update the parameters simultaneously with the primary and complement objectives, using a gradient detach strategy. Since we aim to let the network adapt to both source and target domain data while still distinguishing objects in them, our complement objective design is quite different from COT. We describe it in detail in Section 2.
In essence, our method is closer to the deeply supervised formulation, in which backpropagation of error proceeds not only from the final layer but also simultaneously from our intermediate complementary outputs. While DSN was proposed mainly to alleviate the “vanishing” gradient problem, here we focus on how to adopt these auxiliary losses to mix two different domains through domain classifiers for detection. Interestingly, we observe that diverse objectives can lead to better generalization for network adaptation. Motivated by this, we propose Stacked Complementary Losses (SCL), a simple yet effective approach for domain-shift object detection. Our SCL is easy and straightforward to implement, but can achieve remarkable performance. We conjecture that previous approaches that conduct domain alignment on high-level layers only cannot fully adapt shallow layer parameters to both source and target domains (even when local alignment is applied), which restricts the ability of model learning. Also, gradient detach is a critical part of learning with our complementary losses. We further visualize the features obtained by the non-adapted model, DA, Strong-Weak and ours; the features are from the last layer of the backbone before feeding into the Region Proposal Network (RPN). As shown in Figure 1, the target features obtained by our model are more compactly matched with the source domain than those of the other models.
Contributions. Our contributions are three-fold.
We propose an end-to-end learnable framework that adopts complementary losses for domain adaptive object detection. Our method allows information from source and target domains to be integrated seamlessly.
We propose a gradient detach learning strategy to enable complementary losses to learn a better representation and boost the performance. We also provide extensive ablation studies to empirically verify the effectiveness of each component in our framework design.
To the best of our knowledge, this is pioneering work investigating the influence of diverse loss functions and gradient detach for domain adaptive object detection. Thus, this work gives good intuition and practical guidance on multi-objective learning for domain adaptive object detection. More remarkably, our method achieves the highest accuracy on several domain adaptive or cross-domain object detection benchmarks, setting new records on this task. Our code and models are available at: https://github.com/harsh-99/SCL.
Following the common formulation of domain adaptive object detection, we define a source domain $\mathcal{S}$ where annotated bounding boxes are available, and a target domain $\mathcal{T}$ where only the images can be used in the training process, without any labels. Our purpose is to train a robust detector that can adapt well to both source and target domain data, i.e., we aim to learn a domain-invariant feature representation that works well for detection across two different domains.
2.1 Multi-Complement Objective Learning
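As shown in Figure 2, we focus on the complement objective learning and let $\mathcal{S} = \{(\mathbf{x}_i^{(s)}, \mathbf{y}_i^{(s)})\}$, where $\mathbf{x}_i^{(s)}$ denotes an image, $\mathbf{y}_i^{(s)}$ is the corresponding bounding box and category labels for sample $\mathbf{x}_i^{(s)}$, and $i$ is an index. Each label $\mathbf{y}^{(s)} = (y_c^{(s)}, y_b^{(s)})$ denotes a class label $y_c^{(s)}$, where $c$ is the category, and a 4-dimensional bounding-box coordinate $y_b^{(s)} \in \mathbb{R}^4$. For the target domain we only use image data for training, so $\mathcal{T} = \{\mathbf{x}_i^{(t)}\}$. We define a recursive function for the layers $k = 1, 2, \dots, K$ where we cut in complementary losses: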
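A minimal way to write such a recursion is sketched here. The symbols $\hat{\Theta}_k$ (feature map), $\mathcal{F}$ (feature function), $Z_k$ (layer input) and $K$ (number of cut-in layers) are assumed names, as is the convention that each layer's input is the previous layer's feature map:

```latex
\hat{\Theta}_k = \mathcal{F}(Z_k), \qquad Z_k = \hat{\Theta}_{k-1} \ \ (k = 2, \dots, K), \qquad Z_1 \equiv \mathbf{x} \tag{3}
```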
where $\hat{\Theta}_k$ is the feature map produced at layer $k$, $\mathcal{F}$ is the function to generate features at layer $k$ and $Z_k$ is the input at layer $k$. We formulate the complement loss of domain classifier $k$ as follows:
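One concrete instance is given here as a sketch. It assumes a cross-entropy style discriminator $D_k$ whose output is the probability that a feature map comes from the source domain; this labeling convention and the use of expectations are assumptions, and any of the losses compared in Section 3.1 could take the place of the two terms:

```latex
\mathcal{L}_k(\hat{\Theta}_k^{(s)}, \hat{\Theta}_k^{(t)}; D_k)
  = -\,\mathbb{E}\big[\log D_k(\hat{\Theta}_k^{(s)})\big]
    -\,\mathbb{E}\big[\log\big(1 - D_k(\hat{\Theta}_k^{(t)})\big)\big] \tag{4}
```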
where $D_k$ is the $k$-th domain classifier or discriminator, and $\hat{\Theta}_k^{(s)}$ and $\hat{\Theta}_k^{(t)}$ denote feature maps from the source and target domains respectively. Following [4, 26], we also adopt the gradient reverse layer (GRL) to enable adversarial training, where a GRL is placed between the domain classifier and the detection backbone network. During backpropagation, the GRL reverses the gradient that passes from the domain classifier to the detection network.
For our instance-context alignment loss $\mathcal{L}_{ILoss}$, we take the instance-level representation and the context vector as inputs. The instance-level vectors come from the RoI layer, where each vector focuses on the representation of a local object only. The context vector comes from our proposed sub-network, which combines hierarchical global features. We concatenate each instance feature with the same context vector. Since context information is fairly different from objects, jointly training the detection and context networks will mix the critical information from each part; here we propose a better solution that uses a detach strategy to update the gradients, which we introduce in detail in the next section. Aligning instance and context representations simultaneously can help to alleviate the variances of object appearance, part deformation, object size, etc. in the instance vector and of illumination, scene, etc. in the context vector. We define $d_i$ as the domain label of the $i$-th training image, which takes one value for the source and another for the target, so the instance-context alignment loss can be further formulated as:
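As an illustration, the loss takes the usual binary cross-entropy form under two assumptions made here: the domain label is $d_i = 1$ for source images and $d_i = 0$ for target images, and $P_{(i,j)}$ is the predicted probability that proposal $j$ of image $i$ is from the source. The opposite labeling is equivalent up to swapping the two terms:

```latex
\mathcal{L}_{ILoss}
  = -\frac{1}{N_s}\sum_{i=1}^{N_s}\sum_{j} d_i \log P_{(i,j)}
    -\frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{j} (1 - d_i)\log\big(1 - P_{(i,j)}\big) \tag{5}
```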
where $N_s$ and $N_t$ denote the numbers of source and target examples, and $P_{(i,j)}$ is the output probability of the instance-context domain classifier for the $j$-th region proposal in the $i$-th image. So our total SCL objective $\mathcal{L}_{SCL}$ can be written as:
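Assuming the $K$ layer-wise domain classifier losses $\mathcal{L}_k$ and the instance-context loss $\mathcal{L}_{ILoss}$ are simply summed without extra weights (an assumption consistent with the single trade-off coefficient used later), the total reads:

```latex
\mathcal{L}_{SCL} = \sum_{k=1}^{K} \mathcal{L}_k + \mathcal{L}_{ILoss} \tag{6}
```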
2.2 Gradients Detach Updating
In this section, we introduce a simple detach strategy that prevents gradients from flowing from the context sub-network through the detection backbone path. We find that this helps to obtain more discriminative context, and we show empirical evidence (see Figure 5) that this path carries diverse information, so suppressing gradients from this path is beneficial for this task.
As mentioned above, we define a sub-network to generate the context information from early layers of the detection backbone. Intuitively, instance and context focus on perceptually different parts of an image, so their representations should also be discrepant. However, if we train with the conventional process, the companion sub-network is updated jointly with the detection backbone, which may make the two parts behave indistinguishably. To this end, we propose to suppress gradients during backpropagation and force the representation of the context sub-network to be dissimilar to that of the detection network, as shown in Algorithm 2.2. To the best of our knowledge, this may be the first work to show that gradient detach can help to learn a better context representation for domain adaptive object detection. Although detach-based methods have been adopted in a few works for better optimization on sequential tasks, our design and motivation are quite different. The details of our context sub-network architecture are illustrated in Appendix D.
INPUT: $G_c$ is the gradient of the context network, $G_d$ is the gradient of the detection network, $\mathcal{L}_{det}$ is the detection objective, $\mathcal{L}_{SCL}$ is the complementary objective;
1. Update context net by detection and instance-context objectives: $\mathcal{L}_{det}$ (w/o ) + $\mathcal{L}_{ILoss}$
3. Update detection net by detection and complementary objectives: $\mathcal{L}_{det}$ + $\mathcal{L}_{SCL}$
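To make the effect of detaching concrete, the following is a toy scalar sketch, not the authors' implementation. The function `scalar_grads`, the one-weight "backbone" `w`, the one-weight "context net" `v`, and the squared-error losses are all hypothetical stand-ins, chosen so that the gradients can be written out by hand. With `detach=True` the context loss updates only `v`; without it, the context loss also contributes to the backbone gradient.

```python
def scalar_grads(w, v, x, y, t, detach=True):
    """Hand-derived gradients for a toy two-branch model (illustrative only).

    backbone feature : f = w * x
    detection loss   : L_det = (f - y) ** 2
    context output   : c = v * f   (f is treated as a constant if detach=True)
    context loss     : L_ctx = (c - t) ** 2
    Returns (dL/dw, dL/dv) for L = L_det + L_ctx.
    """
    f = w * x
    c = v * f
    # Detection loss always back-propagates into the backbone weight.
    grad_w = 2.0 * (f - y) * x
    # Context loss always back-propagates into the context weight.
    grad_v = 2.0 * (c - t) * f
    if not detach:
        # Joint training: the context loss also reaches the backbone via f.
        grad_w += 2.0 * (c - t) * v * x
    return grad_w, grad_v


joint = scalar_grads(w=1.0, v=2.0, x=3.0, y=1.0, t=0.0, detach=False)
detached = scalar_grads(w=1.0, v=2.0, x=3.0, y=1.0, t=0.0, detach=True)
print("joint:", joint, "detached:", detached)
```

In an autograd framework the same effect is usually obtained by cutting the graph at the features fed to the context branch, for example with `Tensor.detach()` in PyTorch.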
2.3 Framework Overall
Our framework is based on Faster RCNN, including the Region Proposal Network (RPN) and other modules. The detection loss is summarized as:
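In its simplest form, assuming the two terms are added with equal weight and writing $\mathcal{L}_{det}$ for the detection loss (both are assumptions made here), this is:

```latex
\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{reg} \tag{7}
```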
where $\mathcal{L}_{cls}$ is the classification loss and $\mathcal{L}_{reg}$ is the bounding-box regression loss. To train the whole model using SGD, the overall objective function of the model is:
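A sketch of this objective in the usual adversarial min-max form follows. It assumes that $\mathcal{L}_{SCL}$ is a loss the domain classifiers $D$ minimize while the feature extractor $\mathcal{F}$ is trained against it through the gradient reverse layer; the argument lists are notational assumptions:

```latex
\max_{D}\ \min_{\mathcal{F},\,R}\ \ \mathcal{L}_{det}\big(\mathcal{F}(Z), R\big) - \lambda\,\mathcal{L}_{SCL}\big(\mathcal{F}(Z), D\big) \tag{8}
```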
where $\lambda$ is the trade-off coefficient between the detection loss and our complementary loss, and $R$ denotes the RPN and other modules in Faster RCNN. Following [4, 26], we feed one labeled source image and one unlabeled target image in each mini-batch during training.
3 Empirical Results
Table 1. AP (%) on a target domain, from Cityscapes to FoggyCityscapes.
| Method | person | rider | car | truck | bus | train | mcycle | bicycle | mAP |
| Faster RCNN (Non-adapted) | 24.1 | 33.1 | 34.3 | 4.1 | 22.3 | 3.0 | 15.3 | 26.5 | 20.3 |
| Diversify&match (CVPR’19) | 30.8 | 40.5 | 44.3 | 27.2 | 38.4 | 34.5 | 28.4 | 32.2 | 34.6 |
| MAF (ICCV’19) | 28.2 | 39.5 | 43.9 | 23.8 | 39.9 | 33.3 | 29.2 | 33.9 | 34.0 |
| Strong-Weak (Our impl. w/ VGG16) | 30.0 | 40.0 | 43.4 | 23.2 | 40.1 | 34.6 | 27.8 | 33.4 | 34.1 |
| Strong-Weak (Our impl. w/ Res101) | 29.1 | 41.2 | 43.8 | 26.0 | 43.2 | 27.0 | 26.2 | 30.6 | 33.4 |
| Our full model w/ VGG16 | 31.6 | 44.0 | 44.8 | 30.4 | 41.8 | 40.7 | 33.6 | 36.2 | 37.9 |
| Upper Bound | 33.2 | 45.9 | 49.7 | 35.6 | 50.0 | 37.4 | 34.7 | 36.2 | 40.3 |
LS: Least-squares Loss; CE: Cross-entropy Loss; FL: Focal Loss; ILoss: Instance-Context Alignment Loss.
Datasets. We evaluate our approach in three different domain shift scenarios: (1) Similar Domains; (2) Discrepant Domains; and (3) From Synthetic to Real Images. All experiments are conducted on seven domain shift datasets: Cityscapes to FoggyCityscapes, Cityscapes to KITTI, KITTI to Cityscapes, INIT Dataset, PASCAL to Clipart, PASCAL to Watercolor, and GTA (Sim 10K) to Cityscapes.
Implementation Details. In all experiments, we resize the shorter side of the image to 600 following [25, 26] with ROI-align. We train the model with the SGD optimizer, and the learning rate is divided by 10 after every 50,000 iterations. Unless otherwise stated, we set $\lambda$ to 1.0 and $\gamma$ to 5.0, and we use $K=3$ in our experiments (the analysis of the hyper-parameter $K$ is shown in Table 7). We report mean average precision (mAP) with an IoU threshold of 0.5 for evaluation.
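The step decay described above can be sketched as follows. This is a minimal illustration rather than the authors' training code: the function name `step_lr` and the example base rate of 1.0 are hypothetical, and only the decay factor (divide by 10) and the step size (50,000 iterations) come from the text.

```python
def step_lr(base_lr, iteration, step_size=50_000, factor=0.1):
    """Learning rate after `iteration` steps of a step-decay schedule.

    The rate is multiplied by `factor` once for every completed block of
    `step_size` iterations.
    """
    return base_lr * factor ** (iteration // step_size)


# Hypothetical base rate of 1.0, used only to show the relative decay.
for it in (0, 49_999, 50_000, 120_000):
    print(it, step_lr(1.0, it))
```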
3.1 How to choose complementary losses
Since few prior works explore the combination of different losses for domain adaptive object detection, we conduct an extensive ablation study to find the best combination for our SCL method. We follow some objective designs from DA and Strong-Weak [4, 26], which provide guidance for utilizing these losses.
Cross-entropy (CE) Loss. CE loss measures the performance of a classification model whose output is a probability value. It increases as the predicted probability diverges from the actual label:
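In standard notation, with $C$ classes, predicted probability $p_c$ and label $y_c$ for class $c$ (the name $\mathcal{L}_{CE}$ and the symbol $C$ are assumed here), the loss is:

```latex
\mathcal{L}_{CE}(p_c, y_c) = -\sum_{c=1}^{C} y_c \log(p_c) \tag{9}
```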
where $p_c$ is the predicted probability of class $c$ and $y_c$ is the label of class $c$.
Least-squares (LS) Loss. Following , we adopt LS loss to stabilize the training of the domain classifier for aligning low-level features. The loss is designed to align each receptive field of features with the other domain. The least-squares loss is formulated as:
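A sketch of this loss is given here under two assumptions: the domain classifier $D$ outputs a map of size $H \times W$, and the regression target is 1 at every location for source features and 0 for target features. The opposite labeling is equivalent up to swapping the two terms:

```latex
\mathcal{L}_{LS}
  = \frac{1}{HW}\sum_{w=1}^{W}\sum_{h=1}^{H}\big(1 - D(\hat{\Theta}^{(s)})_{wh}\big)^{2}
  + \frac{1}{HW}\sum_{w=1}^{W}\sum_{h=1}^{H} D(\hat{\Theta}^{(t)})_{wh}^{2} \tag{10}
```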
where $D(\hat{\Theta})_{wh}$ denotes the output of the domain classifier at each location $(w, h)$.
Focal Loss (FL). Focal loss  is adopted to ignore easy-to-classify examples and focus on those hard-to-classify ones during training:
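In the standard form of the focal loss, $p_t$ is the predicted probability of the true class and $\gamma$ is the focusing parameter (the name $\mathcal{L}_{FL}$ is assumed here):

```latex
\mathcal{L}_{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t) \tag{11}
```

With $\gamma = 0$ this reduces to the cross-entropy loss; larger $\gamma$ shrinks the contribution of examples whose $p_t$ is already close to 1.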
3.2 Ablation Studies from Cityscapes to FoggyCityscapes
We first investigate each component and design of our SCL framework from Cityscapes to FoggyCityscapes. Both source and target datasets have 2,975 images in the training set and 500 images in the validation set. We design several controlled experiments for this ablation study. A consistent setting is imposed on all the experiments, except where specific components or structures are examined. In this study, we train models with an ImageNet pre-trained ResNet-101 backbone; we also provide results with a pre-trained VGG16 model.
The results are summarized in Table 1. We present several combinations of four complementary objectives with their loss names and performance. We observe that “———” obtains the best accuracy with Context and Detach. Furthermore, our proposed method performed much better than baseline Strong-Weak  (37.9% vs.34.3%) and other state-of-the-arts.
3.3 Similar Domains
Between Cityscapes and KITTI. In this part, we study adaptation between two real and similar domains, taking KITTI and Cityscapes as our training and testing data. Following prior work, we use the KITTI training set, which contains 7,481 images. We conduct experiments on both adaptation directions, K→C and C→K, and evaluate our method using AP of car as in DA.
As shown in Table 2, our proposed method performs much better than the baseline and other state-of-the-art methods. Since Strong-Weak did not provide results on this dataset, we re-implement it and obtain 37.9% AP on K→C and 71.0% AP on C→K. Our method is 4% higher than the former and 1.7% higher than the latter. Compared to the non-adapted results (source only), our method is better by a large margin (about 10% and 20% higher, respectively).
| Method | K→C | C→K |
| DA (Our impl.) | 35.6 | 70.8 |
| WS (Our impl.) | 37.9 | 71.0 |
INIT Dataset. The INIT Dataset contains 132,201 images for training and 23,328 images for testing. There are four domains (sunny, night, rainy and cloudy) and three instance categories: car, person and speed limit sign. This dataset was first proposed for the instance-level image-to-image translation task; here we use it for domain adaptive object detection.
Our results are shown in Table 3. Following prior work, we conduct experiments on three domain pairs: sunny→night (s2n), sunny→rainy (s2r) and sunny→cloudy (s2c). Since the rainy domain has far fewer training images than the sunny domain, for the s2r experiment we randomly sample as many training images from the sunny set as there are in the rainy set and then train the detector. Our method is consistently better than the baseline method. We do not provide the results of s2c (faster) because we found that cloudy images are too similar to sunny ones in this dataset (nearly the same), so the non-adapted result is very close to that of the adapted methods.
3.4 Discrepant Domains
In this section, we focus on dissimilar domains, i.e., adaptation from real images to cartoon/artistic images. Following prior work, we use the PASCAL VOC dataset (the combined 2007+2012 training and validation sets for training) as the source data and Clipart or Watercolor as the target data. The backbone network is an ImageNet pre-trained ResNet-101.
PASCAL to Clipart. The Clipart dataset contains 1,000 images in total, with the same 20 categories as PASCAL VOC. As shown in Table 4, our proposed SCL outperforms all baselines. In addition, we observe that changing the loss used on the instance-context classifier can further improve the performance from 40.6% to 41.5%. More ablation results are shown in Appendix A.2 (Table 13).
PASCAL to WaterColor. The Watercolor dataset shares 6 categories with PASCAL VOC and has 2,000 images in total (1,000 images are used for training and 1,000 for evaluation). Results are summarized in Table 5; our SCL consistently outperforms other state-of-the-art methods.
3.5 From Synthetic to Real Images
Sim10K to Cityscapes. The Sim 10k dataset contains 10,000 training images generated by the gaming engine Grand Theft Auto (GTA). Following [4, 26], we use Cityscapes as the target domain and evaluate our models on the Car class. Our results, shown in Table 6, consistently outperform the baselines.
Hyper-parameter $K$. Table 7 shows the sensitivity to the hyper-parameter $K$ in Figure 2. This parameter controls the number of SCL losses and context branches. It can be observed that the proposed method performs best when $K=3$ on all three datasets.
| from Cityscapes to FoggyCityscapes | 32.7 | 37.9 | 34.5 |
| from PASCAL VOC to Clipart | 39.0 | 41.5 | 39.3 |
| from PASCAL VOC to Watercolor | 54.7 | 55.2 | 53.4 |
Parameter Sensitivity on $\lambda$ and $\gamma$. Figure 3 shows the parameter sensitivity of $\lambda$ and $\gamma$ in Eq. 8 and Eq. 11. $\lambda$ is the trade-off parameter between the SCL and detection objectives, and $\gamma$ controls the strength of hard samples in Focal Loss. We conduct experiments on two adaptations: Cityscapes → FoggyCityscapes (blue) and Sim10K → Cityscapes (red). On Cityscapes → FoggyCityscapes, the best accuracy is 37.9%; the values of $\lambda$ and $\gamma$ that give the best result on each adaptation can be read from Figure 3.
Analysis of IoU Threshold. The IoU threshold is an important indicator of detection quality, and a higher threshold requires better overlap with the ground truth. In our previous experiments, we use 0.5 as the threshold, as suggested in the literature [25, 4]. To explore the influence of the IoU threshold on performance, we plot performance vs. IoU threshold on three datasets. As shown in Figure 4, our method is better than the baselines by a large margin at most thresholds.
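For readers less familiar with the metric, the sketch below shows how an IoU threshold decides whether a predicted box counts as a match. It is a generic illustration, not the evaluation code used in the experiments; the helper names `box_iou` and `is_true_positive` and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes have no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth box if IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


pred, gt = (0, 0, 2, 2), (1, 0, 3, 2)
print(box_iou(pred, gt), is_true_positive(pred, gt), is_true_positive(pred, gt, 0.3))
```

The same prediction can therefore be a hit at a loose threshold and a miss at a strict one, which is why the comparison is repeated across thresholds.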
Why Can Gradient Detach Help Our Model? To further explore why gradient detach improves performance and what our model really learns, we visualize heatmaps on both source and target images from our models trained without and with detach. As shown in Figure 5, the visualization is plotted from the feature maps after Conv B3 in Figure 2. We can observe that the object areas and context from detach-trained models have stronger contrast than those from the model without detach (red and blue areas). This indicates that the detach-based model can learn more discriminative features from the target object and context. More visualizations are shown in Appendix F (Figure 8).
Detection Visualization. Figure 9 shows several qualitative comparisons of detection examples on three test sets with DA, Strong-Weak and our SCL models. Our method detects more small and blurry objects in dense scenes (FoggyCityscapes) and suppresses more false positives (Clipart and Watercolor) than the other two baselines.
5 Conclusion
In this paper, we have addressed unsupervised domain adaptive object detection through stacked complementary losses. One of our key contributions is gradient detach training, enabled by suppressing gradients flowing back to the detection backbone. In addition, we proposed to use multiple complementary losses for better optimization. We conducted extensive experiments and ablation studies to verify the effectiveness of each component that we proposed. Our experimental results outperform the state-of-the-art approaches by a large margin on a variety of benchmarks. Our future work will focus on exploring domain-shift detection from scratch, i.e., without pre-trained models, as in DSOD, to avoid bias from the pre-training dataset.
References
- (2019) H-detach: modifying the LSTM gradient towards better optimization. In ICLR.
- (2019) Exploring object relation in mean teacher for cross-domain detection. In CVPR.
- (2019) Complement objective training. In ICLR.
- (2018) Domain adaptive Faster R-CNN for object detection in the wild. In CVPR.
- (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
- (2015) Unsupervised domain adaptation by backpropagation. In ICML.
- (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research.
- (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Mask R-CNN. In ICCV.
- (2019) Multi-adversarial Faster-RCNN for unrestricted object detection. In ICCV.
- (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML.
- (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR.
- (2016) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. arXiv preprint arXiv:1610.01983.
- (2019) Self-training and adversarial background regularization for unsupervised domain adaptive one-stage object detection. In ICCV.
- (2019) Diversify and match: a domain adaptive representation learning paradigm for object detection. In CVPR.
- (2015) Deeply-supervised nets. In Artificial Intelligence and Statistics.
- (2017) Focal loss for dense object detection. In ICCV.
- (2016) Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research.
- (2018) Image to image translation for domain adaptation. In CVPR.
- (2017) Open set domain adaptation. In ICCV.
- (2015) Subspace alignment based domain adaptation for RCNN detector. arXiv preprint arXiv:1507.05578.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems.
- (2019) Strong-weak distribution alignment for adaptive object detection. In CVPR.
- (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992.
- (2019) Towards instance-level image-to-image translation. In CVPR.
- (2017) DSOD: learning deeply supervised object detectors from scratch. In ICCV.
- (2018) SPLAT: semantic pixel-level adaptation transforms for detection. arXiv preprint arXiv:1812.00929.
- (2017) Adversarial discriminative domain adaptation. In CVPR.
- (2019) Domain adaptation with asymmetrically-relaxed distribution alignment. In ICML.
- (2018) Collaborative and adversarial network for unsupervised domain adaptation. In CVPR.
- (2019) On learning invariant representations for domain adaptation. In ICML.
- (2019) Adapting object detectors via selective cross-domain alignment. In CVPR.
Appendix A More Ablation Studies
Tables 8 and 13 show the detailed results on target domains when adapting from PASCAL VOC to WaterColor and from PASCAL VOC to Clipart. We present results with different combinations of SCL losses and various ablation experiments.
A.1 From Pascal VOC to Watercolor Dataset
|AP on a target domain|
|W/O CLoss ()||77.1||53.1||49.6||41.0||39.3||67.9||54.7|
A.2 From Pascal VOC to Clipart Dataset
The results are shown in Table 13.
Appendix B Results on Source Domains
In this section, we show the adaptation results on source domains in Tables 9 to 12. Surprisingly, we observe that the models that perform best on the target domains do not perform best on the source data; e.g., from PASCAL VOC to WaterColor, DA obtained the highest results on source domain images (although the gaps with Strong-Weak and ours are marginal). We conjecture that the adaptation process for target domains affects learning and performance on source domains, even though we use the bounding box ground truth on source data for training. We will investigate this more thoroughly in future work, and we think the community may also need to consider whether evaluation on the source domain should be a metric for domain adaptive object detection, since it can help to understand the behavior of models on both source and target images.
Appendix C Detailed Results of Parameter Sensitivity on $\lambda$ and $\gamma$
|W/O CLoss ()||33.1||57.0||32.5||24.6||39.0||55.9||37.3||15.7||39.5||50.7||20.5||19.8||37.7||75.3||60.8||43.9||21.1||26.2||42.9||45.6||39.0|
Appendix D Context Network
Our context networks are shown in Table 14. We use three branches (forward networks) to deliver the context information, and each branch generates a 128-dimensional feature vector from the corresponding backbone layers of SCL. We then simply concatenate them to obtain the final 384-dimensional context feature vector.
Appendix E Visualization of Intermediate Feature Embedding
In this section, we visualize the intermediate feature embedding on three adaptation datasets. As shown in Figure 7, the gradient detach-based models can adapt source and target images to a similar distribution better than w/o detach models.