Temporal Attentive Alignment for Large-Scale Video Domain Adaptation

  • 2019-08-12 15:28:47
  • Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Woo, Ruxin Chen, Jian Zheng


Min-Hung Chen¹, Zsolt Kira¹, Ghassan AlRegib¹, Jaekwon Woo², Ruxin Chen², Jian Zheng³*
¹Georgia Institute of Technology  ²Sony Interactive Entertainment LLC  ³Binghamton University
*Work partially done as an SIE intern
Abstract

Although various image-based domain adaptation (DA) techniques have been proposed in recent years, domain shift in videos is still not well-explored. Most previous works only evaluate performance on small-scale datasets which are saturated. Therefore, we first propose two large-scale video DA datasets with much larger domain discrepancy: UCF-HMDBfull and Kinetics-Gameplay. Second, we investigate different DA integration methods for videos, and show that simultaneously aligning and learning temporal dynamics achieves effective alignment even without sophisticated DA methods. Finally, we propose Temporal Attentive Adversarial Adaptation Network (TA3N), which explicitly attends to the temporal dynamics using domain discrepancy for more effective domain alignment, achieving state-of-the-art performance on four video DA datasets (e.g. 7.9% accuracy gain over “Source only” from 73.9% to 81.8% on “HMDB → UCF”, and 10.3% gain on “Kinetics → Gameplay”). The code and data are released at http://github.com/cmhungsteve/TA3N.

1 Introduction

Figure 1: An overview of the proposed TA3N for video DA. In addition to spatial discrepancy between frame images, videos also suffer from temporal discrepancy between sets of time-ordered frames that contain multiple local temporal dynamics with different contributions to the overall domain shift, as indicated by the thickness of the green dashed arrows. Therefore, we propose to focus on aligning the temporal dynamics which have higher domain discrepancy using a learned attention mechanism to effectively align the temporal-embedded feature space for videos. Here we use the action basketball as the example.

Domain adaptation (DA) [32] has been studied extensively in recent years [5] to address the domain shift problem [37, 34]: models trained on a labeled source dataset do not generalize well to target datasets and tasks. DA is categorized in terms of the availability of annotations in the target domain. In this paper, we focus on the harder unsupervised DA problem, which requires training models that can generalize to target samples without access to any target labels. While many unsupervised DA approaches are able to diminish the distribution gap between source and target domains while learning discriminative deep features [25, 27, 11, 12, 24, 23, 39], most methods have been developed only for images and not videos.

Furthermore, unlike image-based DA work, there do not exist well-organized datasets to evaluate and benchmark the performance of DA algorithms for videos. The most common datasets are UCF-Olympic and UCF-HMDBsmall [44, 52, 17], which have only a few overlapping categories between source and target domains. This introduces limited domain discrepancy so that a deep CNN architecture can achieve nearly perfect performance even without any DA method (details in Section 5.2 and Table 2). Therefore, we propose two larger-scale datasets to investigate video DA: 1) UCF-HMDBfull: We collect 12 overlapping categories between UCF101 [43] and HMDB51 [21], which is around three times larger than both UCF-Olympic and UCF-HMDBsmall, and contains larger domain discrepancy (details in Section 5.2 and Tables 3 and 4). 2) Kinetics-Gameplay: We collect gameplay videos from several currently popular video games, covering 30 categories that overlap with Kinetics-600 [19, 2]. This dataset is much more challenging than UCF-HMDBfull due to the significant domain shift between the distributions of virtual and real data.

Videos can suffer from domain discrepancy along both the spatial and temporal directions, which calls for aligning the embedded feature spaces along both directions, as shown in Figure 1. However, most DA approaches have not explicitly addressed the domain shift problem in the temporal direction. Therefore, we first investigate different DA integration methods for video classification and show that: 1) aligning the features that encode temporal dynamics outperforms aligning only spatial features; 2) to effectively align domains spatio-temporally, which features to align is more important than which DA approach to use. To support our claims, we then propose Temporal Adversarial Adaptation Network (TA2N), which simultaneously aligns and learns temporal dynamics, outperforming other approaches which naively apply more sophisticated image-based DA methods to videos.

The temporal dynamics in videos can be represented as a combination of multiple local temporal features corresponding to different motion characteristics. Not all of the local temporal features equally contribute to the overall domain shift. We want to focus more on aligning those which have high contribution to the overall domain shift, such as the local temporal features connected by thicker green arrows shown in Figure 1. Therefore, we propose Temporal Attentive Adversarial Adaptation Network (TA3N) to explicitly attend to the temporal dynamics by taking into account the domain distribution discrepancy. In this way, the temporal dynamics which contribute more to the overall domain shift will be focused on, leading to more effective temporal alignment. TA3N achieves state-of-the-art performance on all four investigated video DA datasets.

In summary, our contributions are three-fold:

  1. Video DA Dataset Collection: We collect two large-scale video DA datasets, UCF-HMDBfull and Kinetics-Gameplay, to investigate the domain discrepancy problem across videos, which is an under-explored research problem. To our knowledge, they are by far the largest datasets for video DA problems.

  2. Feature Alignment Exploration for Video DA: We investigate different DA integration approaches for videos and provide a strategy to effectively align domains spatio-temporally for videos by aligning temporal relation features. We propose this simple but effective approach, TA2N, to demonstrate the importance of determining what to align over the DA method to use.

  3. Temporal Attentive Adversarial Adaptation Network (TA3N): We propose TA3N, which simultaneously aligns domains, encodes temporal dynamics into video representations, and attends to representations with domain distribution discrepancy. TA3N achieves state-of-the-art performance on both small- and large-scale cross-domain video datasets.

2 Related Works

Video Classification. With the rise of deep convolutional neural networks (CNNs), recent work for video classification mainly aims to learn compact spatio-temporal representations by leveraging CNNs for spatial information and designing various architectures to exploit temporal dynamics [18]. In addition to separating spatial and temporal learning, some works propose different architectures to encode spatio-temporal representations with consideration of the trade-off between performance and computational cost [46, 3, 36, 47]. Another branch of work utilizes optical flow to compensate for the lack of temporal information in raw RGB frames [42, 9, 49, 3, 29]. Moreover, some works extract temporal dependencies between frames for video tasks by utilizing recurrent neural networks (RNNs) [6], attention [28, 30] and relation modules [57]. Note that we focus on attending to the temporal dynamics to effectively align domains and we consider other modalities, e.g. optical flow, to be complementary to our method.

Domain Adaptation. Most recent DA approaches are based on deep learning architectures designed for addressing the domain shift problems given the fact that the deep CNN features without any DA method outperform traditional DA methods using hand-crafted features [7]. Most DA approaches follow the two-branch (source and target) architecture, and aim to find a common feature space between the source and target domains. The models are therefore optimized with a combination of classification and domain losses [5].

One of the main classes of methods used is Discrepancy-based DA, whose metrics are designed to measure the distance between source and target feature distributions, including variations of maximum mean discrepancy (MMD) [25, 26, 54, 53, 27] and the CORAL function [45]. By diminishing the distance of distributions, discrepancy-based DA methods reduce the gap across domains. Another common method, Adversarial-based DA, adopts a similar concept as GANs [13] by integrating domain discriminators into the architectures. Through the adversarial objectives, the discriminators are optimized to classify different domains, while the feature extractors are optimized in the opposite direction. ADDA [48] uses an inverted label GAN loss to split the optimization into two parts: one for the discriminator and the other for the generator. In contrast, the gradient reversal layer (GRL) is used in some work [11, 12, 55] to invert the gradients so that the discriminator and generator are optimized simultaneously. Additionally, Normalization-based DA [24, 23] adapts batch normalization [16] to DA problems by calculating two separate statistics, representing source and target, for normalization. Furthermore, Ensemble-based DA [10, 38, 39, 22] builds a target branch ensemble by incorporating multiple target branches. Recently, TADA [51] adopts the attention mechanism to adapt the transferable regions. We extend these concepts to spatio-temporal domains, aiming to attend to the important parts of temporal dynamics for alignment.

Video Domain Adaptation. Unlike image-based DA, video-based DA is still an under-explored area. Only a few works focus on small-scale video DA with only a few overlapping categories [44, 52, 17]. [44] improves the domain generalizability by decreasing the effect of the background. [52] maps source and target features to a common feature space using shallow neural networks. AMLS [17] adapts pre-extracted C3D [46] features on a Grassmann manifold obtained using PCA. However, the datasets used in the above works are too small to have enough domain shift to evaluate DA performance. Therefore, we propose two larger cross-domain datasets UCF-HMDBfull and Kinetics-Gameplay, and provide benchmarks with different baseline approaches. Recently, TSRNet [56] transfers knowledge for action localization using MMD, but only aligns the video-level features. Instead, our TA3N simultaneously attends, aligns, and encodes temporal dynamics into video features.

3 Technical Approach

We first introduce our baseline model, which simply extends image-based DA to videos using the temporal pooling mechanism (Section 3.1). We then investigate better ways to incorporate temporal dynamics for video DA (Section 3.2), and describe our final proposed method with the domain attention mechanism (Section 3.3).

3.1 Baseline Model

Given the recent success of large-scale video classification using CNNs [18], we build our baseline on such architectures, as shown in the lower part of Figure 2.

Figure 2: Baseline architecture (TemPooling) with the adversarial discriminators $\hat{G}_{sd}$ and $\hat{G}_{td}$. $\mathcal{L}_y$ is the class prediction loss, and $\mathcal{L}_{sd}$ and $\mathcal{L}_{td}$ are the domain losses. See the detailed architecture in the supplementary material.

We first feed the input video $X_i=\{x_i^1, x_i^2, \ldots, x_i^K\}$, extracted from ResNet [14] pre-trained on ImageNet, into our model, where $x_i^j$ is the $j$th frame-level feature representation of the $i$th video. The model can be divided into two parts: 1) the spatial module $G_{sf}(\cdot;\theta_{sf})$, which consists of multilayer perceptrons (MLP) that convert the general-purpose feature vectors into task-driven feature vectors, where the task is video classification in this paper; 2) the temporal module $G_{tf}(\cdot;\theta_{tf})$, which aggregates the frame-level feature vectors to form a single video-level feature vector for each video. In our baseline architecture, we conduct mean-pooling along the temporal direction to generate video-level feature vectors, and denote it as TemPooling. Finally, another fully-connected layer $G_y(\cdot;\theta_y)$ converts the video-level features into the final predictions, which are used to calculate the class prediction loss $\mathcal{L}_y$.
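As a minimal sketch of the TemPooling step, assuming each frame-level feature is a plain list of floats (the function name `tem_pooling` and the toy values are illustrative and not taken from the released code), mean-pooling along the temporal direction can be written as:

```python
from typing import List


def tem_pooling(frame_features: List[List[float]]) -> List[float]:
    """Average K frame-level feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame-level feature vector")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    # Sum each feature dimension over the K frames, then divide by K.
    return [
        sum(frame[d] for frame in frame_features) / num_frames
        for d in range(dim)
    ]


# Toy video with K = 3 frames and 2-dimensional features (hypothetical values).
video = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]
pooled = tem_pooling(video)
print(pooled)  # [2.0, 4.0]
```

Because the mean does not depend on the order of its inputs, shuffling the frames gives the same video-level vector, so the relation between frames is not captured by this module.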

Similar to image-based DA problems, the baseline approach is not able to generalize to data from different domains due to domain shift. Therefore, we integrate TemPooling with the unsupervised DA method inspired by one of the most popular adversarial-based approaches, DANN [11, 12]. The main idea is to add additional domain classifiers $G_d(\cdot;\theta_d)$ to discriminate whether the data is from the source or target domain. Before back-propagating the gradients to the main model, a gradient reversal layer (GRL) is inserted between $G_d$ and the main model to invert the gradient, as shown in Figure 2. During adversarial training, the parameters $\theta_{sf}$ are learned by maximizing the domain discrimination loss $\mathcal{L}_d$, and the parameters $\theta_d$ are learned by minimizing $\mathcal{L}_d$ with the domain label $d$. Therefore, the feature generator $G_f$ will be optimized to gradually align the feature distributions between the two domains.
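The behaviour of a gradient reversal layer can be illustrated without a deep-learning framework. The sketch below is not the authors' implementation: the class name `GradientReversal`, the explicit `forward`/`backward` methods, and the `scale` factor are assumptions made for illustration. It only shows the two properties the text relies on, namely that features pass through unchanged and that the gradient coming back from the domain classifier is negated before it reaches the feature extractor.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        # Hypothetical factor controlling how strongly the gradient is reversed.
        self.scale = scale

    def forward(self, features: List[float]) -> List[float]:
        # Features are handed to the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_discriminator: List[float]) -> List[float]:
        # The gradient flowing back to the feature extractor has its sign flipped.
        return [-self.scale * g for g in grad_from_discriminator]


grl = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
print(grl.forward(features) == features)  # True
print(grl.backward([1.0, -2.0, 4.0]))     # [-0.5, 1.0, -2.0]
```

With this layer in place, one backward pass updates the domain classifier to lower the domain loss while pushing the upstream feature extractor to raise it, which is the min-max behaviour described above.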

In this paper, we denote the Adversarial Discriminator $\hat{G}_d$ as the combination of a gradient reversal layer (GRL) and a domain classifier, and insert $\hat{G}_d$ into TemPooling in two ways: 1) $\hat{G}_{sd}$: shows how directly applying image-based DA approaches can benefit video DA; 2) $\hat{G}_{td}$: indicates how DA on temporal-dynamics-encoded features benefits video DA.

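The prediction loss $\mathcal{L}_y$, spatial domain loss $\mathcal{L}_{sd}$ and temporal domain loss $\mathcal{L}_{td}$ can be expressed as follows (ignoring all the parameter symbols throughout the paper to save space):

$$\mathcal{L}_y^i = L_y(G_y(G_{tf}(G_{sf}(X_i))), y_i) \tag{1}$$

$$\mathcal{L}_{sd}^i = \frac{1}{K}\sum_{j=1}^{K} L_d(G_{sd}(G_{sf}(x_i^j)), d_i) \tag{2}$$

$$\mathcal{L}_{td}^i = L_d(G_{td}(G_{tf}(G_{sf}(X_i))), d_i) \tag{3}$$

where $K$ is the number of frames sampled from each video and $L$ is the cross entropy loss function.

The overall loss can be expressed as follows:

$$\mathcal{L} = \frac{1}{N_S}\sum_{i=1}^{N_S}\mathcal{L}_y^i - \frac{1}{N_{S\cup T}}\sum_{i=1}^{N_{S\cup T}}\left(\lambda^s\mathcal{L}_{sd}^i + \lambda^t\mathcal{L}_{td}^i\right) \tag{4}$$

where $N_S$ is the number of source data, and $N_{S\cup T}$ is the number of all data. $\lambda^s$ and $\lambda^t$ are the trade-off weights for the spatial and temporal domain losses.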

3.2 Integration of Temporal Dynamics with DA

One main drawback of directly integrating image-based DA approaches into our baseline architecture is that the feature representations learned in the model are mainly from the spatial features. Although we implicitly encode the temporal information by the temporal pooling mechanism, the relation between frames is still missing. Therefore, we would like to address two questions: 1) Does the video DA problem benefit from encoding temporal dynamics into features? 2) Instead of only modifying feature encoding methods, how can DA be further integrated while encoding temporal dynamics into features?

To answer the first question, given the fact that humans can recognize actions by reasoning about observations across time, we propose the TemRelation architecture by replacing the temporal pooling mechanism with the Temporal Relation module, which is modified from [41, 57], as shown in Figure 4.

The n-frame temporal relation is defined by the function:

$$R_n(V_i) = \sum_m g_{\phi^{(n)}}\big((V_i^n)_m\big) \tag{5}$$

where $(V_i^n)_m = \{v_i^a, v_i^b, \ldots\}_m$ is the $m$th set of frame-level representations from $n$ temporal-ordered sampled frames, and $a$ and $b$ are the frame indices. We fuse the time-ordered feature vectors with the function $g_{\phi^{(n)}}$, which is an MLP with parameters $\phi^{(n)}$. To capture temporal relations at multiple time scales, we sum up all the $n$-frame relation features into the final video representation. In this way, the temporal dynamics are explicitly encoded into features. We then insert $\hat{G}_d$ into TemRelation as we did for TemPooling.
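The n-frame relation and the multi-scale sum can be sketched in a few lines. Several simplifications here are assumptions rather than details from the paper: `toy_g` is a hypothetical position-weighted sum standing in for the learned MLP, the same `toy_g` is reused for every n although the paper uses separate parameters per n, and all time-ordered subsets are enumerated, whereas a practical implementation may subsample them.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Vector = List[float]
Fuse = Callable[[Sequence[Vector]], Vector]


def toy_g(subset: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in for the MLP: an order-sensitive weighted sum."""
    dim = len(subset[0])
    return [sum((k + 1) * f[d] for k, f in enumerate(subset)) for d in range(dim)]


def n_frame_relation(frames: Sequence[Vector], n: int, g: Fuse) -> Vector:
    """Sum g over every time-ordered subset of n frame features."""
    if not 1 <= n <= len(frames):
        raise ValueError("n must be between 1 and the number of frames")
    dim = len(g(frames[:n]))
    total = [0.0] * dim
    # combinations() keeps the original index order, so each subset is time-ordered.
    for subset in combinations(frames, n):
        total = [t + o for t, o in zip(total, g(subset))]
    return total


def video_representation(frames: Sequence[Vector], g: Fuse) -> Vector:
    """Sum the n-frame relation features for n = 2..K."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    relations = [n_frame_relation(frames, n, g) for n in range(2, len(frames) + 1)]
    return [sum(r[d] for r in relations) for d in range(len(relations[0]))]


frames = [[1.0], [2.0], [3.0]]
print(video_representation(frames, toy_g))        # [34.0]
print(video_representation(frames[::-1], toy_g))  # [26.0]
```

Reversing the frame order changes the result, which illustrates how a relation-based representation can encode temporal order where mean-pooling cannot.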

Although aligning temporal-dynamic-encoded features benefits video DA, feature encoding and DA are still two separate processes, leading to sub-optimal DA performance. Therefore, we address the second question by proposing Temporal Adversarial Adaptation Network (TA2N), which explicitly integrates $\hat{G}_d$ inside the Temporal module to align the model across domains while learning temporal dynamics. Specifically, we integrate each $n$-frame relation with a corresponding relation discriminator $\hat{G}_{rd}^n$ because different $n$-frame relations represent different temporal characteristics, which correspond to different parts of actions. The relation domain loss $\mathcal{L}_{rd}$ can be expressed as follows:

$$\mathcal{L}_{rd}^i = \frac{1}{K-1}\sum_{n=2}^{K} L_d(G_{rd}^n(R_n(G_{sf}(X_i))), d_i) \tag{6}$$

The experimental results show that our integration strategy can effectively align domains spatio-temporally for videos, and outperforms approaches that are extended from sophisticated DA methods, even though TA2N is adapted from a simpler DA method (DANN) (see details in Tables 3, 4 and 5).

3.3 Temporal Attentive Alignment for Videos

The final video representation of TA2N is generated by aggregating multiple local temporal features. Although aligning temporal features across domains benefits video DA, not all the features are equally important to align. In order to effectively align overall temporal dynamics, we want to focus more on aligning the local temporal features which have larger domain discrepancy. Therefore, we represent the final video representation as a combination of local temporal features with different attention weighting, as shown in Figure 3, and aim to attend to features of interest that are domain discriminative so that the DA mechanism can focus on aligning those features. The main question becomes: How to incorporate domain discrepancy for attention?

Figure 3: The domain attention mechanism in TA3N. Thicker arrows correspond to larger attention weights.

To address this, we propose Temporal Attentive Adversarial Adaptation Network (TA3N), as shown in Figure 4, by introducing the domain attention mechanism, which utilizes the entropy criterion to generate the domain attention value for each $n$-frame relation feature as below:

$$w_i^n = 1 - H(\hat{d}_i^n) \tag{7}$$

where $\hat{d}_i^n$ is the output of $G_{rd}^n$ for the $i$th video. $H(p) = -\sum_k p_k \log(p_k)$ is the entropy function to measure uncertainty. $w_i^n$ increases when $H(\hat{d}_i^n)$ decreases, which means the domains can be distinguished well. We also add a residual connection for more stable optimization. Therefore, the final video feature representation $h_i$ generated from attended local temporal features, which are learned by local temporal modules $G_{tf}^{(n)}$, can be expressed as:

$$h_i = \sum_{n=2}^{K} (w_i^n + 1) \cdot G_{tf}^{(n)}(G_{sf}(X_i)) \tag{8}$$

Figure 4: The overall architecture of the proposed Temporal Attentive Adversarial Adaptation Network (TA3N). In the temporal relation module, time-ordered frames are used to generate $K-1$ relation feature representations $\mathbf{R}=\{R_2, \ldots, R_K\}$, where $R_n$ corresponds to the $n$-frame relation (the numbers in this figure are examples of time indices). After attending with the domain predictions from relation discriminators $G_{rd}^n$, the relation features are summed up to the final video representation. The attentive entropy loss $\mathcal{L}_{ae}$, which is calculated by domain entropy $H(\hat{d})$ and class entropy $H(\hat{y})$, aims to enhance the certainty of those videos that are more similar across domains. See the detailed architecture in the supplementary material.
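The domain attention of Equations 7 and 8 can be sketched as follows. This is an illustration under stated assumptions: relation features and domain predictions are plain Python lists, the natural logarithm is used (the text does not fix the base), and the names `entropy` and `attend` are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """H(p) = -sum_k p_k log(p_k), treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attend(
    relation_feats: Sequence[Sequence[float]],
    domain_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Combine relation features with weights (w_n + 1), where w_n = 1 - H(d_hat_n)."""
    dim = len(relation_feats[0])
    h = [0.0] * dim
    for feat, probs in zip(relation_feats, domain_probs):
        weight = 1.0 - entropy(probs)         # Equation 7
        for d in range(dim):
            h[d] += (weight + 1.0) * feat[d]  # Equation 8, "+ 1" is the residual
    return h


# Two relation features (hypothetical values) with their domain predictions.
feats = [[1.0, 0.0], [0.0, 1.0]]
preds = [[1.0, 0.0],   # discriminator is certain: entropy 0, factor 2
         [0.5, 0.5]]   # discriminator is unsure: entropy log(2), factor 2 - log(2)
h = attend(feats, preds)
print(h[0], round(h[1], 4))
```

Under the natural-log assumption and a two-way domain prediction, the entropy lies between 0 and log 2, so every relation feature keeps a factor of at least 2 - log 2, and features whose domain is predicted confidently are emphasized.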

Finally, we add the minimum entropy regularization to refine the classifier adaptation. However, we only want to minimize the entropy for the videos that are similar across domains. Therefore, we attend to the videos which have low domain discrepancy, so that we can focus more on minimizing the entropy for these videos. The attentive entropy loss $\mathcal{L}_{ae}$ can be expressed as follows:

$$\mathcal{L}_{ae}^i = (1 + H(\hat{d}_i)) \cdot H(\hat{y}_i) \tag{9}$$

where $\hat{d}_i$ and $\hat{y}_i$ are the outputs of $G_{td}$ and $G_y$, respectively. We also adopt the residual connection for stability.
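A small numeric sketch of Equation 9 makes the weighting concrete. The probabilities below are made-up values, the natural logarithm is an assumption, and `attentive_entropy` is a hypothetical helper name.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """H(p) = -sum_k p_k log(p_k), treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attentive_entropy(domain_probs: Sequence[float], class_probs: Sequence[float]) -> float:
    """Per-video attentive entropy loss: (1 + H(d_hat)) * H(y_hat)."""
    return (1.0 + entropy(domain_probs)) * entropy(class_probs)


class_probs = [0.7, 0.2, 0.1]  # same class prediction for both videos
similar_across_domains = attentive_entropy([0.5, 0.5], class_probs)
easy_to_tell_apart = attentive_entropy([0.99, 0.01], class_probs)
print(similar_across_domains > easy_to_tell_apart)  # True
```

For the same class prediction, the video whose domain the discriminator cannot decide receives the larger loss, so minimizing it concentrates the entropy regularization on videos that look similar across domains.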

By combining Equations 1, 2, 3, 6 and 9, and replacing $G_{sf}$ and $G_{tf}$ with $h_i$ by Equation 8, the overall loss of TA3N can be expressed as follows:

$$\mathcal{L} = \frac{1}{N_S}\sum_{i=1}^{N_S}\mathcal{L}_y^i + \frac{1}{N_{S\cup T}}\sum_{i=1}^{N_{S\cup T}}\gamma\mathcal{L}_{ae}^i - \frac{1}{N_{S\cup T}}\sum_{i=1}^{N_{S\cup T}}\left(\lambda^s\mathcal{L}_{sd}^i + \lambda^r\mathcal{L}_{rd}^i + \lambda^t\mathcal{L}_{td}^i\right) \tag{10}$$

where $\lambda^s$, $\lambda^r$ and $\lambda^t$ are the trade-off weights for each domain loss, and $\gamma$ is the weight for the attentive entropy loss. All the weights are chosen via grid search.
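To make the bookkeeping of Equation 10 explicit, the sketch below combines per-video loss values into the overall objective. The function name `ta3n_loss`, its argument names, and all numbers are hypothetical; the per-video losses are assumed to have been computed already.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def ta3n_loss(
    l_y_source: Sequence[float],  # class losses, source videos only (N_S values)
    l_ae_all: Sequence[float],    # attentive entropy losses, all videos
    l_sd_all: Sequence[float],    # spatial domain losses, all videos
    l_rd_all: Sequence[float],    # relation domain losses, all videos
    l_td_all: Sequence[float],    # temporal domain losses, all videos
    gamma: float,
    lam_s: float,
    lam_r: float,
    lam_t: float,
) -> float:
    """Class loss + gamma * attentive entropy - weighted domain losses."""
    domain = [
        lam_s * s + lam_r * r + lam_t * t
        for s, r, t in zip(l_sd_all, l_rd_all, l_td_all)
    ]
    return mean(l_y_source) + gamma * mean(l_ae_all) - mean(domain)


loss = ta3n_loss(
    l_y_source=[1.0, 3.0],
    l_ae_all=[0.5, 0.5, 1.0, 2.0],
    l_sd_all=[0.5] * 4,
    l_rd_all=[1.0] * 4,
    l_td_all=[2.0] * 4,
    gamma=0.5, lam_s=1.0, lam_r=0.5, lam_t=0.25,
)
print(loss)  # 1.0
```

Setting `gamma` and `lam_r` to zero recovers the baseline objective of Equation 4. The class loss is averaged over source videos only because target labels are unavailable, while the other terms are averaged over both domains.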

Our proposed TA3N and TADA [51] both utilize entropy functions for attention but with different perspectives. TADA aims to focus on the foreground objects for image DA, while TA3N aims to find important and discriminative parts of temporal dynamics to align for video DA.

4 Datasets

There are very few benchmark datasets for video DA, and only small-scale datasets have been widely used [44, 52, 17]. Therefore, we specifically create two cross-domain datasets to evaluate the proposed approaches for the video DA problem, as shown in Table 1. For more details about the datasets, please refer to the supplementary material.

| | UCF-HMDBsmall | UCF-Olympic | UCF-HMDBfull | Kinetics-Gameplay |
|---|---|---|---|---|
| length (sec.) | 1 - 21 | 1 - 39 | 1 - 33 | 1 - 10 |
| class # | 5 | 6 | 12 | 30 |
| video # | 1171 | 1145 | 3209 | 49998 |

Table 1: The comparison of the cross-domain video datasets.

4.1 UCF-HMDBfull

We extend UCF-HMDBsmall [44], which only selects 5 visually highly similar categories, by collecting all of the relevant and overlapping categories between UCF101 [43] and HMDB51 [21], which results in 12 categories. We follow the official split method to separate training and validation sets. This dataset, UCF-HMDBfull, includes more than 3000 video clips, which is around 3 times larger than UCF-HMDBsmall and UCF-Olympic.

4.2 Kinetics-Gameplay

In addition to real-world videos, we are also interested in virtual-world videos for DA. While there are more than ten real-world video datasets, there is a limited number of virtual-world datasets for video classification. It is mainly because rendering realistic human actions using game engines requires gaming graphics expertise which is time-consuming. Therefore, we create the Gameplay dataset by collecting gameplay videos from currently popular video games, Detroit: Become Human and Fortnite, to build our own video dataset for the virtual domain. For the real domain, we use one of the largest public video datasets Kinetics-600 [19, 2]. We follow the closed-set DA setting [34] to select 30 overlapping categories between the Kinetics-600 and Gameplay datasets to build the Kinetics-Gameplay dataset with both domains, including around 50K video clips. See the supplementary material for the complete statistics and example snapshots.

5 Experiments

We therefore evaluate DA approaches on four datasets: UCF-Olympic, UCF-HMDBsmall, UCF-HMDBfull and Kinetics-Gameplay.

5.1 Experimental Setup

UCF-Olympic and UCF-HMDBsmall. First, we evaluate our approaches on UCF-Olympic and UCF-HMDBsmall, and compare with all other works that also evaluate on these two datasets [44, 52, 17]. We follow the default settings, but the method to split the UCF video clips into training and validation sets is not specified in these papers, so we follow the official split method from UCF101 [43].

UCF-HMDBfull and Kinetics-Gameplay. For the self-collected datasets, we follow the common experimental protocol of unsupervised DA [34]: the training data consists of labeled data from the source domain and unlabeled data from the target domain, and the validation data is all from the target domain. However, unlike most of the image DA settings, our training and validation data in both domains are separate to avoid potential overfitting while aligning different domains. To compare with image-based DA approaches, we extend several state-of-the-art methods [12, 27, 23, 39] to video DA with our TemPooling and TemRelation architectures, as shown in Tables 3, 4 and 5. The difference between the “Target only” and “Source only” settings is the domain used for training. The “Target only” setting can be regarded as the upper bound without domain shift, while the “Source only” setting shows the lower bound, which directly applies the model trained with source data to the target domain without modification. See the supplementary material for full implementation details.

5.2 Experimental Results

UCF-Olympic and UCF-HMDBsmall. In these two datasets, our approach outperforms all the previous methods by at least 6.5% absolute difference (98.15% - 91.60%) on the “U → O” setting, and 9% difference (99.33% - 90.25%) on the “U → H” setting, as shown in Table 2.

These results also show that the performance on these datasets is saturated. With a strong CNN as the backbone architecture, even our baseline architecture TemPooling can achieve high accuracy without any DA method (e.g. 96.3% for “U → O”). This suggests that these two datasets are not enough to evaluate more sophisticated DA approaches, so larger-scale datasets for video DA are needed.

| Source → Target | U → O | O → U | U → H | H → U |
|---|---|---|---|---|
| W. Sultani et al. [44] | 33.33 | 47.91 | 68.70 | 68.67 |
| T. Xu et al. [52] | 87.00 | 75.00 | 82.00 | 82.00 |
| AMLS (GFK) [17] | 84.65 | 86.44 | 89.53 | 95.36 |
| AMLS (SA) [17] | 83.92 | 86.07 | 90.25 | 94.40 |
| DAAA [17] | 91.60 | 89.96 | - | - |
| TemPooling | 96.30 | 87.08 | 98.67 | 97.35 |
| TemPooling + DANN [12] | 98.15 | 90.00 | 99.33 | 98.41 |
| Ours (TA2N) | 98.15 | 91.67 | 99.33 | 99.47 |
| Ours (TA3N) | 98.15 | 92.92 | 99.33 | 99.47 |

Table 2: The accuracy (%) for the state-of-the-art work on UCF-Olympic and UCF-HMDBsmall (U: UCF, O: Olympic, H: HMDB). We only show their results which are fine-tuned with source data for fair comparison. Please refer to the supplementary material for more details. [17] did not test DAAA on UCF-HMDBsmall.

UCF-HMDBfull. We then evaluate our approaches and compare with other image-based DA approaches on the UCF-HMDBfull dataset, as shown in Tables 3 and 4. The accuracy difference between “Target only” and “Source only” indicates the domain gap. The gaps for the HMDB dataset are 11.11% for TemRelation and 10.28% for TemPooling (see Table 3), and the gaps for the UCF dataset are 21.01% for TemRelation and 17.16% for TemPooling (see Table 4). It is worth noting that the “Source only” accuracy of our baseline architecture (TemPooling) on UCF-HMDBfull is much lower than on UCF-HMDBsmall (e.g. 28.39% lower for “U → H”), which implies that UCF-HMDBfull contains much larger domain discrepancy than UCF-HMDBsmall. The value “Gain” is the difference from the “Source only” accuracy, which directly indicates the effectiveness of the DA approaches. We now answer the two questions for video DA in Section 3.2 (see Tables 3 and 4):

  1. Does the video DA problem benefit from encoding temporal dynamics into features?

    From Tables 3 and 4, we see that for the same DA method, TemRelation outperforms TemPooling in most cases, especially for the gain value. For example, “TemPooling+DANN” reaches 0.83% absolute accuracy gain on the “U → H” setting and 0.17% gain on the “H → U” setting, while “TemRelation+DANN” reaches 3.61% gain on “U → H” and 2.45% gain on “H → U”. This means that applying DA approaches to the video representations which encode the temporal dynamics improves the overall performance for cross-domain video classification.

  2. How to further integrate DA while encoding temporal dynamics into features?

    Although integrating TemRelation with image-based DA approaches generally has better alignment performance than the baseline (TemPooling), feature encoding and DA are still two separate processes. The alignment happens only before and after the temporal dynamics are encoded in features. In order to explicitly force alignment of the temporal dynamics across domains, we propose TA2N, which reaches 77.22% (5.55% gain) on “U → H” and 80.56% (6.66% gain) on “H → U”. Tables 3 and 4 show that although TA2N is adapted from a simple DA method (DANN), it still outperforms other approaches which are extended from more sophisticated DA methods but do not follow our strategy.

Finally, with the domain attention mechanism, our proposed TA3N reaches 78.33% (6.66% gain) on “U → H” and 81.79% (7.88% gain) on “H → U”, achieving state-of-the-art performance on UCF-HMDBfull in terms of accuracy and gain, as shown in Tables 3 and 4.

| Temporal Module | TemPooling Acc. | TemPooling Gain | TemRelation Acc. | TemRelation Gain |
|---|---|---|---|---|
| Target only | 80.56 | - | 82.78 | - |
| Source only | 70.28 | - | 71.67 | - |
| DANN [12] | 71.11 | 0.83 | 75.28 | 3.61 |
| JAN [27] | 71.39 | 1.11 | 74.72 | 3.05 |
| AdaBN [23] | 75.56 | 5.28 | 72.22 | 0.55 |
| MCD [39] | 71.67 | 1.39 | 73.89 | 2.22 |
| Ours (TA2N) | N/A | - | 77.22 | 5.55 |
| Ours (TA3N) | N/A | - | 78.33 | 6.66 |

Table 3: The comparison of accuracy (%) with other approaches on UCF-HMDBfull (U → H). Gain represents the absolute difference from the “Source only” accuracy. TA2N and TA3N are based on the TemRelation architecture, so they are not applicable to TemPooling.

| Temporal Module | TemPooling Acc. | TemPooling Gain | TemRelation Acc. | TemRelation Gain |
|---|---|---|---|---|
| Target only | 92.12 | - | 94.92 | - |
| Source only | 74.96 | - | 73.91 | - |
| DANN [12] | 75.13 | 0.17 | 76.36 | 2.45 |
| JAN [27] | 80.04 | 5.08 | 79.69 | 5.79 |
| AdaBN [23] | 76.36 | 1.40 | 77.41 | 3.51 |
| MCD [39] | 76.18 | 1.23 | 79.34 | 5.44 |
| Ours (TA2N) | N/A | - | 80.56 | 6.66 |
| Ours (TA3N) | N/A | - | 81.79 | 7.88 |

Table 4: The comparison of accuracy (%) with other approaches on UCF-HMDBfull (H → U).
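The domain gaps and gains quoted for UCF-HMDBfull follow directly from the TemRelation columns of Tables 3 and 4. A short check, with the accuracies copied from the tables and the dictionary layout chosen only for illustration:

```python
# Accuracies (%) copied from the TemRelation columns of Tables 3 and 4.
results = {
    "U -> H": {"target_only": 82.78, "source_only": 71.67, "TA3N": 78.33},
    "H -> U": {"target_only": 94.92, "source_only": 73.91, "TA3N": 81.79},
}

summary = {}
for task, acc in results.items():
    # Domain gap: "Target only" minus "Source only".
    gap = round(acc["target_only"] - acc["source_only"], 2)
    # Gain: accuracy of the DA method minus "Source only".
    gain = round(acc["TA3N"] - acc["source_only"], 2)
    summary[task] = (gap, gain)
    print(f"{task}: gap {gap:.2f}, TA3N gain {gain:.2f}")
```

This reproduces the 11.11% and 21.01% gaps and the 6.66% and 7.88% gains reported for TA3N.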

Kinetics-Gameplay. Kinetics-Gameplay is much more challenging than UCF-HMDBfull because the data is from real and virtual domains, which have more severe domain shifts. Here we only utilize TemRelation as our backbone architecture since it is shown to outperform TemPooling on UCF-HMDBfull. Table 5 shows that the accuracy gap between “Source only” and “Target only” is 47.27%, which is more than twice the number in UCF-HMDBfull. In this dataset, TA3N also outperforms all the other DA approaches by increasing the “Source only” accuracy from 17.22% to 27.50%.

| Method | Acc. | Gain |
|---|---|---|
| Target only | 64.49 | - |
| Source only | 17.22 | - |
| DANN [12] | 20.56 | 3.34 |
| JAN [27] | 18.16 | 0.94 |
| AdaBN [23] | 20.29 | 3.07 |
| MCD [39] | 19.76 | 2.54 |
| Ours (TA2N) | 24.30 | 7.08 |
| Ours (TA3N) | 27.50 | 10.28 |

Table 5: The comparison of accuracy (%) with other approaches on Kinetics-Gameplay.

5.3 Ablation Study and Analysis

Integration of $\hat{G}_d$. We use UCF-HMDBfull to investigate the performance of integrating $\hat{G}_d$ in different positions. There are three ways to insert the adversarial discriminator into our architectures, each corresponding to different feature representations, leading to three types of discriminators $\hat{G}_{sd}$, $\hat{G}_{td}$ and $\hat{G}_{rd}$, which are shown in Figure 4. The full experimental results are shown in Table 6. For the TemRelation architecture, utilizing $\hat{G}_{td}$ performs better than utilizing $\hat{G}_{sd}$ (0.58% absolute gain improvement on average across the two tasks), while the accuracies are the same for TemPooling. This means that the temporal relation module can encode temporal dynamics that help the video DA problem, but temporal pooling cannot. Utilizing the relation discriminator $\hat{G}_{rd}$ can further improve the performance (0.92% improvement) since we simultaneously align and learn the temporal dynamics across domains. Finally, by combining all three discriminators, TA2N improves even more (4.20% improvement).

S → T UCF → HMDB HMDB → UCF
Temporal Module TemPooling TemRelation TemPooling TemRelation
Target only 80.56 (-) 82.78 (-) 92.12 (-) 94.92 (-)
Source only 70.28 (-) 71.67 (-) 74.96 (-) 73.91 (-)
G^sd 71.11 (0.83) 74.44 (2.77) 75.13 (0.17) 74.44 (1.05)
G^td 71.11 (0.83) 74.72 (3.05) 75.13 (0.17) 75.83 (1.93)
G^rd - (-) 76.11 (4.44) - (-) 75.13 (1.23)
All G^d 71.11 (0.83) 77.22 (5.55) 75.13 (0.17) 80.56 (6.66)
Table 6: The full evaluation of accuracy (%) for integrating G^d in different positions without the attention mechanism. Gain values are in ().

Attention mechanism. In addition to TemRelation, we also apply the domain attention mechanism to TemPooling by attending to the raw frame features instead of the relation features, which improves the performance as well, as shown in Table 7. This implies that video DA can benefit from domain attention even if the backbone architecture does not encode temporal dynamics. We also compare the domain attention module with a general attention module, which calculates the attention weights via an FC-Tanh-FC-Softmax architecture. However, the general attention module performs worse because its weights are computed within one domain and do not take domain discrepancy into account, as shown in Table 8.

S → T UCF → HMDB HMDB → UCF
Temporal Module TemPooling TemRelation TemPooling TemRelation
Target only 80.56 (-) 82.78 (-) 92.12 (-) 94.92 (-)
Source only 70.28 (-) 71.67 (-) 74.96 (-) 73.91 (-)
All G^d 71.11 (0.83) 77.22 (5.55) 75.13 (0.17) 80.56 (6.66)
All G^d + Domain Attn. 73.06 (2.78) 78.33 (6.66) 78.46 (3.50) 81.79 (7.88)
Table 7: The effect of the domain attention mechanism.
S → T UCF → HMDB HMDB → UCF
Target only 82.78 (-) 94.92 (-)
Source only 71.67 (-) 73.91 (-)
No Attention 77.22 (5.55) 80.56 (6.66)
General Attention 77.22 (5.55) 80.91 (7.00)
Domain Attention 78.33 (6.66) 81.79 (7.88)
Table 8: The comparison of different attention methods.
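
The general attention baseline is described only as an FC-Tanh-FC-Softmax stack. The following is a minimal NumPy sketch of such a module under our own assumptions: the hidden size, the parameter names (`w1`, `b1`, `w2`, `b2`) and the weighted-sum pooling are illustrative and are not taken from the released code.

```python
import numpy as np

def general_attention(feats, w1, b1, w2, b2):
    """FC-Tanh-FC-Softmax attention over T feature vectors.

    feats: (T, D) features of one video (e.g. one row per relation feature).
    w1: (D, H), b1: (H,), w2: (H, 1), b2: (1,) are hypothetical parameters.
    Returns the (T,) attention weights and the (D,) attended feature.
    """
    hidden = np.tanh(feats @ w1 + b1)            # FC + Tanh
    scores = (hidden @ w2 + b2).squeeze(-1)      # FC, one score per feature
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # Softmax over T
    return weights, weights @ feats

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)
w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
weights, pooled = general_attention(feats, w1, b1, w2, b2)
```

Note that the weights depend only on the features of a single video, so nothing in this computation reflects the discrepancy between domains.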

Visualization of distribution. To investigate how our approaches bridge the gap between source and target domains, we visualize the distribution of both domains using t-SNE [31]. Figure 5 shows that TA3N can group source data (blue dots) into denser clusters and generalize the distribution into the target domains (orange dots) as well.

(a) TemPooling + DANN [12]
(b) TA3N
Figure 5: The comparison of t-SNE visualization. The blue dots represent source data while the orange dots represent target data. See the supplementary material for more comparisons.

Domain discrepancy measure. To measure the alignment between different domains, we use Maximum Mean Discrepancy (MMD) and the domain loss, both calculated using the final video representations. A lower MMD value and a higher domain loss (the domain discriminator finds it harder to distinguish the two domains) both imply a smaller domain gap. TA3N reaches a lower discrepancy loss (0.0842) than the TemPooling baseline (0.1840), and greatly improves the domain loss (from 1.1163 to 1.9286), as shown in Table 9.

Discrepancy loss Domain loss Validation accuracy
TemPooling 0.1840 1.1163 70.28
TemPooling + DANN [12] 0.1604 1.2023 71.11
TemRelation 0.2626 1.7588 71.67
TA3N 0.0842 1.9286 78.33
Table 9: The discrepancy loss (MMD), domain loss and validation accuracy of our baselines and proposed approaches.
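
To make the discrepancy measure concrete, the following is a minimal sketch of a (biased) squared-MMD estimate between two sets of video representations. The Gaussian kernel and the bandwidth `sigma` are our assumptions; the kernel actually used for Table 9 is not specified in this section.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel between rows of x (N, D) and y (M, D)."""
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two feature sets."""
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))
near = mmd2(src, src + 0.1)   # slightly shifted target
far = mmd2(src, src + 3.0)    # strongly shifted target
```

In this toy setting the strongly shifted target yields the larger value, which matches the reading of Table 9 that a lower MMD indicates better-aligned domains.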

6 Conclusion and Future Work

In this paper, we present two large-scale datasets for video domain adaptation, UCF-HMDBfull and Kinetics-Gameplay, including both real and virtual domains. We use these datasets to investigate the domain shift problem across videos, and show that simultaneously aligning and learning temporal dynamics achieves effective alignment without the need for sophisticated DA methods. Finally, we propose Temporal Attentive Adversarial Adaptation Network (TA𝟑N) to simultaneously attend, align and learn temporal dynamics across domains, achieving state-of-the-art performance on all of the cross-domain video datasets investigated. We plan to release the code and datasets.

The ultimate goal of our research is to solve real-world problems. Therefore, in addition to integrating more DA approaches into our video DA pipelines, there are two main directions we would like to pursue in future work: 1) applying TA3N to different cross-domain video tasks, including video captioning, segmentation, and detection; 2) extending these methods to the open-set setting [1, 40, 34, 15], in which the source and target domains have different categories. The open-set setting is much more challenging but closer to real-world scenarios.

7 Supplementary

In the supplementary material, we would like to show more detailed ablation studies, more implementation details, and a complete introduction of the datasets.

7.1 Visualization of distribution

We visualize the distribution of both domains using t-SNE [31] to investigate how our approaches bridge the gap between the source and target domains. Figures 6(a) and 6(b) show that the models using the TemPooling architecture align the distributions of the two domains poorly, even with the integration of image-based DA approaches. Figure 6(c) shows that the temporal relation module helps to group source data (blue) into denser clusters but is still not able to generalize the distribution to the target domain (orange). Finally, with TA3N, data from both domains are clustered and aligned with each other (Figure 6(d)).

(a) TemPooling
(b) TemPooling + DANN [12]
(c) TemRelation
(d) TA3N
Figure 6: The comparison of t-SNE visualization with source (blue) and target (orange) distributions.

7.2 Domain Attention Mechanism

We also apply the domain attention mechanism to TemPooling by attending to the raw frame features, as shown in Figure 7. Tables 10 and 11 show that the domain attention mechanism improves the performance for both the TemPooling and TemRelation architectures, for all types of adversarial discriminators. This implies that video DA can benefit from domain attention even if the backbone architecture does not encode temporal dynamics.

Figure 7: Baseline architecture (TemPooling) equipped with the domain attention mechanism (ignoring the input feature parts to save space).
Temporal Module TemPooling TemPooling + Attn. TemRelation TemRelation + Attn.
Target only 80.56 (-) 82.78 (-)
Source only 70.28 (-) 71.67 (-)
G^sd 71.11 (0.83) 71.94 (1.66) 74.44 (2.77) 75.00 (3.33)
G^td 71.11 (0.83) 72.78 (2.50) 74.72 (3.05) 76.94 (5.27)
G^rd - (-) - (-) 76.11 (4.44) 76.94 (5.27)
All G^d 71.11 (0.83) 73.06 (2.78) 77.22 (5.55) 78.33 (6.66)
Table 10: The evaluation of accuracy (%) for integrating G^d in different positions on “U → H”. Gain values are in ().
Temporal Module TemPooling TemPooling + Attn. TemRelation TemRelation + Attn.
Target only 92.12 (-) 94.92 (-)
Source only 74.96 (-) 73.91 (-)
G^sd 75.13 (0.17) 77.58 (2.62) 74.44 (1.05) 78.63 (4.72)
G^td 75.13 (0.17) 78.46 (3.50) 75.83 (1.93) 81.44 (7.53)
G^rd - (-) - (-) 75.13 (1.23) 78.98 (5.07)
All G^d 75.13 (0.17) 78.46 (3.50) 80.56 (6.66) 81.79 (7.88)
Table 11: The evaluation of accuracy (%) for integrating G^d in different positions on “H → U”. Gain values are in ().

7.3 Implementation Details

7.3.1 Detailed architectures

The architecture with detailed notations for the baseline is shown in Figure 8. For our proposed TA3N, after generating the n-frame relation features R_n with the temporal relation module, we calculate the domain attention value w_n using the domain prediction d^ from the relation discriminator G^rd_n, and then attend to R_n using w_n with a residual connection. To calculate the attentive entropy loss L_ae, since we only want to focus on the videos with low domain discrepancy, we attend to the class entropy loss H(y^) using the domain entropy H(d^) as the attention value with a residual connection, as shown in Figure 9.

Figure 8: The detailed baseline architecture (TemPooling) with the adversarial discriminators G^sd and G^td.
Figure 9: The detailed architecture of the proposed TA3N.
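
The two attention steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: we take the attention weight to be one minus the domain entropy, and we realize the residual connection as a factor of (weight + 1). The function names are hypothetical and the code is not the released implementation.

```python
import numpy as np

def entropy(probs, eps=1e-8):
    """Shannon entropy along the last axis of a probability array."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)

def domain_attention(relation_feats, domain_probs):
    """Attend to relation features using the domain prediction.

    relation_feats: (N, D), one relation feature per row.
    domain_probs:   (N, 2), discriminator output for each feature.
    Assumption: weight = 1 - H(d^), applied with a residual connection.
    """
    weights = 1.0 - entropy(domain_probs)
    return (weights[:, None] + 1.0) * relation_feats

def attentive_entropy(class_probs, domain_probs):
    """Class entropy weighted by domain entropy, with a residual connection.

    Assumption: loss = mean((1 + H(d^)) * H(y^)), so videos whose domain is
    hard to tell apart (high domain entropy) contribute more.
    """
    return np.mean((1.0 + entropy(domain_probs)) * entropy(class_probs))

feats = np.ones((2, 3))
domain_probs = np.array([[0.5, 0.5], [1.0, 0.0]])
attended = domain_attention(feats, domain_probs)
loss = attentive_entropy(np.full((2, 4), 0.25), domain_probs)
```

Under these assumptions, a feature whose domain is predicted confidently receives a larger attention weight, while the residual term keeps every feature from being suppressed entirely.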

7.3.2 Optimization

Our implementation is based on the PyTorch [33] framework. We utilize the ResNet-101 model pre-trained on ImageNet as the frame-level feature extractor. We sample a fixed number K of frame-level feature vectors with equal spacing in the temporal direction for each video (K is equal to 5 in our setting to limit computational resource requirements). For optimization, the initial learning rate is 0.03, and we follow one of the commonly used learning-rate-decreasing strategies shown in DANN [12]. We use stochastic gradient descent (SGD) as the optimizer, with momentum 0.9 and weight decay 1×10^-4. The ratio between the source and target batch sizes is proportional to the ratio between the sizes of the source and target datasets. The source batch size depends on the scale of the dataset: 32 for UCF-Olympic and UCF-HMDBsmall, 128 for UCF-HMDBfull and 512 for Kinetics-Gameplay. The optimized values of λs, λr and λt are found using a coarse-to-fine grid search. We first search over a coarse grid with the geometric sequence [0, 10^-3, 10^-2, …, 10^0, 10^1]. After finding the optimized range of values, [0, 1], we search again over a fine grid with the arithmetic sequence [0, 0.25, …, 1]. The final values are 0.75 for λs, 0.5 for λr and 0.75 for λt. We search γ only over a coarse grid, and the best value is 0.3. For future work, we plan to adopt adaptive weighting techniques used for multitask learning, such as uncertainty weighting [20] and GradNorm [4], to replace the manual grid-search method.
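
The settings above can be sketched in a few lines. This is an illustration under assumptions: we take the learning-rate strategy to be the annealing rule from DANN [12], lr = lr0 / (1 + α·p)^β with α = 10 and β = 0.75, where p is the training progress in [0, 1]; we pick segment centers as the equally spaced frames; and all function names are hypothetical.

```python
def dann_lr(progress, lr0=0.03, alpha=10.0, beta=0.75):
    """Annealed learning rate; progress goes from 0 (start) to 1 (end)."""
    return lr0 / (1.0 + alpha * progress) ** beta

def sample_frame_indices(num_frames, k=5):
    """Pick k equally spaced frame indices (centers of k equal segments)."""
    return [((2 * i + 1) * num_frames) // (2 * k) for i in range(k)]

# Coarse grid: 0 followed by the geometric sequence 10^-3 ... 10^1.
coarse_grid = [0.0] + [10.0 ** e for e in range(-3, 2)]
# Fine grid: arithmetic sequence over [0, 1] with step 0.25.
fine_grid = [0.25 * i for i in range(5)]

print(dann_lr(0.0), dann_lr(1.0))
print(sample_frame_indices(10))
print(coarse_grid, fine_grid)
```

A full grid search would train one model per candidate value of each loss weight and keep the value with the best validation accuracy; only the candidate grids are shown here.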

7.3.3 Comparison with other work

As mentioned in the experimental setup, we compare our proposed TA3N with other approaches by extending several state-of-the-art image-based DA methods [12, 27, 23, 39] to video DA with our TemPooling and TemRelation architectures, as follows:

  1. DANN [12]: we add one adversarial discriminator G^sd right after the spatial module and add another one G^td right after the temporal module. We do not add one more discriminator for relation features, for a fair comparison between TemPooling and TemRelation.

  2. JAN [27]: we add Joint Maximum Mean Discrepancy (JMMD) to the final video representation and the class prediction.

  3. AdaBN [23]: we integrate an adaptive batch-normalization layer into the feature generator G^sf. In the adaptive batch-normalization layer, the statistics (mean and variance) for both source and target domains are calculated, but only the target statistics are used for validating the target data.

  4. MCD [39]: we add another classifier G^y and follow the adversarial training procedure of Maximum Classifier Discrepancy to iteratively optimize the generators (G^sf and G^tf) and the classifier (G^y).
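
As an illustration of the AdaBN item above, the following is a minimal sketch of per-domain normalization statistics. The class and method names are hypothetical, the learnable scale and shift of batch normalization are omitted, and the statistics are computed from one array per domain rather than accumulated over mini-batches.

```python
import numpy as np

class AdaptiveBatchNorm:
    """Keeps separate mean/variance statistics for each domain."""

    def __init__(self, eps=1e-5):
        self.eps = eps
        self.stats = {}  # domain name -> (mean, variance)

    def update(self, feats, domain):
        """Record per-feature statistics for the given domain."""
        self.stats[domain] = (feats.mean(axis=0), feats.var(axis=0))

    def normalize(self, feats, domain):
        """Normalize features with the statistics of the chosen domain."""
        mean, var = self.stats[domain]
        return (feats - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(100, 3))
target = rng.normal(5.0, 2.0, size=(100, 3))

bn = AdaptiveBatchNorm()
bn.update(source, "source")
bn.update(target, "target")
# Validation on target data uses the target statistics only.
target_out = bn.normalize(target, "target")
mismatched = bn.normalize(target, "source")
```

Normalizing target data with target statistics recenters it, whereas reusing the source statistics leaves the shift between the two domains in the features.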

7.4 Datasets

The full summary of all four datasets investigated in this paper is shown in Table 12.

UCF-HMDBsmall UCF-Olympic UCF-HMDBfull Kinetics-Gameplay
length (sec.) 1 - 21 1 - 39 1 - 33 1 - 10
resolution UCF: 320×240 / Olympic: vary / HMDB: vary×240 / Kinetics: vary / Gameplay: 1280×720
frame rate UCF: 25 / Olympic: 30 / HMDB: 30 / Kinetics: vary / Gameplay: 30
class # 5 6 12 30
training video # UCF: 482 / HMDB: 350 UCF: 601 / Olympic: 250 UCF: 1438 / HMDB: 840 Kinetics: 43378 / Gameplay: 2625
validation video # UCF: 189 / HMDB: 150 UCF: 240 / Olympic: 54 UCF: 571 / HMDB: 360 Kinetics: 3246 / Gameplay: 749
Table 12: The summary of the cross-domain video datasets.

7.4.1 UCF-HMDBfull

We collect all of the relevant and overlapping categories between UCF101 [43] and HMDB51 [21], which results in 12 categories: climb, fencing, golf, kick_ball, pullup, punch, pushup, ride_bike, ride_horse, shoot_ball, shoot_bow, and walk. Each category may correspond to multiple categories in the original UCF101 or HMDB51 dataset, as shown in Table 13. This dataset, UCF-HMDBfull, includes 1438 training videos and 571 validation videos from UCF, and 840 training videos and 360 validation videos from HMDB, as shown in Table 12. Most videos in UCF are from certain scenarios or similar environments, while videos in HMDB are captured in unconstrained environments and from different camera angles, as shown in Figure 10.

UCF-HMDBfull | UCF | HMDB
climb | RockClimbingIndoor, RopeClimbing | climb
fencing | Fencing | fencing
golf | GolfSwing | golf
kick_ball | SoccerPenalty | kick_ball
pullup | PullUps | pullup
punch | Punch, BoxingPunchingBag, BoxingSpeedBag | punch
pushup | PushUps | pushup
ride_bike | Biking | ride_bike
ride_horse | HorseRiding | ride_horse
shoot_ball | Basketball | shoot_ball
shoot_bow | Archery | shoot_bow
walk | WalkingWithDog | walk
Table 13: The lists of all collected categories in UCF and HMDB.
(a) fencing
(b) kick_ball
(c) walk
Figure 10: Snapshots of some example categories on UCF-HMDBfull. For each category, the snapshots from UCF are shown in the upper row, and the snapshots from HMDB are shown in the lower row.

7.4.2 Kinetics-Gameplay

We create the Gameplay dataset by first collecting gameplay videos from two video games, Detroit: Become Human and Fortnite, to build our own action dataset for the virtual domain. The total length of the videos is 5 hours and 41 minutes. We segment all of the raw, untrimmed videos into video clips according to human annotations, which results in 91 categories: argue, arrange_object, assemble_object, break, bump, carry, carve, chop_wood, clap, climb, close_door, close_others, crawl, cross_arm, crouch, crumple, cry, cut, dance, draw, drink, drive, eat, fall_down, fight, fix_hair, fly_helicopter, get_off, grab, haircut, hit, hit_break, hold, hug, juggle_coin, jump, kick, kiss, kneel, knock, lick, lie_down, lift, light_up, listen, make_bed, mop_floor, news_anchor, open_door, open_others, paint_brush, pass_object, pet, poke, pour, press, pull, punch, push, push_object, put_object, raise_hand, read, row_boat, run, shake_hand, shiver, shoot_gun, sit, sit_down, slap, sleep, slide, smile, stand, stand_up, stare, strangle, swim, switch, take_off, talk, talk_phone, think, throw, touch, walk, wash_dishes, water_plant, wave_hand, and weld. The maximum length of each video clip is 10 seconds, and the minimum is 1 second. We also split the dataset into training, validation, and testing sets by randomly selecting videos in each category with the ratio 7:2:1. We build the Kinetics-Gameplay dataset by selecting 30 overlapping categories between Gameplay and one of the largest public video datasets, Kinetics-600 [19, 2]: break, carry, clean_floor, climb, crawl, crouch, cry, dance, drink, drive, fall_down, fight, hug, jump, kick, light_up, news_anchor, open_door, paint_brush, paraglide, pour, push, read, run, shoot_gun, stare, talk, throw, walk, and wash_dishes. Each category may also correspond to multiple categories in both datasets, as shown in Table 14.
Kinetics-Gameplay includes 43378 training videos and 3246 validation videos from Kinetics, and 2625 training videos and 749 validation videos from Gameplay, as shown in Table 12. Kinetics-Gameplay is much more challenging than UCF-HMDBfull due to the significant domain shift between the distributions of virtual and real data. Furthermore, aligning source and target data of such imbalanced scales is another challenge. Some example snapshots are shown in Figure 11.
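
The per-category 7:2:1 split described above can be sketched as follows. This is a minimal illustration, assuming clips are given as (clip id, category) pairs; the function name, the seed, and the rule that leftover clips go to the testing set are our own choices.

```python
import random
from collections import defaultdict

def split_by_category(clips, ratios=(7, 2, 1), seed=0):
    """Randomly split clips into train/val/test within each category.

    clips: iterable of (clip_id, category) pairs.
    ratios: relative sizes of the training, validation and testing sets.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for clip_id, category in clips:
        by_category[category].append(clip_id)

    splits = {"train": [], "val": [], "test": []}
    total = sum(ratios)
    for category in sorted(by_category):
        ids = by_category[category]
        rng.shuffle(ids)
        n_train = len(ids) * ratios[0] // total
        n_val = len(ids) * ratios[1] // total
        splits["train"] += [(i, category) for i in ids[:n_train]]
        splits["val"] += [(i, category) for i in ids[n_train:n_train + n_val]]
        # Any remainder after integer division goes to the testing set.
        splits["test"] += [(i, category) for i in ids[n_train + n_val:]]
    return splits

clips = [(f"{c}_{i}", c) for c in ("walk", "run") for i in range(10)]
splits = split_by_category(clips)
```

Splitting inside each category keeps the class proportions of the three sets close to those of the full dataset.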

Kinetics-Gameplay | Kinetics | Gameplay
break | breaking boards, smashing | break, bump, hit_break
carry | carrying baby | carry
clean_floor | mopping floor | mop_floor
climb | climbing a rope, climbing ladder, climbing tree, ice climbing, rock climbing | climb
crawl | crawling baby | crawl
crouch | squat, lunge | crouch, kneel
cry | crying | cry
dance | belly dancing, krumping, robot dancing | dance
drink | drinking shots, tasting beer | drink
drive | driving car, driving tractor | drive
fall_down | falling off bike, falling off chair, faceplanting | fall_down
fight | pillow fight, capoeira, wrestling, punching bag, punching person (boxing) | fight, strangle, punch, hit
hug | hugging (not baby), hugging baby | hug
jump | high jump, jumping into pool, parkour | jump
kick | drop kicking, side kick | kick
light_up | lighting fire | light_fire
news_anchor | news anchoring | news_anchor
open_door | opening door, opening refrigerator | open_door
paint_brush | brush painting | paint_brush
paraglide | paragliding | paraglide
pour | pouring beer | pour
push | pushing car, pushing cart, pushing wheelbarrow, pushing wheelchair, push up | push, push_object
read | reading book, reading newspaper | read
run | running on treadmill, jogging | run
shoot_gun | playing laser tag, playing paintball | shoot_gun
stare | staring | stare
talk | talking on cell phone, arguing, testifying | talk, argue, talk_phone
throw | throwing axe, throwing ball (not baseball or American football), throwing knife, throwing water balloon | throw
walk | walking the dog, walking through snow, jaywalking | walk
wash_dishes | washing dishes | wash_dishes
Table 14: The lists of all collected categories in Kinetics and Gameplay.
Figure 11: Some example screenshots from YouTube videos in Kinetics-Gameplay (left two: Gameplay, right two: Kinetics)

7.5 More Details

7.5.1 JAN on Kinetics-Gameplay

JAN [27] does not perform well on Kinetics-Gameplay compared to its performance on UCF-HMDBfull. The main reason is the imbalance in size between the source and target data in Kinetics-Gameplay. The discrepancy loss MMD is calculated using the same number of source and target samples (which is not the case for other types of DA approaches). Therefore, in each iteration, MMD is calculated using part of the source batch and the whole target batch. This means that the domain discrepancy is reduced only between part of the source data and the target data during training, so the learned model still overfits to the source domain. The discrepancy loss MMD works well when the source and target data are balanced, which is the case for most image DA datasets and UCF-HMDBfull, but not for Kinetics-Gameplay.
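
A rough calculation illustrates the scale of this imbalance. It combines the source batch sizes from Section 7.3.2 with the training-set sizes from Table 12, and assumes that the target batch size is the source batch size scaled by the dataset-size ratio and rounded to the nearest integer; the rounding rule and the function names are our assumptions.

```python
def target_batch_size(source_batch, n_source, n_target):
    """Target batch size proportional to the dataset-size ratio."""
    return max(1, round(source_batch * n_target / n_source))

def mmd_source_fraction(source_batch, n_source, n_target):
    """Fraction of each source batch that an equal-size MMD can use."""
    target_batch = target_batch_size(source_batch, n_source, n_target)
    return min(target_batch, source_batch) / source_batch

# Kinetics (43378 videos) -> Gameplay (2625 videos), source batch 512.
kg_fraction = mmd_source_fraction(512, 43378, 2625)
# UCF (1438 videos) -> HMDB (840 videos), source batch 128.
uh_fraction = mmd_source_fraction(128, 1438, 840)
print(kg_fraction, uh_fraction)
```

Under these assumptions, only a small part of each Kinetics batch enters the MMD term, whereas more than half of each UCF batch does.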

7.5.2 Comparison with AMLS [17]

When evaluating on UCF-HMDBsmall, AMLS [17] fine-tunes its networks using UCF and HMDB, respectively, before applying the DA approach. Here we only show the results that are fine-tuned with source data, because the target labels should be unseen during training in unsupervised DA settings. For example, we do not compare against the results that test on HMDB data using models fine-tuned with HMDB data, since this is not unsupervised DA.

7.5.3 Other baselines

3D ConvNets [46] have also been used for extracting video-level feature representations. However, 3D ConvNets consume a great deal of GPU memory, and [47] also shows that 3D ConvNets are limited by efficiency and effectiveness issues when extracting temporal information.

Optical flow captures the motion characteristics between neighboring frames to compensate for the lack of temporal information in raw RGB frames. In this paper, we focus on attending to the temporal dynamics to effectively align domains even with only RGB frames. We consider optical flow to be complementary to our method.

7.5.4 Comparison with literature in other fields

Cycle-consistency. Some papers related to cycle-consistency [50, 8] introduce self-supervised methods for learning visual correspondence between images or videos from unlabeled videos. They use cycle-consistency as free supervision to learn video representations. The main difference from our approach is that we explicitly align the feature spaces between source and target domains, while these self-supervised methods aim to learn general representations using only the source domain. We see cycle-consistency as a complementary method that can be integrated into our approach to achieve more effective domain alignment.

Robotics. In robotics, it is a common trend to transfer models trained in simulation to the real world. One effective method to bridge the domain gap is to randomize the dynamics of the simulator during training to improve the robustness to different environments [35]. The setting is different from our task because we focus on feature learning rather than policy learning, and we see domain randomization as a complementary technique that can extend our approach to a more generalized version.

7.5.5 Failure cases for TemRelation

TemRelation shows limited improvement over TemPooling for some categories that are consistent across time. For example, with the same DA method (DANN), TemRelation has the same accuracy as TemPooling for ride_bike (97%), and has lower accuracy for ride_horse (93% vs. 97%). A possible reason is that temporal pooling can already model temporally consistent actions well, and it may be redundant to model these actions with multiple timescales as TemRelation does.

7.5.6 Testing time for TA3N

Unlike TA2N, TA3N passes data to all the domain discriminators during testing. However, since all our domain discriminators are shallow, the testing time is similar. In our experiment, TA3N takes only 10% more computation time than TA2N.

References

  • [1] Pau Panareda Busto and Juergen Gall. Open set domain adaptation. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [2] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [4] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning (ICML), 2018.
  • [5] Gabriela Csurka. A comprehensive survey on domain adaptation for visual applications. In Domain Adaptation in Computer Vision Applications, pages 1–35. Springer, 2017.
  • [6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [7] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
  • [8] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [9] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [10] Geoff French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations (ICLR), 2018.
  • [11] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.
  • [12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [15] Yen-Chang Hsu, Zhaoyang Lv, and Zsolt Kira. Learning to cluster in order to transfer across domains and tasks. In International Conference on Learning Representations (ICLR), 2018.
  • [16] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
  • [17] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, and KS Venkatesh. Deep domain adaptation in action space. In British Machine Vision Conference (BMVC), 2018.
  • [18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [20] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV), 2011.
  • [22] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced wasserstein discrepancy for unsupervised domain adaptation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [23] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
  • [24] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. In International Conference on Learning Representations Workshop (ICLRW), 2017.
  • [25] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
  • [26] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
  • [27] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017.
  • [28] Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, and Shilei Wen. Attention clusters: Purely attention based local feature integration for video classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [29] Chih-Yao Ma, Min-Hung Chen, Zsolt Kira, and Ghassan AlRegib. Ts-lstm and temporal-inception: Exploiting spatiotemporal dynamics for activity recognition. Signal Processing: Image Communication, 2018.
  • [30] Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, and Hans Peter Graf. Attend and interact: Higher-order object interactions for video understanding. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [31] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. The Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [32] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.
  • [33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems Workshop (NeurIPSW), 2017.
  • [34] Xingchao Peng, Ben Usman, Kuniaki Saito, Neela Kaushik, Judy Hoffman, and Kate Saenko. Syn2real: A new benchmark for synthetic-to-real visual domain adaptation. arXiv preprint arXiv:1806.09755, 2018.
  • [35] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation (ICRA), 2018.
  • [36] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [37] Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
  • [38] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. Adversarial dropout regularization. In International Conference on Learning Representations (ICLR), 2018.
  • [39] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [40] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. Open set domain adaptation by backpropagation. In European Conference on Computer Vision (ECCV), 2018.
  • [41] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • [42] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  • [43] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [44] Waqas Sultani and Imran Saleemi. Human action recognition across datasets by foreground-weighted histogram decomposition. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [45] Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. In European Conference on Computer Vision Workshop (ECCVW), 2016.
  • [46] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [47] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [48] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [49] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), 2016.
  • [50] Xiaolong Wang, Allan Jabri, and Alexei A Efros. Learning correspondence from the cycle-consistency of time. In IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [51] Ximei Wang, Liang Li, Weirui Ye, Mingsheng Long, and Jianmin Wang. Transferable attention for domain adaptation. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [52] Tiantian Xu, Fan Zhu, Edward K Wong, and Yi Fang. Dual many-to-one-encoder-based transfer learning for cross-dataset human action recognition. Image and Vision Computing, 55:127–137, 2016.
  • [53] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [54] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (cmd) for domain-invariant representation learning. In International Conference on Learning Representations (ICLR), 2017.
  • [55] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [56] Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Kai Zheng, Xiaobin Zhu, and Lixin Duan. Learning transferable self-attentive representations for action recognition in untrimmed videos with weak supervision. In AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [57] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), 2018.