Domain-invariant Stereo Matching Networks
Abstract
State-of-the-art stereo matching networks have difficulties in generalizing to new unseen environments due to significant domain differences, such as color, illumination, contrast, and texture. In this paper, we aim at designing a domain-invariant stereo matching network (DSMNet) that generalizes well to unseen scenes. To achieve this goal, we propose i) a novel “domain normalization” approach that regularizes the distribution of learned representations to allow them to be invariant to domain differences, and ii) a trainable non-local graph-based filter for extracting robust structural and geometric representations that can further enhance domain-invariant generalizations. When trained on synthetic data and generalized to real test sets, our model performs significantly better than all state-of-the-art models. It even outperforms some deep learning models (e.g. MC-CNN [54]) fine-tuned with test-domain data. The code and dataset will be available at https://github.com/feihuzhang/DSMNet.
1 Introduction
Stereo reconstruction is a fundamental problem in computer vision, robotics and autonomous driving. It aims to estimate 3D geometry by computing disparities between matching pixels in a stereo image pair. Recently, many end-to-end deep neural network models (e.g. [56, 4, 17]) have been developed for stereo matching that achieve impressive accuracy on several datasets or benchmarks.
However, state-of-the-art stereo matching networks (supervised [56, 4, 17] and unsupervised [59, 45]) cannot generalize well to unseen data without fine-tuning or adaptation. Their difficulties lie in the large domain differences (such as color, illumination, contrast and texture) between stereo images in various datasets. As illustrated in Fig. 1, models pre-trained on one specific dataset produce poor results on other real and unseen scenes.
Domain adaptation and transfer learning methods (e.g. [45, 11, 3]) attempt to transfer or adapt from one source domain to another new domain. Typically, a large number of stereo images from the new domain are required for the adaptation. However, these cannot be easily obtained in many real scenarios. In such cases, a good method for disparity estimation is still needed even without data from the new domain.
Thus, it is desirable to design a model that can generalize well to unseen data without retraining or adaptation. The difficulties in developing such a domain-invariant stereo matching network (DSMNet) come from the significant domain differences between stereo images in various scenes/datasets (e.g. Fig. 1(a) and 1(b)). Such differences make the learned features unstable, distorted and noisy, leading to many wrong matching results.
Fig. 1 visualizes the features learned by some state-of-the-art stereo matching models [56, 4, 53]. Due to the limited effective receptive field of convolutional neural networks [28], they capture domain-sensitive local patterns (e.g. local contrast, edge and texture) when constructing matching features. These patterns, however, break down and produce many artifacts (e.g. noise) in the feature maps when applied to novel test data (Fig. 1(c)). The artifacts and distortions in the features inhibit robust matching, leading to wrong matching results (Fig. 1(e)).
In this paper, we propose two novel neural network layers for constructing a robust deep stereo matching network for cross-domain generalization without further fine-tuning or adaptation. Firstly, to reduce the domain shifts/differences between different datasets/scenes, we propose a novel domain normalization layer that fully regulates the feature’s distribution in both the spatial (height and width) and the channel dimensions. Secondly, to eliminate the artifacts and distortions in the features, we propose a learnable non-local graph-based filtering layer that can capture more robust structural and geometric representations (e.g. shape and structure, as illustrated in Fig. 1(d)) for domain-invariant stereo matching.
We formulate our method as an end-to-end deep neural network model and train it only with synthetic data. In our experiments, without any fine-tuning or adaptation on the real test datasets, our DSMNet far outperforms: 1) almost all state-of-the-art stereo matching models (e.g. GANet [56]) trained on the same synthetic dataset, 2) most of the traditional methods (e.g. cost filter [14] and SGM [13]), 3) most of the unsupervised/self-supervised models trained on the target test domains. Our model even surpasses some of the supervised deep neural network models fine-tuned on the target domains (e.g. MC-CNN [54], Content-CNN [29] and DispNetC [30]).
2 Related Work
2.1 Deep Neural Networks for Stereo Matching
In recent years, deep neural networks have seen great success in the task of stereo matching [17, 4, 56]. These models can be categorized into three types: 1) learning better features for traditional stereo matching algorithms, 2) correlation-based end-to-end deep neural networks, 3) cost-volume based stereo matching networks.
In the first category, deep neural networks have been used to compute patch-wise similarity scores as the matching costs [57, 54]. The costs are then fed into the traditional cost aggregation and disparity computation/refinement methods [13] to get the final disparity maps. The models are, however, limited by the traditional matching cost aggregation step and often produce wrong predictions in occluded regions, large textureless/reflective regions and around object edges.
DispNetC [30], a typical method in the second category, computes the correlations by warping between stereo views and attempts to predict the per-pixel disparity by minimizing a regression training loss. Many other state-of-the-art methods, including iResNet [25], CRL [36], SegStereo [51], EdgeStereo [42], HD${}^{3}$ [53], and MADNet [45], are all based on color or feature correlations between the left and right views for disparity estimation.
The recently developed cost-volume based models explicitly learn feature extraction, cost volume, and regularization function all end to end. Examples include GC-Net [17], PSMNet [4], StereoNet [18], AnyNet [49], GANet [56] and EMCUA [34]. They all utilize a similarity cost as the third dimension to build the 4D cost volume in which the real geometric context is maintained.
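As an illustration of how such a 4D cost volume can be assembled, the following is a minimal NumPy sketch of a concatenation-style volume in the spirit of GC-Net [17]. The function name `build_concat_cost_volume`, the feature layout `(C, H, W)` and the zero padding of invalid columns are assumptions made for this example; individual models differ in how they build and pad the volume.

```python
import numpy as np

def build_concat_cost_volume(left, right, max_disp):
    """Stack left features with right features shifted by each disparity.

    left, right: feature maps of shape (C, H, W).
    Returns a volume of shape (2C, max_disp, H, W); positions that have no
    counterpart in the right view (x < d) are left at zero (assumption).
    """
    c, h, w = left.shape
    volume = np.zeros((2 * c, max_disp, h, w), dtype=left.dtype)
    for d in range(max_disp):
        # A left pixel at column x is paired with the right pixel at column x - d.
        volume[:c, d, :, d:] = left[:, :, d:]
        volume[c:, d, :, d:] = right[:, :, : w - d]
    return volume
```

The resulting tensor has the axes (feature, disparity, height, width), which is the layout that 3D convolutions or the aggregation layers discussed later would operate on.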
There are also others that combine the correlation and cost volume strategies (e.g. [12]).
The common feature of these models is that they all require a large number of training samples with ground truth depth/disparity. More importantly, a model trained on one specific domain cannot generalize well to new scenes without fine-tuning or retraining.
2.2 Adaptation and Self-supervised Learning
Self-supervised Learning:
A recent trend is to train stereo matching networks in an unsupervised manner using image reconstruction losses obtained by warping between the left and right views [59, 58]. However, these methods cannot handle occluded and reflective regions, where there is no correspondence between the left and the right views. They also cannot generalize well to new domains.
Domain Adaptation:
Some methods pre-train the models on synthetic data and then explore the cross-domain knowledge to adapt to a new domain [11, 37]. Others focus on online or offline adaptation [44, 45, 43, 39]. For example, MADNet [45] is proposed to adapt the pre-trained model online and in real time, but it has poor accuracy even after the adaptation. Moreover, domain adaptation approaches require a large number of stereo images from the target domain, which cannot be easily obtained in many real scenarios. In such cases, a good method for disparity estimation is still needed even without data from the new domain.
2.3 Cross-Domain Generalization
Unlike domain adaptation, domain generalization is a much harder problem that assumes no access to target information for adaptation or fine-tuning. There are many approaches that explore the idea of domain-invariant feature learning. Previous approaches focus on developing data-driven strategies to learn invariant features from different source domains [32, 10, 20]. Some recent methods utilize meta-learning that takes variations in multiple source domains to generalize to novel test distributions [1, 21]. Other approaches [23, 22] employ an invariant adversarial network to learn domain-invariant representations/features for image recognition. Choy et al. [6] develop a universal feature learning framework for visual correspondences using deep metric learning.
In contrast to the above approaches, there are methods that try to improve the batch or instance normalization in order to improve the generalization and robustness for style transfer or image recognition [33, 24, 35].
In summary, little work has been done to improve the generalization ability of end-to-end deep neural network models for stereo matching, and in particular to develop domain-invariant stereo matching networks.
3 Proposed DSMNet
To overcome the challenges in cross-domain generalization, we develop in the following sections our domain-invariant stereo matching networks. These include domain normalization to remove the influence of the domain shifts (e.g. color, style, illuminance), as well as non-local graph-based filtering and aggregation to capture the non-local structural and geometric context as robust features for domain-invariant stereo reconstruction.
3.1 Domain Normalization
Batch normalization (BN) has become the default feature normalization operation for constructing end-to-end deep stereo matching networks [17, 4, 56, 42, 45, 30]. Although it can reduce the internal covariate shift effects in training deep networks, it is domain-dependent and has a negative influence on the model’s cross-domain generalization ability.
BN normalizes the features as follows:
$${\widehat{x}}_{i}=\frac{1}{{\sigma}_{i}}({x}_{i}-{\mu}_{i}).$$  (1) 
Here $x$ and $\widehat{x}$ are the input and output features, respectively, and $i$ indexes elements in a tensor (i.e. feature maps, as illustrated in Fig. 2) of size $N\times C\times H\times W$ ($N$: batch size, $C$: channels, $H$: spatial height, $W$: spatial width). ${\mu}_{i}$ and ${\sigma}_{i}$ are the corresponding channel-wise mean and standard deviation (std) and are computed by:
$${\mu}_{i}=\frac{1}{m}\sum _{k\in {S}_{i}}{x}_{k},\quad {\sigma}_{i}=\sqrt{\frac{1}{m}\sum _{k\in {S}_{i}}{({x}_{k}-{\mu}_{i})}^{2}+\epsilon},$$  (2) 
where ${S}_{i}$ is the set of elements in the same channel as element $i$ (Fig. 2), $m=|{S}_{i}|$ is the number of elements in that set, and $\epsilon$ is a small constant to avoid division by zero.
Mean $\mu $ and standard deviation $\sigma $ are computed per batch in the training phase, and the accumulated values of the training set are utilized for inference. However, different domains may have different $\mu $ and $\sigma $ caused by color shifts, contrast, and illumination (Fig. 1(a) and 1(b)). Thus $\mu $ and $\sigma $ computed for one dataset are not transferable to others.
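The dependence of BN on dataset statistics can be made concrete with a minimal NumPy sketch of Eqs. (1) and (2). The function names are hypothetical, and the statistics of a single source batch stand in for the accumulated training-set values (an assumption made to keep the example short).

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Eqs. (1)-(2): per-channel statistics over the N, H and W axes.

    x: tensor of shape (N, C, H, W). Returns the normalized tensor together
    with the mean and variance that would be stored for inference.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps), mu, var

def batch_norm_infer(x, mu, var, eps=1e-5):
    """Apply stored (source-domain) statistics to new data."""
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
source = rng.normal(size=(4, 3, 8, 8))
_, mu, var = batch_norm_train(source)

# A target domain with different contrast and brightness (toy shift).
target = 2.0 * source + 5.0
shifted = batch_norm_infer(target, mu, var)
print("channel means on the target domain:", shifted.mean(axis=(0, 2, 3)))
```

With source statistics, the target features are no longer centered, which is the transfer problem described above.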
Instance normalization (IN) [33, 38] overcomes the dependency on dataset statistics by normalizing each sample separately, where elements in ${S}_{i}$ are confined to be from the same sample as illustrated in Fig. 2. In theory, IN is domain-invariant, and normalization across the spatial dimensions ($H$, $W$) reduces image-level appearance/style variations.
However, matching of stereo views is realized at the pixel level by finding an accurate correspondence for each pixel using its $C$-channel feature vector. Any inconsistency in the norm and scaling of these feature vectors will significantly influence the matching cost and similarity measurements.
Fig. 3 illustrates that IN cannot regulate the norm distribution of pixel-wise feature vectors, which varies across datasets/domains.
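The effect can be reproduced on toy data. The sketch below applies IN to hypothetical features whose magnitude varies across the image and then inspects the $L_2$ norm of each pixel's $C$-channel vector. The data, the function names and the inner-product similarity are assumptions for illustration only.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) slice of x with shape (N, C, H, W)."""
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pixel_norms(x):
    """L2 norm of the C-channel feature vector at every pixel, shape (N, H, W)."""
    return np.sqrt((x ** 2).sum(axis=1))

def matching_cost(f_left, f_right):
    """Inner-product similarity between two C-channel feature vectors."""
    return float(np.dot(f_left, f_right))

# Toy features whose magnitude grows from left to right (hypothetical data).
rng = np.random.default_rng(0)
gain = np.linspace(0.1, 10.0, 16).reshape(1, 1, 1, 16)
features = rng.normal(size=(1, 8, 16, 16)) * gain
norms = pixel_norms(instance_norm(features))
print("per-pixel norm range after IN:", norms.min(), norms.max())
```

Because an inner-product similarity scales linearly with the norm of either vector, such spatially varying norms translate directly into inconsistent matching costs.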
We propose in Fig. 2 our domain-invariant normalization (DN). Our method normalizes features along the spatial axes ($H$, $W$) to induce style-invariant representations similar to IN, as well as along the channel dimension ($C$) to enhance the local invariance.
Our DN is realized as follows:
$${\widehat{x}}_{i}^{\prime}=\frac{{\widehat{x}}_{i}}{\sqrt{{\sum}_{k\in {S}_{i}^{\prime}}{\widehat{x}}_{k}^{2}+\epsilon}},$$  (3) 
where ${S}_{i}^{\prime}$ (green region in Fig. 2) includes $C$ elements from the same example ($N$ axis) and the same spatial location ($H$, $W$ axes). ${\widehat{x}}_{i}$ is computed as in Eq. (1) and (2) with elements in ${S}_{i}$ from the same channel and sample (blue region in Fig. 2). In DN, besides normalization across the spatial dimensions, we also employ ${L}_{2}$ normalization to normalize features along the channel axis. The two normalizations work together to address the sensitivity to domain shift and to suppress noise and extreme values in the feature vectors. As illustrated in Fig. 3, DN helps regulate the norm distribution of the features in different datasets and improves the robustness to local domain shifts (e.g. texture pattern, noise, contrast).
Finally, the trainable per-channel scale $\gamma $ and shift $\beta $ are added to enhance the discriminative representation ability, as in BN and IN. The final formulation is as follows:
$${y}_{i}={\gamma}_{i}{\widehat{x}}_{i}^{\prime}+{\beta}_{i}.$$  (4) 
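Eqs. (1) to (4) can be combined into a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `domain_norm` is hypothetical, the same $\epsilon$ is used in both steps, and $\gamma$, $\beta$ are passed as plain arrays of shape `(C,)` rather than trained parameters.

```python
import numpy as np

def domain_norm(x, gamma, beta, eps=1e-5):
    """Domain normalization sketch for x of shape (N, C, H, W).

    gamma, beta: per-channel scale and shift, each of shape (C,).
    """
    # Eqs. (1)-(2) with S_i restricted to one sample and one channel (as in IN).
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Eq. (3): L2 normalization over the C channels at each pixel.
    norm = np.sqrt((x_hat ** 2).sum(axis=1, keepdims=True) + eps)
    x_hat = x_hat / norm
    # Eq. (4): trainable per-channel affine transform.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

With $\gamma=1$ and $\beta=0$, every pixel's feature vector has (approximately) unit norm, and a per-sample change of contrast and brightness of the input leaves the output essentially unchanged.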
3.2 Non-local Aggregation
We propose a graph-based filter that robustly exploits non-local contextual information and reduces the dependence on local patterns (see Fig. 1(c)) for domain-invariant stereo matching.
3.2.1 Formulation
Our inspiration comes from traditional graph-based filters that are remarkably effective in employing non-local structural information for structure-preserving texture and detail removing/smoothing [55], denoising [55, 5], as well as depth-aware estimation and enhancement [26, 52].
For a 2D image/feature map $I$, we construct an 8-connected graph by connecting pixel $\mathbf{p}$ to its eight neighbors (see Fig. 4). To avoid loops and achieve fast non-local information aggregation over the graph, we split it into two reverse directed graphs ${G}_{1}$, ${G}_{2}$ (see Fig. 4(b) and 4(c)).
We assign weight ${\omega}_{e}$ to each edge $e\in G$, and a feature (or color) vector $C(\mathbf{p})$ to each node $\mathbf{p}\in G$. We also allow $\mathbf{p}$ to propagate information to itself with weight ${\omega}_{e}(\mathbf{p},\mathbf{p})$. For graph ${G}_{i}$ ($i=1,2$), our non-local filter is defined as follows:
$$\begin{array}{ccc}\hfill {C}_{i}^{A}(\mathbf{p})& =\hfill & \frac{\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})\cdot C(\mathbf{q})}{\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})},\hfill \\ \hfill W(\mathbf{q},\mathbf{p})& =\hfill & \sum _{{l}_{\mathbf{q},\mathbf{p}}\in {G}_{i}}\prod _{e\in {l}_{\mathbf{q},\mathbf{p}}}{\omega}_{e}.\hfill \end{array}$$  (5) 
Here, ${l}_{\mathbf{q},\mathbf{p}}$ is a feasible path from $\mathbf{q}$ to $\mathbf{p}$. Note that $e(\mathbf{q},\mathbf{q})$ is included in the path and counts for the start node $\mathbf{q}$. Unlike traditional geodesic filters, we consider all valid paths from source node $\mathbf{q}$ to target node $\mathbf{p}$. The propagation weight along path ${l}_{\mathbf{q},\mathbf{p}}$ is the product of all edge weights ${\omega}_{e}$ along the path. Here weight $W(\mathbf{q},\mathbf{p})$ is defined as the sum of the weights of all feasible paths from $\mathbf{q}$ to $\mathbf{p}$, which determines how much information is diffused to $\mathbf{p}$ from $\mathbf{q}$.
For the edge weight ${\omega}_{e}(\mathbf{q},\mathbf{p})$, we define it in a self-regularized manner as follows:
$$\begin{array}{cc}{\omega}_{e}(\mathbf{q},\mathbf{p})=\frac{\mathbf{x}_{\mathbf{p}}{}^{T}{\mathbf{x}}_{\mathbf{q}}}{{\parallel {\mathbf{x}}_{\mathbf{p}}\parallel}_{2}\cdot {\parallel {\mathbf{x}}_{\mathbf{q}}\parallel}_{2}},\hfill & \end{array}$$  (6) 
where ${\mathbf{x}}_{\mathbf{p}}$ and ${\mathbf{x}}_{\mathbf{q}}$ represent the feature vectors of $\mathbf{p}$ and $\mathbf{q}$, respectively. This definition does not introduce new parameters and is thus more robust in cross-domain generalization.
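Eq. (6) is the cosine similarity between the feature vectors of two connected pixels. A minimal NumPy sketch for one neighbor direction follows. The function name, the `(dy, dx)` offset convention and the zero weight for neighbors outside the map are assumptions for this example.

```python
import numpy as np

def cosine_edge_weights(feat, offset, eps=1e-8):
    """Cosine similarity (Eq. (6)) between each pixel p and q = p + offset.

    feat: feature map of shape (C, H, W); offset: (dy, dx).
    Returns an (H, W) map that is zero where q falls outside the map.
    """
    _, h, w = feat.shape
    dy, dx = offset
    unit = feat / (np.linalg.norm(feat, axis=0, keepdims=True) + eps)
    weights = np.zeros((h, w), dtype=float)
    # Range of p for which q = p + offset is still inside the map.
    y0, y1 = max(0, -dy), min(h, h - dy)
    x0, x1 = max(0, -dx), min(w, w - dx)
    p = unit[:, y0:y1, x0:x1]
    q = unit[:, y0 + dy:y1 + dy, x0 + dx:x1 + dx]
    weights[y0:y1, x0:x1] = (p * q).sum(axis=0)
    return weights
```

Note that a cosine similarity lies in $[-1, 1]$, so an additional normalization step is needed before the weights can satisfy a sum-to-one constraint; how this is done is not specified here.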
Compared to other local filters, such as the Gaussian filter, median filter, and mean filter, which can only propagate information in a local region determined by the filter kernel size, our proposed non-local filter allows the propagation of long-range information with weights as a spatial accumulation along all feasible paths in a graph.
For stable training and to avoid extreme values, we further add a normalization constraint to the weights associated with $\mathbf{p}$ in the graph ${G}_{i}$ as:
$$\sum _{\mathbf{q}\in {N}_{\mathbf{p}}}{\omega}_{e(\mathbf{q},\mathbf{p})}=1.$$  (7) 
Here, ${N}_{\mathbf{p}}$ is the set of the connected neighbors of $\mathbf{p}$ (including itself), and $e(\mathbf{q},\mathbf{p})$ is the directed edge connecting $\mathbf{q}$ and $\mathbf{p}$. For example, in Fig. 4(b), for node ${\mathbf{p}}_{0}$, ${\omega}_{e({\mathbf{p}}_{0},{\mathbf{p}}_{0})}=1$; and for node ${\mathbf{p}}_{4}$, ${\omega}_{0,4}+{\omega}_{1,4}+{\omega}_{e({\mathbf{p}}_{4},{\mathbf{p}}_{4})}=1$.
If Eq. (7) holds, we can further derive ${\sum}_{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})=1$ (the proof is available in the supplementary material). Eq. (5) can then be simplified as follows:
$$\begin{array}{cc}\hfill {C}_{i}^{A}(\mathbf{p})=& \sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})\cdot C(\mathbf{q}),\hfill \\ \hfill W(\mathbf{q},\mathbf{p})=& \sum _{{l}_{\mathbf{q},\mathbf{p}}\in {G}_{i}}\prod _{e\in {l}_{\mathbf{q},\mathbf{p}}}{\omega}_{e}.\hfill \end{array}$$  (8) 
Such a transformation not only increases the robustness in training but also reduces the computational costs.
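One way to see why the denominator of Eq. (5) equals one is a short induction over the propagation order of the directed acyclic graph ${G}_{i}$. The following is a sketch of such an argument under the path definition given above, not necessarily the proof used in the supplementary material. Every path from $\mathbf{q}$ to $\mathbf{p}$ is either the trivial path ($\mathbf{q}=\mathbf{p}$, self-edge only) or enters $\mathbf{p}$ through a neighbor $\mathbf{r}\ne \mathbf{p}$, so

$$W(\mathbf{q},\mathbf{p})={\omega}_{e(\mathbf{p},\mathbf{p})}\,[\mathbf{q}=\mathbf{p}]+\sum _{\mathbf{r}\in {N}_{\mathbf{p}},\mathbf{r}\ne \mathbf{p}}{\omega}_{e(\mathbf{r},\mathbf{p})}\,W(\mathbf{q},\mathbf{r}).$$

Summing over $\mathbf{q}$ and assuming ${\sum}_{\mathbf{q}}W(\mathbf{q},\mathbf{r})=1$ for all nodes $\mathbf{r}$ that precede $\mathbf{p}$ gives

$$\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})={\omega}_{e(\mathbf{p},\mathbf{p})}+\sum _{\mathbf{r}\in {N}_{\mathbf{p}},\mathbf{r}\ne \mathbf{p}}{\omega}_{e(\mathbf{r},\mathbf{p})}\cdot 1=1,$$

where the last step is Eq. (7). The base case is a node without incoming edges, for which Eq. (7) forces ${\omega}_{e(\mathbf{p},\mathbf{p})}=1$.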
3.2.2 Linear Implementation
Eq. (8) can be realized as an iterative linear aggregation, where the node representation is sequentially updated following the direction of the graph (e.g. from top to bottom, then left to right in ${G}_{1}$). In each step, $\mathbf{p}$ is updated as:
$$\begin{array}{cc}\hfill {C}_{i}^{A}(\mathbf{p})& ={\omega}_{e(\mathbf{p},\mathbf{p})}\cdot C(\mathbf{p})+\sum _{\mathbf{q}\in {N}_{\mathbf{p}},\mathbf{q}\ne \mathbf{p}}{\omega}_{e(\mathbf{q},\mathbf{p})}\cdot {C}_{i}^{A}(\mathbf{q})\hfill \\ \hfill s.t.& \sum _{\mathbf{q}\in {N}_{\mathbf{p}}}{\omega}_{e(\mathbf{q},\mathbf{p})}=1.\hfill \end{array}$$  (9) 
Finally, we repeat the aggregation process for both ${G}_{1}$ and ${G}_{2}$, where the updated representation with ${G}_{1}$ is used as the input for aggregation with ${G}_{2}$ (similar to PatchMatch stereo [2]). The aggregation of Eq. (9) is a linear process with time complexity of $O(n)$ (with $n$ nodes in the graph). During training, backpropagation can be realized by reversing the propagation equation, which is also a linear process (available in the supplementary material).
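The sequential update of Eq. (9) can be sketched for a single directed graph as follows. This is an illustrative NumPy implementation under stated assumptions, not the authors' code: the predecessor set (top-left, top, left) is a hypothetical choice of directed graph, the raw weights are assumed non-negative with a strictly positive self-weight, and they are rescaled per node so that Eq. (7) holds.

```python
import numpy as np

# Hypothetical predecessor offsets (dy, dx) of one directed graph.
PRED_OFFSETS = [(-1, -1), (-1, 0), (0, -1)]

def nonlocal_filter(C, w_self, w_in, offsets=PRED_OFFSETS):
    """One pass of Eq. (9) over a directed acyclic grid graph.

    C: features of shape (F, H, W); w_self: (H, W) self-edge weights;
    w_in: (K, H, W) raw weights of the edges arriving from each offset.
    """
    _, h, w = C.shape
    out = np.zeros_like(C, dtype=float)
    # Row-major order visits every predecessor before the node itself.
    for y in range(h):
        for x in range(w):
            preds = [(k, y + dy, x + dx) for k, (dy, dx) in enumerate(offsets)
                     if 0 <= y + dy < h and 0 <= x + dx < w]
            total = w_self[y, x] + sum(w_in[k, y, x] for k, _, _ in preds)
            acc = w_self[y, x] * C[:, y, x]
            for k, py, px in preds:
                acc = acc + w_in[k, y, x] * out[:, py, px]
            out[:, y, x] = acc / total  # enforces the constraint of Eq. (7)
    return out
```

Each node is visited once and touches a constant number of neighbors, which gives the $O(n)$ cost stated above. Feeding one impulse per node through the filter recovers the path weights $W(\mathbf{q},\mathbf{p})$ of Eq. (8), which is a convenient way to check that they sum to one.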
3.2.3 Relations to Existing Approaches
We show that the recently proposed semi-global aggregation (SGA) layer [56] and affinity-based propagation approach [27] are special cases of our graph-based non-local filter (Eq. (8)). In addition, we compare it with non-local neural networks [48, 50] and the attention mechanism [15].
Semi-global Aggregation (SGA) [56] is proposed as a differentiable approximation of SGM [13] and can be presented as follows:
$$\begin{array}{ccc}\hfill {C}_{\mathbf{r}}^{A}(\mathbf{p},d)& =\hfill & \text{sum}\left\{\begin{array}{l}{\omega}_{0}(\mathbf{p},\mathbf{r})\cdot C(\mathbf{p},d)\\ {\omega}_{1}(\mathbf{p},\mathbf{r})\cdot {C}_{\mathbf{r}}^{A}(\mathbf{p}-\mathbf{r},d)\\ {\omega}_{2}(\mathbf{p},\mathbf{r})\cdot {C}_{\mathbf{r}}^{A}(\mathbf{p}-\mathbf{r},d-1)\\ {\omega}_{3}(\mathbf{p},\mathbf{r})\cdot {C}_{\mathbf{r}}^{A}(\mathbf{p}-\mathbf{r},d+1)\\ {\omega}_{4}(\mathbf{p},\mathbf{r})\cdot \underset{i}{\mathrm{max}}\,{C}_{\mathbf{r}}^{A}(\mathbf{p}-\mathbf{r},i)\end{array}\right.\hfill \\ \hfill \text{s.t.}& & \sum _{i=0,1,2,3,4}{\omega}_{i}(\mathbf{p},\mathbf{r})=1\hfill \end{array}$$  (10) 
The aggregations are done in four directions, namely $\mathbf{r}\in \{(0,1),(0,-1),(1,0),(-1,0)\}$. Taking the right-to-left propagation ($\mathbf{r}=(0,1)$) as an example, we can construct a propagation graph as in Fig. 5(a). The $y$-coordinate represents disparity $d$, and the $x$-coordinate represents the indexes of the pixels/nodes. Compared to our non-local graph in Fig. 4(b), edges connecting top and bottom nodes are removed, and the maximum of each column is densely connected to every node of the next column (red edges). The SGA layer can then be realized by our proposed non-local filter in Eq. (8). Here, $(\mathbf{p}-\mathbf{r},d\pm 1)$ are the neighborhood nodes of $\mathbf{p}$, and ${\omega}_{0,\mathrm{\dots}4}$ are the corresponding edge weights.
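For reference, the recursion of Eq. (10) along a single scanline and a single direction can be sketched as below. The function name, the processing order (predecessor at index `x - 1`), the initialization of the first pixel and the border replication at the disparity limits are assumptions made for this example.

```python
import numpy as np

def sga_scanline(cost, weights):
    """Aggregate matching costs along one scanline following Eq. (10).

    cost: (X, D) costs for X pixels and D disparities.
    weights: (X, 5) weights (omega_0 ... omega_4), each row summing to 1.
    The predecessor p - r of pixel x is taken to be x - 1 (assumption).
    """
    n, _ = cost.shape
    agg = np.empty_like(cost, dtype=float)
    agg[0] = cost[0]  # the first pixel has no predecessor
    for x in range(1, n):
        prev = agg[x - 1]
        # Values at d - 1 and d + 1, replicating the border (assumption).
        prev_minus = np.concatenate(([prev[0]], prev[:-1]))
        prev_plus = np.concatenate((prev[1:], [prev[-1]]))
        w0, w1, w2, w3, w4 = weights[x]
        agg[x] = (w0 * cost[x] + w1 * prev + w2 * prev_minus
                  + w3 * prev_plus + w4 * prev.max())
    return agg
```

Each aggregated value depends only on the current cost and on a fixed set of nodes in the previous column, which is the graph structure described for Fig. 5(a).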
The Affinity-based Spatial Propagation in [27] can be achieved as:
$$\begin{array}{ccc}{C}^{A}(\mathbf{p},d)\hfill & =\hfill & \left(1-\sum _{\mathbf{q}\in {N}_{\mathbf{p}},\mathbf{q}\ne \mathbf{p}}{\omega}_{e(\mathbf{q},\mathbf{p})}\right)C(\mathbf{p})\hfill \\ & +\hfill & \sum _{\mathbf{q}\in {N}_{\mathbf{p}},\mathbf{q}\ne \mathbf{p}}{\omega}_{e(\mathbf{q},\mathbf{p})}{C}^{A}(\mathbf{q}),\hfill \end{array}$$  (11) 
where ${\omega}_{e(\mathbf{q},\mathbf{p})}$ are the learned affinities. $1-{\sum}_{\mathbf{q}\in {N}_{\mathbf{p}},\mathbf{q}\ne \mathbf{p}}{\omega}_{e(\mathbf{q},\mathbf{p})}$ is equal to our weight ${\omega}_{e(\mathbf{p},\mathbf{p})}$ for $\mathbf{p}$. The graphs for filtering can be constructed as in Fig. 5(b) and 5(c) for the one-way and three-way propagations [27], respectively.
The Non-local Neural Networks and Attentions [48, 50, 15] are implemented without spatial and structural awareness. The similarity definition between two pixels only considers the feature differences without considering their spatial distances. Therefore, they easily smooth out depth edges and thin structures (as illustrated in the supplementary material). Our non-local filter spatially aggregates the message along the paths in the graph, which avoids over-smoothing and better preserves the structure of the disparity maps.
3.3 Network Architecture
As illustrated in Fig. 6, we utilize the backbone of GANet as the baseline architecture. The local guided aggregation layer in [56] is removed since it is domain-dependent and captures many local patterns that are very sensitive to local domain shifts.
We replace the original batch normalization layer by our proposed domain normalization layer for feature extraction. For the feature extraction network, we utilize a total of seven proposed filtering layers. For 3D cost aggregation of the cost volume, two non-local filters are further added for cost volume filtering in each channel/depth. All the details of the network architecture are presented in Table I in the supplementary material.
4 Experimental Results
In our experiments, we train our method only with synthetic data and test it on four real datasets to evaluate its domain generalization ability. During training, we use disparity regression [17] for disparity prediction, and the smooth ${L}_{1}$ loss to compute the errors for backpropagation (the same as in [56, 4]). All the models are optimized with Adam (${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$). We train with a batch size of 8 on four GPUs using $288\times 624$ random crops from the input images. The maximum of the disparity is set as 192. We train the model on the synthetic dataset for 10 epochs with a constant learning rate of 0.001. All other training settings are kept the same as those in [56].
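The two training ingredients named above, disparity regression [17] and the smooth ${L}_{1}$ loss, can be sketched in NumPy as follows. The sign convention (lower cost means better match, so a softmax is taken over negated costs), the unmasked mean and the function names are assumptions for this example.

```python
import numpy as np

def disparity_regression(cost):
    """Soft-argmin style regression over a cost volume of shape (D, H, W).

    Lower cost is assumed to mean a better match. Returns the expected
    disparity under the per-pixel softmax distribution, shape (H, W).
    """
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    disp_values = np.arange(cost.shape[0], dtype=float).reshape(-1, 1, 1)
    return (prob * disp_values).sum(axis=0)

def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic for errors below 1, linear above."""
    diff = np.abs(pred - target)
    return float(np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean())
```

With the maximum disparity of 192 used here, `cost` would have 192 entries along its first axis. In practice the loss is usually averaged only over pixels with valid ground truth, which this sketch omits.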
4.1 Datasets
KITTI stereo 2012 [9] and 2015 [31] datasets provide about 400 image pairs of outdoor driving scenes for training, where the disparity labels are transformed from Velodyne LiDAR points. The Cityscapes dataset [7] provides a large amount of high-resolution ($1k\times 2k$) stereo images collected from outdoor city driving scenes. The disparity labels are pre-computed by SGM [13] and are not accurate enough for training deep neural network models. The Middlebury stereo dataset [40] is designed for indoor scenes with higher resolution (up to $2k\times 3k$), but it provides no more than 50 image pairs, which are not enough to train robust deep neural networks. In addition, the ETH3D dataset [41] provides 27 pairs of grayscale images for training.
These existing real datasets are all limited by their small quantity or poor ground-truth labels, making them insufficient for training deep learning models. Hence, we just use them as test sets for evaluating our models’ cross-domain generalization ability.
We mainly use synthetic data to train our domain-invariant models. The existing Scene Flow synthetic dataset [30] contains 35k training image pairs with a resolution of $540\times 960$. This dataset has a limited number of outdoor driving scenes and provides stereo pairs with only a few settings of camera baselines and image resolutions. We use CARLA [8] to generate a new supplementary synthetic dataset (with 20k stereo pairs) with more diverse settings, including two kinds of image resolutions ($720\times 1080$ and $1080\times 1920$), three different focal lengths, and five different camera baselines (in a range of 0.2-1.5 m). This supplementary dataset, which will be published with the paper, can significantly improve the diversity of the training set.
The two advantages in using synthetic data are that it can avoid all the difficulties of labeling a large amount of real data, and that it can eliminate the negative influence of wrong depth values in real datasets.
4.2 Ablation Study
We evaluate the performance of our DSMNet with numerous settings, including different architectures, normalization strategies and numbers (0-9) of the proposed non-local filter (NLF) layers. As listed in Table 1, the full-setting DSMNet far outperforms the baseline in accuracy by 3% on the KITTI and 8% on the Middlebury datasets. Our proposed domain normalization improves the accuracy by about 1.5%, and the NLF layers contribute another 1.4% on the KITTI dataset.

| Normalize | Non-local Filter (feature) | Non-local Filter (cost volume) | Backbone | Midd (3-pixel) | KITTI (2-pixel) |
|---|---|---|---|---|---|
| BN |  |  | ours | 30.3 | 9.4 |
| DN |  |  | ours | 27.1 | 7.9 |
| DN | +3 |  | ours | 24.2 | 7.1 |
| DN | +7 |  | ours | 22.9 | 6.8 |
| DN | +9 |  | ours | 22.4 | 6.8 |
| DN | +7 | +2 | ours | 21.8 | 6.5 |
| BN |  |  | PSMNet | 39.5 | 16.3 |
| BN |  |  | GANet | 32.2 | 11.7 |
| DN | +7 | +2 | PSMNet | 26.1 | 8.5 |
| DN | +7 | +2 | GANet | 23.7 | 7.3 |

Moreover, our proposed layers are generic and could be seamlessly integrated into other deep stereo matching models. Here, we replace our backbone model with GANet [56] and PSMNet [4]. The accuracies are improved by 4$\sim $8% on the KITTI dataset and 8$\sim $13% on the Middlebury dataset for cross-domain evaluations compared with the original PSMNet and GANet.
4.3 Component Analysis and Comparisons
To further validate the advantages of the proposed layers, we compare each of them with other related normalization and non-local strategies.

| Models | Middlebury (full) | KITTI |
|---|---|---|
| Batch Norm | 29.1 | 7.3 |
| Instance Norm | 27.1 | 6.4 |
| Adaptive Norm [33] | 28.2 | 6.8 |
| Attention [15] | 25.2 | 5.9 |
| Feature Denoising [50] | 25.9 | 6.1 |
| Affinity [27] | 23.1 | 5.2 |
| DSMNet (full setting) | 20.1 | 4.1 |

Normalization Strategies.
Table 2 compares our domain normalization with batch normalization [16], instance normalization [47], and the recently proposed adaptive batch-instance normalization [33]. We keep all other settings the same as our DSMNet and only replace the normalization method for training and evaluation. Our domain normalization is superior to others for domain-invariant stereo matching because it can fully regulate the feature vectors’ distribution and remove both image-level and local contrast differences for cross-domain generalization.
Non-local Approaches.
Finally, we compare our graph-based non-local filter with other related strategies, including affinity-based propagation [27], non-local neural network denoising [50], and non-local attention [15] (in Table 2). Our graph-based filtering strategy is better for capturing the structural and geometric context for robust domain-invariant stereo matching. The non-local neural network denoising [50] and non-local attention [15] do not have spatial constraints, which usually leads to over-smoothing of the depth edges (as shown in the supplementary material). Affinity-based propagations [27] are special cases of our proposed filtering strategy and are not as effective in feature and cost volume aggregations for stereo matching.

| Models |  |  |  |  |  |  |  |
|---|---|---|---|---|---|---|---|
| CostFilter [14] | 21.7 | 18.9 | 57.2 | 40.5 | 17.6 | 31.1 | 41.1 |
| PatchMatch [2] | 20.1 | 17.2 | 50.2 | 38.6 | 16.1 | 24.1 | 30.1 |
| SGM [13] | 7.1 | 7.6 | 38.1 | 25.2 | 10.7 | 12.9 | 20.2 |
| Training set: SceneFlow |  |  |  |  |  |  |  |
| HD${}^{3}$ [53] | 23.6 | 26.5 | 50.3 | 37.9 | 20.3 | 54.2 | 35.7 |
| GwcNet [12] | 20.2 | 22.7 | 47.1 | 34.2 | 18.1 | 30.1 | 33.2 |
| PSMNet [4] | 15.1 | 16.3 | 39.5 | 25.1 | 14.2 | 23.8 | 25.9 |
| GANet [56] | 10.1 | 11.7 | 32.2 | 20.3 | 11.2 | 14.1 | 18.8 |
| Our DSMNet | 6.2 | 6.5 | 21.8 | 13.8 | 8.1 | 6.2 | 9.8 |
| Training set: SceneFlow + Carla |  |  |  |  |  |  |  |
| HD${}^{3}$ [53] | 19.1 | 19.5 | 47.3 | 35.2 | 19.5 | 45.2 | – |
| GwcNet [12] | 17.2 | 18.1 | 45.2 | 31.8 | 17.2 | 29.4 | – |
| PSMNet [4] | 10.3 | 11.0 | 35.5 | 23.7 | 13.8 | 20.3 | – |
| GANet [56] | 7.2 | 7.6 | 31.9 | 19.7 | 11.4 | 13.5 | – |
| Our DSMNet | 3.9 | 4.1 | 20.1 | 13.6 | 8.2 | 6.0 | – |

4.4 Cross-Domain Evaluations
In this section, we compare our proposed DSMNet with state-of-the-art stereo matching models by training with synthetic data and evaluating on real test sets.
Comparisons with State-of-the-Art Models.
In Table 3 and Fig. 7, we compare our DSMNet with other state-of-the-art deep neural network models on the four real datasets. All the models are trained on synthetic data (either SceneFlow or a mixture of SceneFlow and Carla). We find that DSMNet far outperforms the state-of-the-art models by 3$\sim $30% in error rates on all these datasets. It is also far better than traditional stereo matching algorithms, like SGM [13], CostFilter [14] and PatchMatch [2].
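The error rates reported in these comparisons are threshold error rates. A minimal sketch of such a metric is given below, assuming that a pixel counts as wrong when its absolute disparity error exceeds the threshold and that pixels with non-positive ground truth are invalid. The function name and the validity convention are assumptions, and official benchmark protocols may apply additional conditions that this sketch omits.

```python
import numpy as np

def threshold_error_rate(pred, gt, threshold=3.0, valid=None):
    """Percentage of valid pixels whose disparity error exceeds `threshold`.

    pred, gt: disparity maps of the same shape. By default, pixels with
    gt <= 0 are treated as having no ground truth (assumption).
    """
    if valid is None:
        valid = gt > 0
    err = np.abs(pred - gt)[valid]
    return 100.0 * float((err > threshold).mean())
```

For example, a 2-pixel error rate is obtained with `threshold_error_rate(pred, gt, threshold=2.0)`. The function expects at least one valid pixel.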

| Models | Training Set | Error Rates (%) |
|---|---|---|
| Our DSMNet | Synthetic | 3.71 |
| MC-CNN-acrt [54] | KITTI-gt | 3.89 |
| DispNetC [30] | KITTI-gt | 4.34 |
| Content-CNN [29] | KITTI-gt | 4.54 |
| MADNet-finetune [45] | KITTI-gt | 4.66 |
| Weak Supervise [46] | KITTI-gt | 4.97 |
| MADNet [45] | KITTI (no gt) | 8.23 |
| OASM-Net [19] | KITTI (no gt) | 8.98 |
| Unsupervised [59] | KITTI (no gt) | 9.91 |

Evaluation on the KITTI Benchmark.
Table 4 presents the performance of our DSMNet on the KITTI benchmark [31]. Our model far outperforms most of the unsupervised/selfsupervised models trained on the KITTI domain. It is even better than supervised stereo matching networks (including, MCCNN[54], contentCNN[29], and DispNetC [30]) trained or finetuned on the KITTI dataset. When compared with other finetuned stateoftheart models (e.g. PSMNet[4], HD${}^{3}$[53], GANetdeep[56]), our DSMNet (without finetuning) produces more accurate object boundaries (Fig. 8).
4.5 Finetuning
In this section, we show DSMNet’s best performance when finetuned on the target domain. We finetune the model pretrained on synthetic data for a further 700 epochs using the KITTI 2015 training set. The learning rate for finetuning begins at 0.001 for the first 300 epochs and decreases to 0.0001 for the rest. The results are submitted to the KITTI benchmarks for evaluations.
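The fine-tuning schedule described above amounts to a single step in the learning rate. As a small sketch, assuming zero-based epoch indices and a hypothetical helper name:

```python
def finetune_lr(epoch, switch_epoch=300, base_lr=1e-3, final_lr=1e-4):
    """Learning rate for a zero-based fine-tuning epoch index.

    The first `switch_epoch` epochs use `base_lr`; all later epochs
    use `final_lr`.
    """
    return base_lr if epoch < switch_epoch else final_lr

# Schedule for the 700 fine-tuning epochs.
schedule = [finetune_lr(e) for e in range(700)]
```

The list `schedule` holds 300 entries at 0.001 followed by 400 entries at 0.0001.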
Table 5 compares the results of the fine-tuned DSMNet with those of other state-of-the-art DNN models. We find that DSMNet outperforms most of the recent models (including PSMNet [4], HD${}^{3}$ [53], GwcNet [12] and GANet-15 [56]) by a noteworthy margin. This implies that, when fine-tuned on one specific dataset, DSMNet reaches the accuracy of dataset-specific models: its cross-domain generalization ability does not come at the cost of in-domain accuracy.
We also separately test the effectiveness of our non-local filtering strategy. Using the current best “GANet-deep” [56] (including the Local Guided Aggregation layer) as the baseline, we add five filtering layers for feature extraction. All other settings are kept the same as in the original GANet. After training on synthetic data and fine-tuning on the KITTI training dataset, the model achieves a new state-of-the-art result (1.77%) on the KITTI 2015 benchmark. This shows that our graph-based filter improves not only cross-domain generalization but also the accuracy on the test domains.
4.6 Efficiency and Parameters
Our proposed non-local filtering is a linear process that can be realized efficiently. The inference time increases by no more than 5% compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers. Detailed comparisons are available in the supplementary material.
5 Conclusion
In this paper, we have proposed two end-to-end trainable neural network layers for our domain-invariant stereo matching network. Our novel domain normalization can fully regulate the distribution of learned features to address significant domain shifts, and our non-local graph-based filter can capture more robust non-local structural and geometric features for accurate disparity estimation in cross-domain situations. We have verified our model on four real datasets and have shown its superior accuracy when compared to other state-of-the-art stereo matching networks in cross-domain generalization.
References
 [1] (2018) MetaReg: towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems (NIPS), pp. 998–1008.
 [2] (2011) PatchMatch stereo - stereo matching with slanted support windows. In British Machine Vision Conference (BMVC), pp. 1–11.
 [3] (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3722–3731.
 [4] (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5410–5418.
 [5] (2013) Fast patch-based denoising using approximated patch geodesic paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1211–1218.
 [6] (2016) Universal correspondence network. In Advances in Neural Information Processing Systems (NIPS), pp. 2414–2422.
 [7] (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223.
 [8] (2017) CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938.
 [9] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361.
 [10] (2015) Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2551–2559.
 [11] (2018) Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 484–500.
 [12] (2019) Group-wise correlation stereo network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3273–3282.
 [13] (2008) Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2), pp. 328–341.
 [14] (2013) Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 504–511.
 [15] (2019) CCNet: criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 603–612.
 [16] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 [17] (2017) End-to-end learning of geometry and context for deep stereo regression. CoRR abs/1703.04309.
 [18] (2018) StereoNet: guided hierarchical refinement for real-time edge-aware depth prediction. CoRR abs/1807.08865.
 [19] (2018) Occlusion aware stereo matching via cooperative unsupervised learning. In Asian Conference on Computer Vision, pp. 197–213.
 [20] (2017) Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5542–5550.
 [21] (2018) Learning to generalize: meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence.
 [22] (2018) Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5400–5409.
 [23] (2018) Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 624–639.
 [24] (2018) Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80, pp. 109–117.
 [25] (2018) Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2811–2820.
 [26] (2013) Joint geodesic upsampling of depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 169–176.
 [27] (2017) Learning affinity via spatial propagation networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1520–1530.
 [28] (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4898–4906.
 [29] (2016) Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5695–5703.
 [30] (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4040–4048.
 [31] (2015) Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3061–3070.
 [32] (2017) Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5715–5725.
 [33] (2018) Batch-instance normalization for adaptively style-invariant neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 2558–2567.
 [34] (2019) Multi-level context ultra-aggregation for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3283–3291.
 [35] (2018) Two at once: enhancing learning and generalization capacities via IBN-Net. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 464–479.
 [36] (2017) Cascade residual learning: a two-stage convolutional neural network for stereo matching. IEEE International Conference on Computer Vision Workshops (ICCVW).
 [37] (2018) Zoom and learn: generalizing deep stereo matching to novel domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2070–2079.
 [38] (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2337–2346.
 [39] (2019) Guided stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 979–988.
 [40] (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pp. 31–42.
 [41] (2017) A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3260–3269.
 [42] (2019) EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. arXiv preprint arXiv:1903.01700.
 [43] (2017) Unsupervised adaptation for deep stereo. In The IEEE International Conference on Computer Vision (ICCV).
 [44] (2019) Learning to adapt for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9661–9670.
 [45] (2019) Real-time self-adaptive deep stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 195–204.
 [46] (2017) Weakly supervised learning of deep metrics for stereo reconstruction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1339–1348.
 [47] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
 [48] (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
 [49] (2018) Anytime stereo image depth estimation on mobile devices. arXiv preprint arXiv:1810.11408.
 [50] (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 501–509.
 [51] (2018) SegStereo: exploiting semantic information for disparity estimation. arXiv preprint arXiv:1807.11699.
 [52] (2012) A non-local cost aggregation method for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1402–1409.
 [53] (2019) Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6044–6053.
 [54] (2015) Computing the stereo matching cost with a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1592–1599.
 [55] (2015) Segment graph based image filtering: fast structure-preserving smoothing. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 361–369.
 [56] (2019) GA-Net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 185–194.
 [57] (2018) Fundamental principles on learning new features for effective dense matching. IEEE Transactions on Image Processing 27 (2), pp. 822–836.
 [58] (2017) Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930.
 [59] (2017) Unsupervised learning of stereo matching. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1567–1575.
Supplementary Material
Appendix A Proof of Footnote 1
Following all the variable definitions in the paper, here, we prove that
$$\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})=1,\quad \text{if}\quad \sum _{\mathbf{q}\in {N}_{\mathbf{p}}}{\omega}_{e(\mathbf{q},\mathbf{p})}=1.$$  (12) 
Since any path that reaches node $\mathbf{p}$ from another node must pass through one of its neighbors ${\mathbf{p}}^{\prime}\in {N}_{\mathbf{p}}$, we can expand $W(\mathbf{q},\mathbf{p})$ to get
$$\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},\mathbf{p})={\omega}_{e(\mathbf{p},\mathbf{p})}+\sum _{{\mathbf{p}}^{\prime}\in {N}_{\mathbf{p}},{\mathbf{p}}^{\prime}\ne \mathbf{p}}{\omega}_{e({\mathbf{p}}^{\prime},\mathbf{p})}\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},{\mathbf{p}}^{\prime}).$$ 
Following the order of ${\mathbf{p}}_{0},{\mathbf{p}}_{1}\mathrm{\dots}{\mathbf{p}}_{n}\mathrm{\dots}{\mathbf{p}}_{N}$ (Fig. 4), we can prove Eq. (12) by mathematical induction:
When $n=0$, for ${\mathbf{p}}_{0}$, $\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},{\mathbf{p}}_{0})=W({\mathbf{p}}_{0},{\mathbf{p}}_{0})={\omega}_{e({\mathbf{p}}_{0},{\mathbf{p}}_{0})}=1$.
Assume when $n\le t$, $\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},{\mathbf{p}}_{n})=1$.
We can get that for $n=t+1$:
$$\begin{array}{ccc}\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},{\mathbf{p}}_{t+1})\hfill & =\hfill & {\omega}_{e({\mathbf{p}}_{t+1},{\mathbf{p}}_{t+1})}+\sum _{{\mathbf{p}}_{k}\in {N}_{{\mathbf{p}}_{t+1}},{\mathbf{p}}_{k}\ne {\mathbf{p}}_{t+1}}{\omega}_{e({\mathbf{p}}_{k},{\mathbf{p}}_{t+1})}\sum _{\mathbf{q}\in {G}_{i}}W(\mathbf{q},{\mathbf{p}}_{k})\hfill \\ & =\hfill & {\omega}_{e({\mathbf{p}}_{t+1},{\mathbf{p}}_{t+1})}+\sum _{{\mathbf{p}}_{k}\in {N}_{{\mathbf{p}}_{t+1}},{\mathbf{p}}_{k}\ne {\mathbf{p}}_{t+1}}{\omega}_{e({\mathbf{p}}_{k},{\mathbf{p}}_{t+1})}\cdot 1\hfill \\ & =\hfill & \sum _{{\mathbf{p}}_{k}\in {N}_{{\mathbf{p}}_{t+1}}}{\omega}_{e({\mathbf{p}}_{k},{\mathbf{p}}_{t+1})}\hfill \\ & =\hfill & 1.\hfill \end{array}$$ 
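Here, $k\le t$ for every ${\mathbf{p}}_{k}\in {N}_{{\mathbf{p}}_{t+1}}$ with ${\mathbf{p}}_{k}\ne {\mathbf{p}}_{t+1}$, because the neighbors of ${\mathbf{p}}_{t+1}$ precede it in the traversal order, so the induction hypothesis applies to each of them.
This proves Eq. (12).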
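The induction can also be checked numerically. The sketch below builds a small random directed acyclic graph whose nodes are already in traversal order, normalizes the incoming weights of every node (self weight included) to sum to 1, accumulates $W(\mathbf{q},\mathbf{p})$ with the same recursion used in the proof, and confirms that every column of $W$ sums to 1. The graph layout, the matrix encoding `omega[k, n]` for edge ${\mathbf{p}}_{k}\to {\mathbf{p}}_{n}$, and the function name `path_weight_matrix` are assumptions made for illustration only.

```python
import numpy as np

def path_weight_matrix(omega):
    """Accumulated path weights W[q, n] on a DAG whose nodes are in traversal order.

    omega[k, n] (k < n) is the weight of edge p_k -> p_n and omega[n, n] is the
    self weight; a zero entry means "no edge".
    """
    n_nodes = omega.shape[0]
    W = np.zeros_like(omega)
    for n in range(n_nodes):
        W[n, n] = omega[n, n]                   # W(p, p) = omega_e(p, p)
        for k in range(n):                      # neighbors precede p_n in the order
            W[:, n] += omega[k, n] * W[:, k]    # extend every path that ends at p_k
    return W

rng = np.random.default_rng(0)
N = 8
omega = np.zeros((N, N))
for n in range(N):
    keep = rng.random(n) < 0.5                  # random subset of earlier nodes
    raw = np.append(rng.random(n) * keep, rng.random() + 0.1)
    omega[: n + 1, n] = raw / raw.sum()         # incoming weights of p_n sum to 1

W = path_weight_matrix(omega)
col_sums = W.sum(axis=0)                        # sum over q of W(q, p), one entry per p
print(np.allclose(col_sums, 1.0))
```

Dropping the normalization line makes the column sums drift away from 1, which is the situation the condition in Eq. (12) rules out.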
Appendix B Backpropagation
The backpropagation for ${\omega}_{e}$ and $C(\mathbf{p})$ in Eq. (9) can be computed by traversing the nodes in reverse order. Let the gradient from the next layer be $\frac{\partial E}{\partial {C}_{i}^{A}}$. The backpropagation can be implemented as:
$$\begin{array}{c}\frac{\partial E}{\partial C(\mathbf{p})}=\frac{\partial E}{\partial {C}_{i}^{b}(\mathbf{p})}\cdot {\omega}_{e(\mathbf{p},\mathbf{p})},\hfill \\ \frac{\partial E}{\partial {\omega}_{e(\mathbf{p},\mathbf{p})}}=\frac{\partial E}{\partial {C}_{i}^{b}(\mathbf{p})}\cdot C(\mathbf{p}),\hfill \\ \frac{\partial E}{\partial {\omega}_{e(\mathbf{q},\mathbf{p})}}=\frac{\partial E}{\partial {C}_{i}^{b}(\mathbf{p})}\cdot {C}_{i}^{A}(\mathbf{q}),\quad \mathbf{q}\in {N}_{\mathbf{p}},\ \mathbf{q}\ne \mathbf{p},\hfill \end{array}$$  (13) 
where $\frac{\partial E}{\partial {C}_{i}^{b}}$ is a temporary gradient variable which can be calculated iteratively (similar to Eq. (9)):
$$\frac{\partial E}{\partial {C}_{i}^{b}(\mathbf{p})}=\frac{\partial E}{\partial {C}_{i}^{A}(\mathbf{p})}+\sum _{\mathbf{q}\in {N}_{\mathbf{p}},\mathbf{q}\ne \mathbf{p}}\frac{\partial E}{\partial {C}_{i}^{b}(\mathbf{q})}\cdot {\omega}_{e(\mathbf{q},\mathbf{p})}$$  (14) 
The propagation of Eq. (14) is an inverse process that visits the nodes in the order ${\mathbf{p}}_{N},{\mathbf{p}}_{N-1},\mathrm{\dots},{\mathbf{p}}_{0}$.
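A small numerical sketch can make the reverse traversal concrete. It runs the forward recursion on a toy graph in which every earlier node is a neighbor of every later node, applies Eqs. (13) and (14) in reverse order, and compares the result with central finite differences. Eq. (14) is read here as accumulating, at node $\mathbf{p}$, the temporary gradients of the nodes that used $\mathbf{p}$ as a neighbor in the forward pass; that reading, the dense toy graph, and all function and variable names are assumptions for illustration.

```python
import numpy as np

def forward(C, omega):
    """A[n] = omega[n, n] * C[n] + sum_{k<n} omega[k, n] * A[k] (nodes in traversal order)."""
    A = np.zeros_like(C)
    for n in range(len(C)):
        A[n] = omega[n, n] * C[n] + omega[:n, n] @ A[:n]
    return A

def backward(C, omega, A, grad_A):
    """Gradients of E w.r.t. C and omega, given grad_A = dE/dA from the next layer."""
    N = len(C)
    b = np.zeros_like(C)                          # temporary gradient dE/dC^b
    for n in range(N - 1, -1, -1):                # reverse order p_N, ..., p_0
        b[n] = grad_A[n] + omega[n, n + 1:] @ b[n + 1:]
    grad_C = b * np.diag(omega)                   # first line of Eq. (13)
    grad_omega = np.triu(np.outer(A, b), k=1)     # edge (k, n), k < n: b[n] * A[k]
    grad_omega[np.diag_indices(N)] = b * C        # self edges: b[n] * C[n]
    return grad_C, grad_omega

rng = np.random.default_rng(1)
N = 6
C = rng.standard_normal(N)
omega = np.triu(rng.random((N, N)))               # dense toy graph, upper triangular
g = rng.standard_normal(N)                        # stand-in for dE/dA
grad_C, grad_omega = backward(C, omega, forward(C, omega), g)

def loss(C_, omega_):
    return g @ forward(C_, omega_)                # E is linear in A, so dE/dA = g

eps = 1e-6
num_C = np.array([(loss(C + eps * e, omega) - loss(C - eps * e, omega)) / (2 * eps)
                  for e in np.eye(N)])
num_omega = np.zeros((N, N))
for k in range(N):
    for n in range(k, N):
        d = np.zeros((N, N))
        d[k, n] = eps
        num_omega[k, n] = (loss(C, omega + d) - loss(C, omega - d)) / (2 * eps)

print(np.allclose(grad_C, num_C, atol=1e-5), np.allclose(grad_omega, num_omega, atol=1e-5))
```

Because each edge appears at most once on any path, the loss is linear in every individual weight, so the finite-difference estimate is exact up to rounding.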
Appendix C Details of the Architecture
Table 8 presents the details of the parameters of DSMNet. It has seven non-local filtering layers, which are used in feature extraction and cost aggregation. The proposed Domain Normalization layer replaces Batch Normalization after each 2D convolutional layer in the feature extraction and guidance networks.
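For readers who want a concrete picture of the layer that replaces Batch Normalization, the following is a minimal NumPy sketch. It assumes that Domain Normalization first normalizes each channel of each sample over its spatial extent (as instance normalization does) and then L2-normalizes the feature vector at every pixel along the channel axis; this summary of the layer, the function name `domain_norm`, the `eps` value, and the omission of any learnable scale and shift are assumptions, since the layer is defined outside this appendix.

```python
import numpy as np

def domain_norm(x, eps=1e-5):
    """Sketch of domain normalization for a feature map x of shape (N, C, H, W)."""
    # Step 1 (assumed): per-sample, per-channel normalization over H and W.
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 2 (assumed): L2 normalization of each pixel's feature vector over C.
    norm = np.sqrt((x_hat ** 2).sum(axis=1, keepdims=True) + eps)
    return x_hat / norm

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 6, 8)) * 5.0 + 3.0
y = domain_norm(x)
channel_norms = np.sqrt((y ** 2).sum(axis=1))      # close to 1 at every pixel
y_shifted = domain_norm(2.0 * x + 7.0)             # global contrast/brightness change
print(np.allclose(y, y_shifted, atol=1e-4))
```

The last two lines illustrate the intended effect under these assumptions: a global change of contrast and brightness in the input features leaves the normalized output essentially unchanged.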
Appendix D Efficiency and Parameters
As shown in Table 6, our proposed non-local filtering is a linear process that can be realized efficiently. The inference time increases by about 5% compared with the baseline. Moreover, no new parameters are introduced by the proposed domain normalization and non-local filtering layers.
Appendix E Carla Dataset
Since the synthetic SceneFlow dataset [30] contains only a limited number (about 7,000) of stereo pairs for driving scenes, we use the Carla [8] platform to produce additional stereo pairs for outdoor driving scenes. As shown in Table 7, the new Carla supplementary dataset has more diverse settings, including two image resolutions ($1280\times 720$ and $1920\times 1080$), three focal lengths, and six camera baselines (ranging from 0.2 to 1.5 m). This supplementary dataset significantly improves the diversity of the training set. As shown in Fig. 9, the Carla scenes still differ significantly from real scenes (e.g. KITTI, Cityscapes) in color and texture, but our DSMNet can extract shape and structure information for robust stereo matching. This information transfers better to real scenes and yields more accurate disparity estimation.
dataset  number of pairs  focal length  baseline settings  resolutions 

SceneFlow  34,000  450, 1050  0.54  $960\times 540$ 
Carla Stereo  20,000  640, 670, 720  0.2, 0.3, 0.5, 1.0, 1.2, 1.5  $1280\times 720$, $1920\times 1080$ 
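Varying the focal length and the baseline matters because, for a rectified stereo pair, the disparity of a point at depth $Z$ is $d=fB/Z$, so the same scene produces very different disparity ranges under different camera settings. The sketch below evaluates this relation for the extreme Carla settings in Table 7. It assumes that the focal lengths are given in pixels and the baselines in meters; the helper name `disparity_px` and the 10 m test depth are illustrative choices.

```python
def disparity_px(focal_px, baseline_m, depth_m):
    """Disparity (pixels) of a point at depth_m for a rectified stereo pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m

carla_focals = [640, 670, 720]                    # assumed to be in pixels
carla_baselines = [0.2, 0.3, 0.5, 1.0, 1.2, 1.5]  # assumed to be in meters

depth = 10.0  # meters, arbitrary test depth
d_min = disparity_px(min(carla_focals), min(carla_baselines), depth)
d_max = disparity_px(max(carla_focals), max(carla_baselines), depth)
print(f"disparity at {depth} m ranges from {d_min:.1f} to {d_max:.1f} px")
```

Under these assumptions, an object 10 m away spans disparities from about 12.8 to 108 pixels across the Carla settings, which is the kind of variation a single fixed rig cannot provide.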
No.  Layer Description  Output Tensor  
Feature Extraction  
input  normalized image pair as input  $H\times W\times 3$  
1  $3\times 3$ conv, DN, ReLU  $H\times W\times 32$  
2  $3\times 3$ conv, stride 3, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
3  $3\times 3$ conv, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
4  NLF, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
5  $3\times 3$ conv, stride 2, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 48$  
6  NLF, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 48$  
7  $3\times 3$ conv, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 48$  
8-9  repeat 5, 7  $\frac{1}{12}H\times \frac{1}{12}W\times 64$  
10-11  repeat 8-9  $\frac{1}{24}H\times \frac{1}{24}W\times 96$  
12-13  repeat 8-9  $\frac{1}{48}H\times \frac{1}{48}W\times 128$  
14  $3\times 3$ deconv, stride 2, DN, ReLU  $\frac{1}{24}H\times \frac{1}{24}W\times 96$  
15  $3\times 3$ conv, DN, ReLU  $\frac{1}{24}H\times \frac{1}{24}W\times 96$  
16-17  repeat 14-15  $\frac{1}{12}H\times \frac{1}{12}W\times 64$  
18-19  repeat 14-15  $\frac{1}{6}H\times \frac{1}{6}W\times 48$  
20  NLF, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 48$  
21-22  repeat 14-15  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
23-41  repeat 4-22  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
42  NLF, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
concatenation  (11,14) (9,16) (7,18) (4,21) (20,24) (17,27) (15,29) (13,31) (18,25) (30,33) (28,35) (26,37) (23,40)  

by feature concatenation  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
Guidance Branch  
input  concatenate 1 and upsampled 35 as input  $H\times W\times 64$  
(1)  $3\times 3$ conv, DN, ReLU  $H\times W\times 16$  
(2)  $3\times 3$ conv, stride 3, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
(3)  $3\times 3$ conv, DN, ReLU  $\frac{1}{3}H\times \frac{1}{3}W\times 32$  
(4)  $3\times 3$ conv (no bn & relu)  $\frac{1}{3}H\times \frac{1}{3}W\times 20$  
(5)  split, reshape, normalize  $4\times \frac{1}{3}H\times \frac{1}{3}W\times 5$  
(6)-(8)  from (3), repeat (3)-(5)  $4\times \frac{1}{3}H\times \frac{1}{3}W\times 5$  
(9)-(11)  from (6), repeat (6)-(8)  $4\times \frac{1}{3}H\times \frac{1}{3}W\times 5$  
(12)  from (2), $3\times 3$ conv, stride 2, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 32$  
(13)  $3\times 3$ conv, DN, ReLU  $\frac{1}{6}H\times \frac{1}{6}W\times 32$  
(14)  $3\times 3$ conv (no bn & relu)  $\frac{1}{6}H\times \frac{1}{6}W\times 20$  
(15)  split, reshape, normalize  $4\times \frac{1}{6}H\times \frac{1}{6}W\times 5$  
(16)-(18)  from (13), repeat (13)-(15)  $4\times \frac{1}{6}H\times \frac{1}{6}W\times 5$  
(19)-(21)  from (16), repeat (13)-(15)  $4\times \frac{1}{6}H\times \frac{1}{6}W\times 5$  
(22)-(24)  from (19), repeat (13)-(15)  $4\times \frac{1}{6}H\times \frac{1}{6}W\times 5$  
Cost Aggregation  
input  4D cost volume  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 64$  
[1]  $3\times 3\times 3$, 3D conv  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[2]  SGA: weight matrices from (5)  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[3]  NLF  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[4]  $3\times 3\times 3$, 3D conv  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
output  $3\times 3\times 3$, 3D to 2D conv, upsampling  $H\times W\times 193$  
softmax, regression, loss weight: 0.2  $H\times W\times 1$  
[5]  $3\times 3\times 3$, 3D conv, stride 2  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[6]  $3\times 3\times 3$, 3D conv  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[7]  SGA: weight matrices from (15)  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[8]  $3\times 3\times 3$, 3D conv, stride 2  $\frac{1}{12}H\times \frac{1}{12}W\times 16\times 64$  
[9]  $3\times 3\times 3$, 3D deconv, stride 2  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[10]  $3\times 3\times 3$, 3D conv  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[11]  SGA: weight matrices from (18)  $\frac{1}{6}H\times \frac{1}{6}W\times 32\times 48$  
[12]  $3\times 3\times 3$, 3D deconv, stride 2  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[13]  $3\times 3\times 3$, 3D conv  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[14]  SGA: weight matrices from (8)  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
[15]  NLF  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
output  $3\times 3\times 3$, 3D to 2D conv, upsampling  $H\times W\times 193$  
softmax, regression, loss weight: 0.6  $H\times W\times 1$  
[16-26]  repeat [5-15]  $\frac{1}{3}H\times \frac{1}{3}W\times 64\times 32$  
final output  $3\times 3\times 3$, 3D to 2D conv, upsampling  $H\times W\times 193$  
regression, loss weight: 1.0  $H\times W\times 1$  
connection  concatenate: (4,12), (7,9), (8,19), (11,16), (15,23), (18,20); add: (1,4) 
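The three output rows of the table (loss weights 0.2, 0.6 and 1.0) each turn a 193-channel volume into a single-channel disparity map by softmax and regression. The sketch below shows one common way to do this: a softmax over the disparity axis followed by the expected disparity, with the three predictions combined into a weighted loss. The candidate range 0 to 192, the smooth L1 loss, and all function names are assumptions for illustration, not details confirmed by the table.

```python
import numpy as np

def soft_argmax_disparity(scores):
    """scores: (H, W, D) volume -> (H, W) expected disparity under a softmax over D."""
    z = scores - scores.max(axis=-1, keepdims=True)   # stabilize the exponentials
    prob = np.exp(z)
    prob /= prob.sum(axis=-1, keepdims=True)
    candidates = np.arange(scores.shape[-1], dtype=np.float64)  # assumed range 0..D-1
    return prob @ candidates

def smooth_l1(pred, gt):
    """Mean smooth L1 error (assumed loss; quadratic below 1 px, linear above)."""
    diff = np.abs(pred - gt)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

def total_loss(preds, gt, weights=(0.2, 0.6, 1.0)):
    """Weighted sum over the two intermediate outputs and the final output."""
    return sum(w * smooth_l1(p, gt) for w, p in zip(weights, preds))

# Toy volume with a sharp peak at disparity 50 for every pixel.
scores = np.zeros((2, 3, 193))
scores[..., 50] = 50.0
disp = soft_argmax_disparity(scores)

gt = np.full((2, 3), 50.0)
loss = total_loss([gt + 2.0, gt + 0.5, disp], gt)
print(disp[0, 0], loss)
```

Because the regression is a weighted average rather than a hard argmax, it is differentiable and can produce sub-pixel disparities, which is why it is commonly paired with per-output loss weights during training.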
Appendix F More Results
F.1 Feature Visualization
As compared in Fig. 10, the features of the state-of-the-art models are mainly local patterns, which can contain many artifacts (e.g. noise) under domain shifts. Our DSMNet mainly captures non-local structure and shape information, which is robust for cross-domain generalization. There are no such artifacts in the feature maps of our DSMNet.
F.2 Disparity Results on Different Datasets
More results and comparisons are provided in Fig. 11. All the models are trained on the synthetic dataset and tested on the real KITTI, Middlebury, ETH3D and Cityscapes datasets.
F.3 Comparisons with Other Non-local Strategies
Our graph-based filtering strategy is better at capturing the structural and geometric context for robust domain-invariant stereo matching. The non-local neural network denoising [50] and non-local attention [15] lack spatial constraints, which usually leads to over-smoothing of depth edges and thin structures (as shown in Fig. 12).