Pixel-Adaptive Convolutional Neural Networks
Abstract
Convolutions are the fundamental building blocks of CNNs. The fact that their weights are spatially shared is one of the main reasons for their widespread use, but it is also a major limitation, as it makes convolutions content-agnostic. We propose a pixel-adaptive convolution (PAC) operation, a simple yet effective modification of standard convolutions, in which the filter weights are multiplied with a spatially varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and thus can be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to fully-connected CRF (Full-CRF), called PAC-CRF, which performs competitively compared to Full-CRF, while being considerably faster. In addition, we also demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements.
1 Introduction
Convolution is a basic operation in many image processing and computer vision applications and the major building block of Convolutional Neural Network (CNN) architectures. It forms one of the most prominent ways of propagating and integrating features across image pixels due to its simplicity and highly optimized CPU/GPU implementations. In this work, we concentrate on two important characteristics of standard spatial convolution and aim to alleviate some of its drawbacks: spatial sharing and its content-agnostic nature.
Spatial Sharing: A typical CNN shares filters’ parameters across the whole input. In addition to affording translation invariance to the CNN, spatially invariant convolutions significantly reduce the number of parameters compared with fully connected layers. However, spatial sharing is not without drawbacks. For dense pixel prediction tasks, such as semantic segmentation, the loss is spatially varying because of varying scene elements on a pixel grid. Thus the optimal gradient direction for parameters differs at each pixel. However, due to the spatial sharing nature of convolution, the loss gradients from all image locations are globally pooled to train each filter. This forces the CNN to learn filters that minimize the error across all pixel locations at once, but may be suboptimal at any specific location.
Content-Agnostic: Once a CNN is trained, the same convolutional filter banks are applied to all the images and all the pixels irrespective of their content. The image content varies substantially across images and pixels. Thus a single trained CNN may not be optimal for all image types (e.g., images taken in daylight and at night) as well as different pixels in an image (e.g., sky vs. pedestrian pixels). Ideally, we would like CNN filters to be adaptive to the type of image content, which is not the case with standard CNNs. These drawbacks can be tackled by learning a large number of filters in an attempt to capture both image and pixel variations. This, however, increases the number of parameters, requiring a larger memory footprint and an extensive amount of labeled data. A different approach is to use content-adaptive filters inside the networks.
Existing content-adaptive convolutional networks can be broadly categorized into two types. One class of techniques makes traditional image-adaptive filters, such as bilateral filters [2, 42] and guided image filters [18], differentiable, and uses them as layers inside a CNN [25, 29, 52, 11, 9, 21, 30, 8, 13, 31, 44, 46]. These content-adaptive layers are usually designed for enhancing CNN results but not as a replacement for standard convolutions. Another class of content-adaptive networks involves learning position-specific kernels using a separate sub-network that predicts convolutional filter weights at each pixel. These are called “Dynamic Filter Networks” (DFN) [48, 22, 12, 47] (also referred to as cross-convolution [48] or kernel prediction networks [4]) and have been shown to be useful in several computer vision tasks. Although DFNs are generic and can be used as a replacement for standard convolution layers, such a kernel prediction strategy is difficult to scale to an entire network with a large number of filter banks.
In this work, we propose a new content-adaptive convolution layer that addresses some of the limitations of the existing content-adaptive layers while retaining several favorable properties of spatially invariant convolution. Fig. 1 illustrates our content-adaptive convolution operation, which we call “Pixel-Adaptive Convolution” (PAC). Unlike a typical DFN, where different kernels are predicted at different pixel locations, we adapt a standard spatially invariant convolution filter $\mathbf{W}$ at each pixel by multiplying it with a spatially varying filter $K$, which we refer to as an “adapting kernel”. This adapting kernel has a predefined form (e.g., Gaussian or Laplacian) and depends on the pixel features. For instance, the adapting kernel that we mainly use in this work is Gaussian: $e^{-\frac{1}{2}\|\mathbf{f}_i-\mathbf{f}_j\|^2}$, where $\mathbf{f}_i\in\mathbb{R}^d$ is a $d$-dimensional feature at the $i^{th}$ pixel. We refer to these pixel features $\mathbf{f}$ as “adapting features”, and they can be either predefined, such as pixel position and color features, or can be learned using a CNN.
We observe that PAC, despite being a simple modification to standard convolution, is highly flexible and can be seen as a generalization of several widely used filters. Specifically, we show that PAC is a generalization of spatial convolution, bilateral filtering [2, 42], and pooling operations such as average pooling and detail-preserving pooling [36]. We also implement a variant of PAC that does pixel-adaptive transposed convolution (also called deconvolution), which can be used for learnable guided upsampling of intermediate CNN representations. We discuss these generalizations and variants further in Sec. 3.
As a result of its simplicity and being a generalization of several widely used filtering techniques, PAC can be useful in a wide range of computer vision problems. In this work, we demonstrate its applicability in three different vision problems. In Sec. 4, we use PAC in joint image upsampling networks and obtain state-of-the-art results on both depth and optical flow upsampling tasks. In Sec. 5, we use PAC in a learnable conditional random field (CRF) framework and observe consistent improvements with respect to the widely used fully-connected CRF [25]. In Sec. 6, we demonstrate how to use PAC as a drop-in replacement of trained convolution layers in a CNN and obtain performance improvements after fine-tuning. In summary, we observe that PAC is highly versatile and has wide applicability in a range of computer vision tasks.
2 Related Work
Image-adaptive filtering. Some important image-adaptive filtering techniques include bilateral filtering [2, 42], guided image filtering [18], non-local means [6, 3], and propagated image filtering [35], to name a few. A common line of research is to make these filters differentiable and use them as content-adaptive CNN layers. Early work [52, 11] in this direction backpropagates through bilateral filtering and can thus leverage fully-connected CRF inference [25] on the output of CNNs. The work of [21] and [13] proposes to use bilateral filtering layers inside CNN architectures. Chandra et al. [8] propose a layer that performs closed-form Gaussian CRF inference in a CNN. Chen et al. [9] and Liu et al. [31] propose differentiable local propagation modules that have roots in domain transform filtering [14]. Wu et al. [46] and Wang et al. [44] propose neural network layers to perform guided filtering [18] and non-local means [44], respectively, inside CNNs. Since these techniques are tailored towards a particular CRF or adaptive filtering technique, they are used for specific tasks and cannot be directly used as a replacement of general convolution. Closest to our work are the sparse, high-dimensional neural networks [21], which generalize standard 2D convolutions to high-dimensional convolutions, enabling them to be content-adaptive. Although conceptually more generic than PAC, such high-dimensional networks cannot learn the adapting features and have a larger computational overhead due to the use of specialized lattices and hash tables.
Dynamic filter networks. Introduced by Jia et al. [22], dynamic filter networks (DFN) are an example of another class of content-adaptive filtering techniques. Filter weights are themselves directly predicted by a separate network branch, and provide custom filters specific to different input data. The work is later extended by Wu et al. [47] with an additional attention mechanism and a dynamic sampling strategy to allow the position-specific kernels to also learn from multiple neighboring regions. Similar ideas have been applied to several task-specific use cases, e.g., motion prediction [48], semantic segmentation [17], and Monte Carlo rendering denoising [4]. Explicitly predicting all position-specific filter weights requires a large number of parameters, so DFNs typically require a sensible architecture design and are difficult to scale to multiple dynamic-filter layers. Our approach differs in that PAC reuses spatial filters just as standard convolution does, and only modifies the filters in a position-specific fashion. Dai et al. propose deformable convolution [12], which can also produce position-specific modifications to the filters. Different from PAC, the modifications there are represented as offsets with an emphasis on learning geometric-invariant features.
Self-attention mechanism. Our work is also related to the self-attention mechanism originally proposed by Vaswani et al. for machine translation [43]. Self-attention modules compute the responses at each position while attending to the global context. Thanks to the use of global information, self-attention has been successfully used in several applications, including image generation [51, 34] and video activity recognition [44]. Attending to the whole image can be computationally expensive, and, as a result, can only be afforded on low-dimensional feature maps, e.g., as in [44]. Our layer produces responses that are sensitive to a more local context (which can be alleviated through dilation), and is therefore much more efficient.
3 Pixel-Adaptive Convolution
In this section, we start with a formal definition of standard spatial convolution and then explain our generalization of it to arrive at our pixel-adaptive convolution (PAC). Later, we will discuss several variants of PAC and how they are connected to different image filtering techniques. Formally, a spatial convolution of image features $\mathbf{v}=(\mathbf{v}_1,\dots,\mathbf{v}_n),\ \mathbf{v}_i\in\mathbb{R}^{c}$ over $n$ pixels and $c$ channels with filter weights $\mathbf{W}\in\mathbb{R}^{c'\times c\times s\times s}$ can be written as
$$\mathbf{v}_i'=\sum_{j\in\Omega(i)}\mathbf{W}\left[\mathbf{p}_i-\mathbf{p}_j\right]\mathbf{v}_j+\mathbf{b}$$  (1)
where $\mathbf{p}_i=(x_i,y_i)^{\intercal}$ are pixel coordinates, $\Omega(\cdot)$ defines an $s\times s$ convolution window, and $\mathbf{b}\in\mathbb{R}^{c'}$ denotes biases. With a slight abuse of notation, we use $[\mathbf{p}_i-\mathbf{p}_j]$ to denote indexing of the spatial dimensions of an array with 2D spatial offsets. This convolution operation results in a $c'$-channel output, $\mathbf{v}_i'\in\mathbb{R}^{c'}$, at each pixel $i$. Eq. 1 highlights how the weights only depend on pixel position and thus are agnostic to image content. In other words, the weights are spatially shared and, therefore, image-agnostic. As outlined in Sec. 1, these properties of spatial convolutions are limiting: we would like the filter weights $\mathbf{W}$ to be content-adaptive.
One approach to make the convolution operation content-adaptive, rather than only based on pixel locations, is to generalize $\mathbf{W}$ to depend on the pixel features, $\mathbf{f}\in\mathbb{R}^{d}$:
$$\mathbf{v}_i'=\sum_{j\in\Omega(i)}\mathbf{W}\left(\mathbf{f}_i-\mathbf{f}_j\right)\mathbf{v}_j+\mathbf{b}$$  (2)
where $\mathbf{W}$ can be seen as a high-dimensional filter operating in a $d$-dimensional feature space. In other words, we can apply Eq. 2 by first projecting the input signal $\mathbf{v}$ into a $d$-dimensional space, and then performing $d$-dimensional convolution with $\mathbf{W}$. Traditionally, such high-dimensional filtering is limited to hand-specified filters such as Gaussian filters [1]. Recent work [21] lifts this restriction and proposes a technique to freely parameterize and learn $\mathbf{W}$ in high-dimensional space. Although generic and used successfully in several computer vision applications [21, 20, 39], high-dimensional convolutions have several shortcomings. First, since data projected on a higher-dimensional space is sparse, special lattice structures and hash tables are needed to perform the convolution [1], resulting in considerable computational overhead. Second, it is difficult to learn features $\mathbf{f}$, resulting in the use of hand-specified feature spaces such as position and color features, $\mathbf{f}=(x,y,r,g,b)$. Third, we have to restrict the dimensionality $d$ of features, as the projected input image can become too sparse in high-dimensional spaces. In addition, the advantages that come with spatial sharing of standard convolution are lost with high-dimensional filtering.
Pixel-adaptive convolution. Instead of bringing convolution to higher dimensions, which has the above-mentioned drawbacks, we choose to modify the spatially invariant convolution in Eq. 1 with a spatially varying kernel $K\in\mathbb{R}^{c'\times c\times s\times s}$ that depends on pixel features $\mathbf{f}$:
$$\mathbf{v}_i'=\sum_{j\in\Omega(i)}K(\mathbf{f}_i,\mathbf{f}_j)\,\mathbf{W}\left[\mathbf{p}_i-\mathbf{p}_j\right]\mathbf{v}_j+\mathbf{b}$$  (3)
where $K$ is a kernel function that has a fixed parametric form such as Gaussian: $K(\mathbf{f}_i,\mathbf{f}_j)=\exp\left(-\frac{1}{2}(\mathbf{f}_i-\mathbf{f}_j)^{\intercal}(\mathbf{f}_i-\mathbf{f}_j)\right)$. Since $K$ has a predefined form and is not parameterized as a high-dimensional filter, we can perform this filtering on the 2D grid itself without moving onto higher dimensions. We call the above filtering operation (Eq. 3) “Pixel-Adaptive Convolution” (PAC) because the standard spatial convolution $\mathbf{W}$ is adapted at each pixel using pixel features $\mathbf{f}$ via kernel $K$. We call these pixel features $\mathbf{f}$ “adapting features” and the kernel $K$ the “adapting kernel”. The adapting features $\mathbf{f}$ can be either hand-specified, such as position and color features $\mathbf{f}=(x,y,r,g,b)$, or can be deep features that are learned end-to-end.
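To make Eq. 3 concrete, the following is a minimal pure-Python sketch for a single input and a single output channel. It is not the authors' PyTorch layer: the function names, the zero-padding behavior at the border, and the cross-correlation style indexing of the filter (as commonly used in deep learning libraries) are assumptions made for illustration.

```python
import math

def gaussian_kernel(fi, fj):
    """Adapting kernel K(f_i, f_j) = exp(-0.5 * ||f_i - f_j||^2)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(fi, fj)))

def pac_single_channel(v, f, weights, bias=0.0):
    """Eq. 3 for one input and one output channel.

    v: 2D grid of scalars, f: 2D grid of feature tuples,
    weights: s x s filter with odd s. Out-of-grid pixels are skipped,
    which is equivalent to zero padding (an assumption of this sketch).
    """
    height, width = len(v), len(v[0])
    r = len(weights) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        # spatially shared weight, rescaled by the adapting kernel
                        k = gaussian_kernel(f[y][x], f[yy][xx])
                        acc += k * weights[dy + r][dx + r] * v[yy][xx]
            out[y][x] = acc + bias
    return out

v = [[1.0, 2.0], [3.0, 4.0]]
ones = [[1.0] * 3 for _ in range(3)]
flat = [[(0.0,), (0.0,)], [(0.0,), (0.0,)]]        # constant features: K = 1
split = [[(0.0,), (100.0,)], [(0.0,), (100.0,)]]   # two very different columns

plain = pac_single_channel(v, flat, ones)     # behaves like a standard convolution
adapted = pac_single_channel(v, split, ones)  # each column is filtered separately
```

With constant features the kernel is 1 everywhere and every output sums the whole 2×2 grid. With the two-column features the kernel between columns is numerically zero, so each pixel only aggregates values from its own column.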
Generalizations. PAC, despite being a simple modification to standard convolution, generalizes several widely used filtering operations, including

• Spatial Convolution can be seen as a special case of PAC with the adapting kernel being constant, $K(\mathbf{f}_i,\mathbf{f}_j)=1$. This can be achieved by using constant adapting features, $\mathbf{f}_i=\mathbf{f}_j,\ \forall i,j$. In brief, standard convolution (Eq. 1) uses fixed, spatially shared filters, while PAC allows the filters to be modified by the adapting kernel $K$ differently across pixel locations.
• Bilateral Filtering [42] is a basic image processing operation that has found wide-ranging uses [33] in image processing, computer vision and also computer graphics. The standard bilateral filtering operation can be seen as a special case of PAC, where $\mathbf{W}$ also has a fixed parametric form, such as a 2D Gaussian filter, $\mathbf{W}\left[\mathbf{p}_i-\mathbf{p}_j\right]=\exp\left(-\frac{1}{2}(\mathbf{p}_i-\mathbf{p}_j)^{\intercal}\Sigma^{-1}(\mathbf{p}_i-\mathbf{p}_j)\right)$.
• Pooling operations can also be modeled by PAC. Standard average pooling corresponds to the special case of PAC where $K(\mathbf{f}_i,\mathbf{f}_j)=1,\ \mathbf{W}=\frac{1}{s^2}\cdot\mathbf{1}$. Detail-Preserving Pooling [36, 45] is a recently proposed pooling layer that is useful to preserve high-frequency details when performing pooling in CNNs. PAC can model the detail-preserving pooling operations by incorporating an adapting kernel that emphasizes more distinct pixels in the neighborhood, e.g., $K(\mathbf{f}_i,\mathbf{f}_j)=\alpha+\left(\|\mathbf{f}_i-\mathbf{f}_j\|^2+\epsilon^2\right)^{\lambda}$.
The above generalizations show the generality and the wide applicability of PAC in different settings and applications. We experiment using PAC in three different problem scenarios, which will be discussed in later sections.
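The special cases listed above can be illustrated with a small 1D sketch in which only the adapting kernel and the spatial filter change. The helper `pac_1d` and the toy signal are hypothetical, and the edge-aware variant is left unnormalized to mirror the form of Eq. 3, whereas a textbook bilateral filter also divides by the sum of its weights.

```python
import math

def pac_1d(v, f, w, kernel):
    """1D analogue of Eq. 3: out[i] = sum_j kernel(f[i], f[j]) * w[j - i + r] * v[j]."""
    r = len(w) // 2
    out = []
    for i in range(len(v)):
        acc = 0.0
        for off in range(-r, r + 1):
            j = i + off
            if 0 <= j < len(v):  # neighbors outside the signal are skipped
                acc += kernel(f[i], f[j]) * w[off + r] * v[j]
        out.append(acc)
    return out

def constant(fi, fj):
    return 1.0  # K = 1 recovers standard convolution / average pooling

def gaussian(fi, fj):
    return math.exp(-0.5 * (fi - fj) ** 2)  # content-dependent weighting

signal = [0.0, 0.0, 0.0, 9.0, 9.0, 9.0]  # a step edge, also used as the feature
box = [1 / 3] * 3                        # W = (1/s) * 1 for s = 3

avg_pool = pac_1d(signal, signal, box, constant)    # stride-1 average pooling
edge_aware = pac_1d(signal, signal, box, gaussian)  # bilateral-style filtering
```

At the pixel just left of the edge, average pooling mixes in the bright neighbor, while the Gaussian adapting kernel suppresses it almost entirely, so the edge is preserved.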
Some filtering operations are even more general than the proposed PAC. Examples include high-dimensional filtering shown in Eq. 2 and others such as dynamic filter networks (DFN) [22] discussed in Sec. 2. Unlike most of those general filters, PAC allows efficient learning and reuse of spatially invariant filters because it is a direct modification of standard convolution filters. PAC offers a good trade-off between standard convolution and DFNs. In DFNs, filters are solely generated by an auxiliary network, and different auxiliary networks or layers are required to predict kernels for different dynamic-filter layers. PAC, on the other hand, uses learned pixel embeddings $\mathbf{f}$ as adapting features, which can be reused across several different PAC layers in a network. When related to sparse high-dimensional filtering in Eq. 2, PAC can be seen as factoring the high-dimensional filter into a product of the standard spatial filter $\mathbf{W}$ and the adapting kernel $K$. This allows efficient implementation of PAC in 2D space, alleviating the need for using hash tables and special lattice structures in high dimensions. PAC can also use learned pixel embeddings $\mathbf{f}$ instead of the hand-specified ones in existing learnable high-dimensional filtering techniques such as [21].
Implementation and variants. We implemented PAC as a network layer in PyTorch with GPU acceleration (code will be available at https://suhangpro.github.io/pac/). Our implementation enables backpropagation through the features $\mathbf{f}$, permitting the use of learnable deep features as adapting features. We also implement a PAC variant that does pixel-adaptive transposed convolution (also called “deconvolution”). We refer to the pixel-adaptive convolution shown in Eq. 3 as PAC and the transposed counterpart as PAC${}^{\intercal}$. Similar to standard transposed convolution, PAC${}^{\intercal}$ uses fractional striding and results in an upsampled output. Our PAC and PAC${}^{\intercal}$ implementations allow easy and flexible specification of different options that are commonly used in standard convolution: filter size, number of input and output channels, striding, padding and dilation factor.
4 Deep Joint Upsampling Networks
Joint upsampling is the task of upsampling a low-resolution signal with the help of a corresponding high-resolution guidance image. An example is upsampling a low-resolution depth map given a corresponding high-resolution RGB image as guidance. Joint upsampling is useful when some sensors output at a lower resolution than cameras, or can be used to speed up computer vision applications where full-resolution results are expensive to produce. PAC allows filtering operations to be guided by the adapting features, which can be obtained from a separate guidance image, making it an ideal choice for joint image processing. We investigate the use of PAC for joint upsampling applications. In this section, we introduce a network architecture that relies on PAC for deep joint upsampling, and show experimental results on two applications: joint depth upsampling and joint optical flow upsampling.
4.1 Deep joint upsampling with PAC
A deep joint upsampling network takes two inputs, a low-resolution signal $\mathbf{x}\in\mathbb{R}^{c\times h/m\times w/m}$ and a high-resolution guidance $\mathbf{g}\in\mathbb{R}^{c_g\times h\times w}$, and outputs an upsampled signal $\mathbf{x}_{\uparrow}\in\mathbb{R}^{c\times h\times w}$. Here $m$ is the required upsampling factor. Similar to [27], our upsampling network has three components (as illustrated in Fig. 2):

• Encoder branch operates directly on the low-resolution signal with convolution (CONV) layers.
• Guidance branch operates solely on the guidance image, and generates adapting features that will be used in all PAC${}^{\intercal}$ layers later in the network.
• Decoder branch starts with a sequence of PAC${}^{\intercal}$ layers, which perform transposed pixel-adaptive convolution, each of which upsamples the feature maps by a factor of 2. The PAC${}^{\intercal}$ layers are followed by two CONV layers to generate the final upsampled output.
Each of the CONV and PAC${}^{\intercal}$ layers, except the final one, is followed by a rectified linear unit (ReLU).
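Table 1: RMSE of joint depth upsampling for different upsampling factors.

| Method | 4$\times$ | 8$\times$ | 16$\times$ |
| --- | --- | --- | --- |
| Bicubic | 8.16 | 14.22 | 22.32 |
| MRF | 7.84 | 13.98 | 22.20 |
| GF [18] | 7.32 | 13.62 | 22.03 |
| JBU [24] | 4.07 | 8.29 | 13.35 |
| Ham et al. [15] | 5.27 | 12.31 | 19.24 |
| DMSG [19] | 3.78 | 6.37 | 11.16 |
| FBS [5] | 4.29 | 8.94 | 14.59 |
| DJF [27] | 3.54 | 6.20 | 10.21 |
| DJF+ [28] | 3.38 | 5.86 | 10.11 |
| DJF (Our impl.) | 2.64 | 5.15 | 9.39 |
| Ours-lite | 2.55 | 4.82 | 8.52 |
| Ours | 2.39 | 4.59 | 8.09 |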
4.2 Joint depth upsampling
Here, the task is to upsample a low-resolution depth map by using a high-resolution RGB image as guidance. We experiment with the NYU Depth V2 dataset [38], which has 1449 RGB-depth pairs. Following [27], we use the first 1000 samples for training and the rest for testing. The low-resolution depth maps are obtained from the ground-truth depth maps using nearest-neighbor downsampling. Tab. 1 shows the root mean square error (RMSE) of different techniques for different upsampling factors $m$ (4$\times$, 8$\times$, 16$\times$). Results indicate that our network outperforms others in comparison and obtains state-of-the-art performance. Sample visual results are shown in Fig. 3.
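The evaluation protocol described here can be sketched in a few lines. The snippet below is only illustrative: the strided-sampling convention for nearest-neighbor downsampling, the toy 4×4 depth map, and the naive nearest-neighbor upsampling baseline are assumptions, not the exact procedure used for Tab. 1.

```python
import math

def downsample_nearest(depth, m):
    """Keep every m-th pixel in both directions (one common nearest-neighbor convention)."""
    return [row[::m] for row in depth[::m]]

def rmse(pred, target):
    """Root mean square error over all pixels of two equally sized 2D grids."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return math.sqrt(sum(sq) / len(sq))

# Toy ground-truth depth map with values 0..15.
depth = [[float(4 * y + x) for x in range(4)] for y in range(4)]
low = downsample_nearest(depth, 2)  # 2x2 low-resolution input

# Naive nearest-neighbor upsampling, standing in for a learned upsampler.
up = [[low[y // 2][x // 2] for x in range(4)] for y in range(4)]
error = rmse(up, depth)
```

A joint upsampling network would replace the naive `up` step, and the same `rmse` function would then score its prediction against the full-resolution ground truth.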
We train our network with the Adam optimizer using a learning rate schedule of [$10^{-4}\times$ 3.5k, $10^{-5}\times$ 1.5k, $10^{-6}\times$ 0.5k] and with mini-batches of 256$\times$256 crops. We found this training setup to be superior to the one recommended in DJF [27], and also compare with our own implementation of it under such a setting (“DJF (Our impl.)” in Tab. 1). We keep the network architecture similar to that of the previous state-of-the-art technique, DJF [27]. In DJF, features from the guidance branch are simply concatenated with encoder outputs for upsampling, whereas we use guidance features to adapt PAC${}^{\intercal}$ kernels. Although it has a similar number of layers, our network has more parameters than DJF (see appendix for details). We also trained a lighter version of our network (“Ours-lite”) that matches the number of parameters of DJF, and still observe better performance, showing the importance of PAC${}^{\intercal}$ for upsampling.
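As an illustration, the schedule can be read as a piecewise-constant learning rate: $10^{-4}$, then $10^{-5}$, then $10^{-6}$, held for 3.5k, 1.5k and 0.5k units respectively. The sketch below assumes this reading; the function name is hypothetical, and whether a unit is an epoch or an iteration is not stated in this section.

```python
def learning_rate(step, schedule=((1e-4, 3500), (1e-5, 1500), (1e-6, 500))):
    """Piecewise-constant learning rate; each schedule entry is (rate, duration)."""
    start = 0
    for rate, duration in schedule:
        if step < start + duration:
            return rate
        start += duration
    return schedule[-1][0]  # past the end of the schedule: keep the last rate

total_steps = 3500 + 1500 + 500
rates = [learning_rate(s) for s in (0, 3499, 3500, 4999, 5000, total_steps - 1)]
```

The value returned for a given step could be passed to any optimizer that accepts a per-step learning rate.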
4.3 Joint optical flow upsampling
We also evaluate our joint upsampling network for upsampling low-resolution optical flow using the original RGB image as guidance. Estimating optical flow is a challenging task, and even recent state-of-the-art approaches [40] resort to simple bilinear upsampling to predict optical flow at the full resolution. Optical flow is smoothly varying within motion boundaries, where accompanying RGB images can offer strong clues, making joint upsampling an appealing solution. We use the same network architecture as in the depth upsampling experiments, with the only difference being that instead of single-channel depth, input and output are two-channel flow with $u,v$ components. We experiment with the Sintel dataset [7] (clean pass). The same training protocol as in Sec. 4.2 is used, and the low-resolution optical flow is obtained from bilinear downsampling of the ground truth. We compare with baselines of bilinear interpolation and DJF [27], and observe a consistent advantage (Tab. 2). Fig. 3 shows a sample visual result indicating that our network is capable of restoring fine-structured details and also produces smoother predictions in areas with uniform motion.
5 Conditional Random Fields
Early adoptions of CRFs in computer vision tasks were limited to region-based approaches and short-range structures [37] for efficiency reasons. Fully-Connected CRF (Full-CRF) [25] was proposed to offer the benefits of dense pairwise connections among pixels, and resorts to approximate high-dimensional filtering [1] for efficient inference. Consider a semantic labeling problem, where each pixel $i$ in an image $I$ can take one of the semantic labels $l_i\in\{1,\dots,\mathcal{L}\}$. Full-CRF has unary potentials usually defined by a classifier such as a CNN: $\psi_u(l_i)\in\mathbb{R}^{\mathcal{L}}$. The pairwise potentials are defined for every pair of pixel locations $(i,j)$: $\psi_p(l_i,l_j|I)=\mu(l_i,l_j)K(\mathbf{f}_i,\mathbf{f}_j)$, where $K$ is a kernel function and $\mu$ is a compatibility function. A common choice for $\mu$ is the Potts model: $\mu(l_i,l_j)=[l_i\ne l_j]$. [25] utilizes two Gaussian kernels with hand-crafted features as the kernel function:
$$K(\mathbf{f}_i,\mathbf{f}_j)=w_1\exp\left\{-\frac{\|\mathbf{p}_i-\mathbf{p}_j\|^2}{2\theta_\alpha^2}-\frac{\|I_i-I_j\|^2}{2\theta_\beta^2}\right\}+w_2\exp\left\{-\frac{\|\mathbf{p}_i-\mathbf{p}_j\|^2}{2\theta_\gamma^2}\right\}$$  (4)
where $w_1,w_2,\theta_\alpha,\theta_\beta,\theta_\gamma$ are model parameters, and are typically found by a grid search. Then, inference in Full-CRF amounts to maximizing the following Gibbs distribution: $P(\mathbf{l}|I)=\frac{1}{Z}\exp\left(-\sum_i\psi_u(l_i)-\sum_{i<j}\psi_p(l_i,l_j|I)\right)$, $\mathbf{l}=(l_1,l_2,\dots,l_n)$. Exact inference of Full-CRF is hard, and [25] relies on a mean-field approximation, which optimizes an approximate distribution $Q(\mathbf{l})=\prod_i Q_i(l_i)$ by minimizing the KL-divergence between $P(\mathbf{l}|I)$ and the mean-field approximation $Q(\mathbf{l})$. This leads to the following mean-field (MF) inference step that updates the marginal distributions $Q_i$ iteratively for $t=0,1,\dots$:
$$Q_i^{(t+1)}(l)\leftarrow\frac{1}{Z_i}\exp\Big\{-\psi_u(l)-\sum_{l'\in\mathcal{L}}\mu(l,l')\sum_{j\ne i}K(\mathbf{f}_i,\mathbf{f}_j)\,Q_j^{(t)}(l')\Big\}$$  (5)
The main computation in each MF iteration, $\sum_{j\ne i}K(\mathbf{f}_i,\mathbf{f}_j)Q_j^{(t)}$, can be viewed as high-dimensional Gaussian filtering. Previous work [25, 26] relies on permutohedral lattice convolution [1] to achieve an efficient implementation.
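The mean-field update can be written out by brute force for a toy problem, which makes the role of the kernel explicit. The sketch below is a naive $O(n^2)$ illustration and not the lattice-based implementation used in prior work; the three-pixel example, the single appearance-only Gaussian kernel, the sign convention $\exp\{-\psi_u-\dots\}$, and all function names are assumptions.

```python
import math

def normalize_exp(scores):
    """Return exp(scores) normalized to sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mean_field_step(Q, unary, K, mu):
    """One parallel mean-field update.

    Q[i][l]: current marginals, unary[i][l]: unary energies psi_u,
    K[i][j]: kernel values, mu[l][l2]: label compatibility.
    """
    n, num_labels = len(Q), len(Q[0])
    new_Q = []
    for i in range(n):
        scores = []
        for l in range(num_labels):
            pairwise = 0.0
            for l2 in range(num_labels):
                # message passing: kernel-weighted sum over all other pixels
                message = sum(K[i][j] * Q[j][l2] for j in range(n) if j != i)
                pairwise += mu[l][l2] * message
            scores.append(-unary[i][l] - pairwise)
        new_Q.append(normalize_exp(scores))
    return new_Q

intensity = [0.0, 0.1, 5.0]  # pixel 1 looks like pixel 0, not like pixel 2
K = [[math.exp(-0.5 * (a - b) ** 2) for b in intensity] for a in intensity]
mu = [[0.0, 1.0], [1.0, 0.0]]                 # Potts model
unary = [[0.0, 3.0], [0.6, 0.4], [3.0, 0.0]]  # pixel 1 is ambiguous
Q0 = [normalize_exp([-u for u in row]) for row in unary]
Q1 = mean_field_step(Q0, unary, K, mu)
```

In this toy setup the ambiguous middle pixel initially leans towards label 1, but after one update it follows its similar-looking neighbor and prefers label 0.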
5.1 Efficient, learnable CRF with PAC
Existing work [52, 21] backpropagates through the above MF steps to combine CRF inference with CNNs, resulting in end-to-end training of CNN-CRF models. While there exist optimized CPU implementations, permutohedral lattice convolution cannot easily utilize GPUs because it “does not follow the SIMD paradigm of efficient GPU computation” [41]. Another drawback of relying on permutohedral lattice convolution is the approximation error incurred during both inference and gradient computation.
We propose PAC-CRF, which alleviates these computation issues by relying on PAC for efficient inference, and is easy to integrate with existing CNN backbones. PAC-CRF also has additional learning capacity, which leads to better performance compared with Full-CRF in our experiments.
PAC-CRF. In PAC-CRF, we define pairwise connections over fixed windows $\Omega^k$ around each pixel instead of dense connections: $\sum_k\sum_i\sum_{j\in\Omega^k(i)}\psi_p^k(l_i,l_j|I)$, where the $k$-th pairwise potential is defined as
$$\psi_p^k(l_i,l_j|I)=K^k(\mathbf{f}_i,\mathbf{f}_j)\,\mathbf{W}^k_{l_j l_i}[\mathbf{p}_j-\mathbf{p}_i]$$  (6)
Here $\Omega^k(\cdot)$ specifies the pairwise connection pattern of the $k$-th pairwise potential originating from each pixel, and $K^k$ is a fixed Gaussian kernel. Intuitively, this formulation allows the label compatibility transform $\mu$ in Full-CRF to be modeled by $\mathbf{W}$, and to vary across different spatial offsets. A derivation similar to that of Full-CRF yields the following iterative MF update rule (see appendix for more details):
$$Q_i^{(t+1)}(l)\leftarrow\frac{1}{Z_i}\exp\Big\{-\psi_u(l)-\sum_k\underbrace{\sum_{l'\in\mathcal{L}}\sum_{j\in\Omega^k(i)}K^k(\mathbf{f}_i,\mathbf{f}_j)\,\mathbf{W}^k_{l'l}[\mathbf{p}_j-\mathbf{p}_i]\,Q_j^{(t)}(l')}_{\text{PAC}}\Big\}$$  (7)
The MF update now consists of PAC instead of sparse high-dimensional filtering as in Full-CRF (Eq. 5). As outlined in Sec. 2, there are several advantages of PAC over high-dimensional filtering. With PAC-CRF, we can freely parameterize and learn the pairwise potentials in Eq. 6, which also use a richer form of compatibility transform $\mathbf{W}$. PAC-CRF can also make use of learnable features $\mathbf{f}$ for pairwise potentials instead of the predefined ones in Full-CRF. Fig. 4 (left) illustrates the computation steps in each MF step with two pairwise PAC kernels.
Long-range connections with dilated PAC. The major source of heavy computation in Full-CRF is the dense pairwise pixel connections. In PAC-CRF, the pairwise connections are defined by the local convolution windows $\Omega^k$. To have long-range pairwise connections while keeping the number of PAC parameters manageable, we make use of dilated filters [10, 49]. Even with a relatively small kernel size ($5\times 5$), with a large dilation, e.g., $64$, the CRF can effectively reach a neighborhood of $257\times 257$. A concurrent work [41] also proposes a convolutional version of CRF (ConvCRF) to reduce the number of connections in Full-CRF. However, [41] uses connections only within small local windows. We argue that long-range connections can provide valuable information, and our CRF formulation uses a wider range of connections while still being efficient. Our formulation allows using multiple PAC filters in parallel, each with different dilation factors. In Fig. 4 (right), we show an illustration of the coverage of two $5\times 5$ PAC filters, with dilation factors 16 and 64 respectively. This allows PAC-CRF to achieve a good trade-off between computational efficiency and long-range pairwise connectivity.
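The neighborhood sizes quoted above follow from the usual relation between kernel size and dilation, $(s-1)\cdot d+1$. A short sketch (the function name is hypothetical):

```python
def effective_window(kernel_size, dilation):
    """Side length of the neighborhood spanned by a dilated square filter."""
    return (kernel_size - 1) * dilation + 1

reach_64 = effective_window(5, 64)  # 5x5 filter with dilation 64
reach_16 = effective_window(5, 16)  # 5x5 filter with dilation 16
connections_per_pixel = 5 * 5       # unchanged by dilation
```

Dilation therefore widens the reach from 5 to 257 pixels per side while each pixel still has only 25 pairwise connections per filter.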
5.2 Semantic segmentation with PAC-CRF
The task of semantic segmentation is to assign a semantic label to each pixel in an image. Full-CRF has proven to be a valuable post-processing tool that can considerably improve CNN segmentation performance [10, 52, 21]. Here, we experiment with PAC-CRF on top of the FCN semantic segmentation network [32]. We choose FCN for simplicity and ease of comparison, as it uses only standard convolution layers and does not have many bells and whistles.
In the experiments, we use scaled RGB color, $[\frac{R}{\sigma_{R}},\frac{G}{\sigma_{G}},\frac{B}{\sigma_{B}}]^{\intercal}$, as the guiding features for the PAC layers in PAC-CRF. The scaling vector $[\sigma_{R},\sigma_{G},\sigma_{B}]^{\intercal}$ is learned jointly with the PAC weights $\mathbf{W}$. We try two internal configurations of PAC-CRF: a single $5\times 5$ PAC kernel with a dilation of 32, and two parallel $5\times 5$ PAC kernels with dilation factors of 16 and 64. We use 5 MF steps for a good balance between speed and accuracy (more details in the appendix). We first freeze the backbone FCN and train only the PAC-CRF part for 40 epochs, and then train the whole network for another 40 epochs with reduced learning rates.
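As a rough illustration of how scaled RGB features drive the adapting kernel, the sketch below evaluates a Gaussian kernel $\exp\{-\frac{1}{2}\|\mathbf{f}_i-\mathbf{f}_j\|^2\}$ on a few hand-picked pixels. The Gaussian form is assumed here, and the scale values and pixel colors are hypothetical; in PAC-CRF the scales are learned rather than set by hand.

```python
import math


def scaled_rgb(rgb, sigma):
    """Guiding feature [R/sigma_R, G/sigma_G, B/sigma_B] for one pixel."""
    return [c / s for c, s in zip(rgb, sigma)]


def gaussian_kernel(f_i, f_j):
    """K(f_i, f_j) = exp(-0.5 * ||f_i - f_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(f_i, f_j))
    return math.exp(-0.5 * sq_dist)


# Hypothetical scales and colors, chosen only for illustration.
sigma = (30.0, 30.0, 30.0)
pixel_a = (200, 120, 40)
pixel_b = (190, 125, 45)   # similar color: kernel stays close to 1
pixel_c = (20, 30, 220)    # very different color: kernel collapses toward 0

k_ab = gaussian_kernel(scaled_rgb(pixel_a, sigma), scaled_rgb(pixel_b, sigma))
k_ac = gaussian_kernel(scaled_rgb(pixel_a, sigma), scaled_rgb(pixel_c, sigma))
print(f"K(a, b) = {k_ab:.3f}, K(a, c) = {k_ac:.3e}")
```

Larger scales flatten the kernel (all neighbors count almost equally), while smaller scales make it more selective, so learning the scaling vector amounts to learning how strongly color differences should suppress pairwise interactions.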
Dataset. We follow the training and validation settings of FCN [32], which is trained on Pascal VOC images and validated on a reduced validation set of 736 images. We also submit our final trained models to the official evaluation server to obtain test scores on 1456 test images.
Baselines. We compare PAC-CRF with three baselines: Full-CRF [25], BCL-CRF [21], and Conv-CRF [41]. For Full-CRF, we use the publicly available C++ code and find the optimal CRF parameters through grid search. For BCL-CRF, we use $1$-neighborhood filters to keep the runtime manageable and use other settings as suggested by the authors. For Conv-CRF, we use the same training procedure as for PAC-CRF. We use the more powerful variant of Conv-CRF with a learnable compatibility transform (referred to as “Conv+C” in [41]), and we learn the RGB scales for Conv-CRF in the same way as for PAC-CRF. We follow the suggested default settings for Conv-CRF and use a filter size of $11\times 11$ and a blurring factor of 4. Note that, like Full-CRF (Eq. 4), the other baselines also use two pairwise kernels.
Table 3

| Method | mIoU (val / test) | CRF Runtime |
| --- | --- | --- |
| Unaries only (FCN) | 65.51 / 67.20 | - |
| Full-CRF [25] | +2.11 / +2.45 | 629 ms |
| BCL-CRF [21] | +2.28 / +2.33 | 2.6 s |
| Conv-CRF [41] | +2.13 / +1.57 | 38 ms |
| PAC-CRF, 32 | +3.01 / +2.21 | 39 ms |
| PAC-CRF, 16-64 | +3.39 / +2.62 | 78 ms |
Results. Tab. 3 reports validation and test mean Intersection over Union (mIoU) scores along with the average runtimes of the different techniques. Our two-filter variant (“PAC-CRF, 16-64”) achieves better mIoU than all baselines on both validation and test, and also compares favorably in terms of runtime. The one-filter variant (“PAC-CRF, 32”) performs slightly worse than Full-CRF and BCL-CRF on the test set, but has an even larger speed advantage, offering a strong option where efficiency is needed. Sample visual results are shown in Fig. 5. While being quantitatively better and retaining more visual detail overall, PAC-CRF produces some noise around boundaries. This is likely due to a known “gridding” effect of dilation [50], which we hope to mitigate in future work.
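Because Tab. 3 lists CRF gains relative to the unary-only FCN, it can help to convert them into absolute scores and runtime ratios. The short script below does that using only the numbers from the table; the helper names are ours. For the two-filter PAC-CRF the resulting absolute scores agree with the "FCN-8s with PAC-CRF 16-64" row of Tab. 4.

```python
BASE_VAL, BASE_TEST = 65.51, 67.20   # "Unaries only (FCN)" row

# (val gain, test gain, CRF runtime in ms), copied from Tab. 3
rows = {
    "Full-CRF":       (2.11, 2.45, 629),
    "BCL-CRF":        (2.28, 2.33, 2600),
    "Conv-CRF":       (2.13, 1.57, 38),
    "PAC-CRF, 32":    (3.01, 2.21, 39),
    "PAC-CRF, 16-64": (3.39, 2.62, 78),
}


def absolute_scores(name):
    """Absolute (val, test) mIoU obtained by adding the gain to the FCN baseline."""
    d_val, d_test, _ = rows[name]
    return round(BASE_VAL + d_val, 2), round(BASE_TEST + d_test, 2)


def speedup_over_full_crf(name):
    """How many times faster the CRF step is than Full-CRF."""
    return rows["Full-CRF"][2] / rows[name][2]


for name in rows:
    val, test = absolute_scores(name)
    print(f"{name:15s} val {val:.2f}  test {test:.2f}  "
          f"{speedup_over_full_crf(name):.1f}x vs Full-CRF")
```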
6 Layer hot-swapping with PAC
So far, we have designed specific architectures around PAC for different use cases. In this section, we offer a strategy for upgrading existing CNNs with PAC through minimal modifications, which we call layer hot-swapping.
Layer hot-swapping. Network fine-tuning has become a common practice when training networks on new data or with additional layers. Typically, in fine-tuning, newly added layers are initialized randomly. Since PAC generalizes standard convolution layers, it can directly replace convolution layers in existing networks while retaining the pre-trained weights. We refer to this modification of existing pre-trained networks as layer hot-swapping.
We continue to use semantic segmentation as an example, and demonstrate how layer hot-swapping can be a simple yet effective modification to existing CNNs. Fig. 6 illustrates an FCN [32] before and after the hot-swapping modifications. We swap out the last CONV layer of each of the last three convolution groups, CONV3_3, CONV4_3, and CONV5_3, with PAC layers of the same configuration (filter size, input and output channels, etc.), and use the output of CONV2_2 as the guiding feature for the PAC layers. This example also demonstrates that one can use features from an earlier layer (CONV2_2 here) as adapting features for PAC. With this strategy, the number of network parameters does not increase when replacing CONV layers with PAC layers. All layer weights are initialized with the trained FCN parameters. To ensure a better starting condition for further training, we scale the guiding features by a small constant (0.0001) so that the PAC layers initially behave very similarly to their original CONV counterparts. We use 8825 images for training, including the Pascal VOC 2011 training images and the additional training samples from [16]. Validation and testing are performed in the same fashion as in Sec. 5.
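The effect of the small feature scale can be seen in a toy setting. The sketch below is a 1D, single-channel illustration with a Gaussian adapting kernel (an assumption made for this example; the signal, filter, and feature values are made up). It shows that with the guiding features scaled by 0.0001 the PAC output is numerically almost identical to the plain filter output, so pre-trained weights remain a valid starting point, while an unscaled guidance signal changes the response near feature edges.

```python
import math


def conv1d(v, w):
    """Zero-padded 1D filtering with an odd-length filter w."""
    r = len(w) // 2
    out = []
    for i in range(len(v)):
        acc = 0.0
        for o in range(-r, r + 1):
            j = i + o
            if 0 <= j < len(v):
                acc += w[o + r] * v[j]
        out.append(acc)
    return out


def pac1d(v, w, f, scale=1.0):
    """Same filter, but each tap is modulated by K(scale * f_i, scale * f_j)."""
    r = len(w) // 2
    out = []
    for i in range(len(v)):
        acc = 0.0
        for o in range(-r, r + 1):
            j = i + o
            if 0 <= j < len(v):
                k = math.exp(-0.5 * (scale * (f[i] - f[j])) ** 2)
                acc += k * w[o + r] * v[j]
        out.append(acc)
    return out


v = [1.0, 2.0, 0.5, -1.0, 3.0]     # toy input signal
w = [0.2, 0.5, 0.3]                # stands in for pre-trained CONV weights
f = [0.0, 0.1, 5.0, 5.2, 0.3]      # toy guiding features with sharp edges

plain = conv1d(v, w)
swapped = pac1d(v, w, f, scale=1e-4)   # right after hot-swapping
adapted = pac1d(v, w, f, scale=1.0)    # guidance acting at full strength

gap_init = max(abs(a - b) for a, b in zip(plain, swapped))
gap_later = max(abs(a - b) for a, b in zip(plain, adapted))
print(f"max |PAC - CONV| with scale 1e-4: {gap_init:.2e}; with scale 1: {gap_later:.2e}")
```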
Results are reported in Tab. 4. Our simple modification (PAC-FCN) provides an improvement of about $2$ mIoU on test ($67.20\to 69.18$) for the semantic segmentation task, while incurring virtually no runtime penalty at inference time (41 ms vs. 39 ms). Note that PAC-FCN has the same number of parameters as the original FCN model. The improvement brought by PAC-FCN is also complementary to any additional CRF post-processing that can still be applied. After combining PAC-FCN with a PAC-CRF (the 16-64 variant) and training them jointly, we observe another $2$ mIoU improvement. Sample visual results are shown in Fig. 5.
Table 4

| Method | PAC-CRF | mIoU (val / test) | Runtime |
| --- | --- | --- | --- |
| FCN-8s | - | 65.51 / 67.20 | 39 ms |
| FCN-8s | 16-64 | 68.90 / 69.82 | 117 ms |
| PAC-FCN | - | 67.44 / 69.18 | 41 ms |
| PAC-FCN | 16-64 | 69.87 / 71.34 | 118 ms |
7 Conclusion
In this work, we propose PAC, a new type of filtering operation that can effectively learn to leverage guidance information. We show that PAC generalizes several popular filtering operations and demonstrate its applicability to use cases ranging from joint upsampling and semantic segmentation networks to efficient CRF inference. PAC generalizes standard spatial convolution and can directly replace standard convolution layers in pre-trained networks for a performance gain with minimal computational overhead.
Acknowledgements
Hang Su and Erik Learned-Miller acknowledge support from AFRL and DARPA (#FA87501820126) and the MassTech Collaborative grant for funding the UMass GPU cluster. The U.S. Gov. is authorized to reproduce and distribute reprints for Gov. purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the AFRL and DARPA or the U.S. Gov.
References
 [1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
 [2] V. Aurich and J. Weule. Nonlinear Gaussian filters performing edge preserving diffusion. In DAGM, pages 538–545. Springer, 1995.
 [3] S. P. Awate and R. T. Whitaker. Higher-order image statistics for unsupervised, information-theoretic, adaptive, image filtering. In Proc. CVPR, volume 2, pages 44–51. IEEE, 2005.
 [4] S. Bako, T. Vogels, B. McWilliams, M. Meyer, J. Novák, A. Harvill, P. Sen, T. Derose, and F. Rousselle. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. ACM Trans. Graph., 36(4):97, 2017.
 [5] J. T. Barron and B. Poole. The fast bilateral solver. In Proc. ECCV, pages 617–632. Springer, 2016.
 [6] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. CVPR, volume 2, pages 60–65. IEEE, 2005.
 [7] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. ECCV, Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
 [8] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In Proc. ECCV, pages 402–418. Springer, 2016.
 [9] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. In Proc. CVPR, pages 4545–4554, 2016.
 [10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. PAMI, 40(4):834–848, 2018.
 [11] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In Proc. ICML, pages 1785–1794, 2015.
 [12] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv:1703.06211, 2017.
 [13] R. Gadde, V. Jampani, M. Kiefel, D. Kappler, and P. V. Gehler. Superpixel convolutional networks using bilateral inceptions. In Proc. ECCV, pages 597–613. Springer, 2016.
 [14] E. S. Gastal and M. M. Oliveira. Domain transform for edge-aware image and video processing. ACM Trans. Graph., 30(4):69, 2011.
 [15] B. Ham, M. Cho, and J. Ponce. Robust image filtering using joint static and dynamic guidance. In Proc. CVPR, pages 4823–4831, 2015.
 [16] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Proc. ICCV, 2011.
 [17] A. W. Harley, K. G. Derpanis, and I. Kokkinos. Segmentation-aware convolutional networks using local attention masks. In Proc. ICCV, 2017.
 [18] K. He, J. Sun, and X. Tang. Guided image filtering. PAMI, 35(6):1397–1409, 2013.
 [19] T.-W. Hui, C. C. Loy, and X. Tang. Depth map super-resolution by deep multi-scale guidance. In Proc. ECCV, pages 353–369. Springer, 2016.
 [20] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In Proc. CVPR, 2017.
 [21] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In Proc. CVPR, pages 4452–4461, 2016.
 [22] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Proc. NIPS, pages 667–675, 2016.
 [23] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 [24] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. ACM Trans. Graph., 26(3):96, 2007.
 [25] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Proc. NIPS, pages 109–117, 2011.
 [26] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In Proc. ICML, pages 513–521, 2013.
 [27] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In Proc. ECCV, pages 154–169. Springer, 2016.
 [28] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Joint image filtering with deep convolutional networks. PAMI, 2018.
 [29] Y. Li and R. Zemel. Mean field networks. arXiv:1410.5884, 2014.
 [30] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proc. CVPR, pages 3194–3203, 2016.
 [31] S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz. Learning affinity via spatial propagation networks. In Proc. NIPS, 2017.
 [32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, pages 3431–3440, 2015.
 [33] S. Paris, P. Kornprobst, J. Tumblin, F. Durand, et al. Bilateral filtering: Theory and applications. Foundations and Trends® in Computer Graphics and Vision, 4(1):1–73, 2009.
 [34] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, and A. Ku. Image transformer. arXiv:1802.05751, 2018.
 [35] J.-H. Rick Chang and Y.-C. Frank Wang. Propagated image filtering. In Proc. CVPR, pages 10–18, 2015.
 [36] F. Saeedan, N. Weber, M. Goesele, and S. Roth. Detail-preserving pooling in deep networks. In Proc. CVPR, pages 9108–9116, 2018.
 [37] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proc. ECCV, 2006.
 [38] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. ECCV, pages 746–760. Springer, 2012.
 [39] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In Proc. CVPR, 2018.
 [40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, pages 8934–8943, 2018.
 [41] M. T. T. Teichmann and R. Cipolla. Convolutional CRFs for semantic segmentation. arXiv:1805.04777, 2018.
 [42] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. ICCV, 1998.
 [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NIPS, pages 5998–6008, 2017.
 [44] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proc. CVPR, 2018.
 [45] N. Weber, M. Waechter, S. C. Amend, S. Guthe, and M. Goesele. Rapid, detail-preserving image downscaling. ACM Trans. Graph., 35(6):205, 2016.
 [46] H. Wu, S. Zheng, J. Zhang, and K. Huang. Fast end-to-end trainable guided filter. In Proc. CVPR, pages 1838–1847, 2018.
 [47] J. Wu, D. Li, Y. Yang, C. Bajaj, and X. Ji. Dynamic sampling convolutional neural networks. arXiv:1803.07624, 2018.
 [48] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Proc. NIPS, pages 91–99, 2016.
 [49] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122, 2015.
 [50] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Proc. CVPR, pages 472–480, 2017.
 [51] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv:1805.08318, 2018.
 [52] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proc. ICCV, pages 1529–1537, 2015.
Appendix
In the appendix, we provide additional details and results on the deep joint upsampling experiments (Sec. A) and PAC-CRF (Sec. B).
Appendix A Deep Joint Upsampling with PAC
Network architecture.
Here we provide details of the network architectures used in the joint upsampling experiments. Our networks have three branches: Encoder, Guidance, and Decoder. The layers in each branch of the joint depth upsampling networks are listed in Tab. 5. Since each PAC$^{\intercal}$ layer performs $2\times$ upsampling, the $4\times$, $8\times$, and $16\times$ networks require 2, 3, and 4 PAC$^{\intercal}$ layers, respectively. The final output of the guidance branch is divided equally along the channel dimension for use as adapting features for the PAC$^{\intercal}$ layers in the decoder. All CONV and PAC$^{\intercal}$ layers use $5\times 5$ filters and are followed by ReLU, except for the last CONV. We use Gaussian kernels for $K$ in all PAC$^{\intercal}$ layers.
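The layer count and the channel split described above can be written down in a few lines. This is only an illustrative sketch: the function names are ours, and the channel widths in the usage loop (32, 48, 64) are the final guidance-branch widths of the standard variant as listed in Tab. 5.

```python
def num_pac_t_layers(factor: int) -> int:
    """Number of 2x PAC-transpose layers needed for a power-of-two factor."""
    if factor < 2 or factor & (factor - 1):
        raise ValueError("upsampling factor must be a power of two >= 2")
    return factor.bit_length() - 1


def split_guidance(channels: int, n_layers: int):
    """Equal (start, stop) channel slices, one per PAC-transpose layer."""
    if channels % n_layers:
        raise ValueError("channels must divide evenly among the layers")
    step = channels // n_layers
    return [(i * step, (i + 1) * step) for i in range(n_layers)]


for factor, channels in ((4, 32), (8, 48), (16, 64)):
    n = num_pac_t_layers(factor)
    print(f"{factor}x: {n} PAC-transpose layers, guidance slices {split_guidance(channels, n)}")
```

In the standard variant each decoder layer therefore receives 16 guidance channels regardless of the upsampling factor, which is why the final guidance width grows from 32 to 64.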
We design two variants of our model, standard and lite. The standard variant has a simpler design, but its number of parameters varies with the upsampling factor, and overall it consumes more memory than DJF [27], a previous state-of-the-art approach to joint depth upsampling. For the lite variant, we reduce the number of filters so that the networks roughly match DJF in the number of parameters.
Similar network architectures are also used for optical flow upsampling. The first layer of the encoder and the last layer of the decoder are modified to fit the two $(u,v)$ channels of optical flow instead of the single channel of depth maps, i.e., using “C2” instead of “C1” in Tab. 5.
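Table 5

| Branch | standard 4$\times$ | standard 8$\times$ | standard 16$\times$ | lite 4$\times$ | lite 8$\times$ | lite 16$\times$ |
| --- | --- | --- | --- | --- | --- | --- |
| Encoder | C32 | C32 | C32 | C12 | C12 | C8 |
| | C32 | C32 | C32 | C16 | C16 | C16 |
| | C32 | C32 | C32 | C22 | C16 | C16 |
| Guidance | C32 | C32 | C32 | C12 | C12 | C8 |
| | C32 | C32 | C32 | C22 | C16 | C16 |
| | C32 | C48 | C64 | C24 | C36 | C40 |
| Decoder | P32 | P32 | P32 | P12 | P12 | P8 |
| | P32 | P32 | P32 | P16 | P16 | P16 |
| | C32 | P32 | P32 | C22 | P16 | P16 |
| | C1 | C32 | P32 | C1 | C20 | P16 |
| | | C1 | C32 | | C1 | C16 |
| | | | C1 | | | C1 |
| #Params | 183K | 222K | 260K | 56K | 56K | 56K |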
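One way to sanity-check the #Params row is to count weights and biases directly. The sketch below assumes (none of this is stated explicitly in the text) that the encoder takes a 1-channel depth map, the guidance branch takes a 3-channel RGB image, every layer has a bias term, and a PAC$^{\intercal}$ layer holds as many parameters as a CONV layer of the same shape. Under those assumptions the totals round to the values listed for all six networks.

```python
def layer_params(c_in, c_out, k=5):
    """Weights plus biases of one k x k layer (CONV or, by assumption, PAC-transpose)."""
    return k * k * c_in * c_out + c_out


def branch_params(c_in, widths):
    """Total parameters of a chain of layers with the given output widths."""
    total = 0
    for c_out in widths:
        total += layer_params(c_in, c_out)
        c_in = c_out
    return total


def network_params(encoder, guidance, decoder, depth_ch=1, guide_ch=3):
    # The decoder consumes the encoder output, so its input width is encoder[-1].
    return (branch_params(depth_ch, encoder)
            + branch_params(guide_ch, guidance)
            + branch_params(encoder[-1], decoder))


# Output widths read off the architecture table (C and P entries alike).
networks = {
    "standard 4x":  ([32, 32, 32], [32, 32, 32], [32, 32, 32, 1]),
    "standard 8x":  ([32, 32, 32], [32, 32, 48], [32, 32, 32, 32, 1]),
    "standard 16x": ([32, 32, 32], [32, 32, 64], [32, 32, 32, 32, 32, 1]),
    "lite 4x":      ([12, 16, 22], [12, 22, 24], [12, 16, 22, 1]),
    "lite 8x":      ([12, 16, 16], [12, 16, 36], [12, 16, 16, 20, 1]),
    "lite 16x":     ([8, 16, 16], [8, 16, 40], [8, 16, 16, 16, 16, 1]),
}
counts = {name: network_params(*spec) for name, spec in networks.items()}
for name, n in counts.items():
    print(f"{name:12s} {n:7d} parameters (~{round(n / 1000)}K)")
```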
Additional examples.
Appendix B Conditional Random Fields
Interpretations of the formulation.
The pairwise potentials in Full-CRF are defined as $\psi_{p}(l_{i},l_{j}\mid I)=\mu(l_{i},l_{j})K(\mathbf{f}_{i},\mathbf{f}_{j})$, where the kernel function $K$ has two terms, an appearance kernel and a smoothness kernel:
$\begin{aligned}K(\mathbf{f}_{i},\mathbf{f}_{j})={}&w_{1}\underbrace{\exp\Big\{-\frac{\|\mathbf{p}_{i}-\mathbf{p}_{j}\|^{2}}{2\theta_{\alpha}^{2}}-\frac{\|I_{i}-I_{j}\|^{2}}{2\theta_{\beta}^{2}}\Big\}}_{\text{appearance kernel}}\\&+w_{2}\underbrace{\exp\Big\{-\frac{\|\mathbf{p}_{i}-\mathbf{p}_{j}\|^{2}}{2\theta_{\gamma}^{2}}\Big\}}_{\text{smoothness kernel}}\end{aligned}$  (8)
In comparison, our pairwise potential uses (assuming a Gaussian kernel and a single pairwise term)
$\begin{aligned}K'(\mathbf{f}_{i},\mathbf{f}_{j})&=\mathbf{W}[\mathbf{p}_{j}-\mathbf{p}_{i}]\,K(\mathbf{f}_{i},\mathbf{f}_{j})\\&=\mathbf{W}[\mathbf{p}_{j}-\mathbf{p}_{i}]\exp\Big\{-\frac{1}{2}\|\mathbf{f}_{i}-\mathbf{f}_{j}\|^{2}\Big\}\end{aligned}$  (9)
There are two major differences:

1. The smoothness kernel is now moved out of $K$ and is represented by the filter $\mathbf{W}$. It can still be initialized as a Gaussian, but an arbitrary filter is allowed to be learned.
2. The appearance kernel now operates on $\mathbf{f}$ directly, without the need to decompose it into multiple parts and without the individual scaling factors ($\theta_{\alpha},\dots$).
Both changes give the pairwise potential more learning capacity. Note that $\mathbf{f}$ can be the output of other network layers. A simple linear layer can learn appropriate scaling factors, while in other cases a more complex network may be preferred. For inputs with more than RGB channels (e.g., 3D data with color, depth, normal, curvature, etc.), hand-crafting and finding parameters for kernel functions like Eq. 8 can be time-consuming and suboptimal, and allowing the function to be learned from data in an end-to-end fashion is particularly desirable.
Note that in Eq. 9, $\mathbf{W}$ is a 2D matrix, and the corresponding pairwise potential is defined as
$\psi_{p}(l_{i},l_{j})=\mu(l_{i},l_{j})\,\mathbf{W}[\mathbf{p}_{j}-\mathbf{p}_{i}]\,K(\mathbf{f}_{i},\mathbf{f}_{j})$  (10)
where $\mu(l_{i},l_{j})$ is the compatibility matrix. Our final pairwise potential, $\psi_{p}(l_{i},l_{j})=K(\mathbf{f}_{i},\mathbf{f}_{j})\,\mathbf{W}_{l_{j}l_{i}}[\mathbf{p}_{j}-\mathbf{p}_{i}]$, can be seen as a further step of generalization, where $\mathbf{W}$ is now a 4D tensor. Intuitively, this formulation allows the label compatibility pattern to vary spatially across different pixel locations. Eq. 10 can be seen as a special case that factorizes the 4D tensor as the product of two 2D matrices.
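To make the factorization remark concrete, the sketch below builds the 4D tensor that corresponds to Eq. 10 from a compatibility matrix and a 2D spatial filter. The Potts-style matrix `mu`, the $3\times 3$ filter values, the index order of `w4`, and the function name are all illustrative assumptions, not values from the paper.

```python
def factorized_filter(mu, w2d):
    """Build W4[l_j][l_i][dy][dx] = mu[l_i][l_j] * w2d[dy][dx] (the Eq. 10 special case)."""
    n = len(mu)
    return [[[[mu[li][lj] * w for w in row] for row in w2d]
             for li in range(n)]
            for lj in range(n)]


# Hypothetical 3-label Potts compatibility (penalize differing labels only)
# and a small 3x3 spatial filter.
n_labels, k = 3, 3
mu = [[0.0 if a == b else 1.0 for b in range(n_labels)] for a in range(n_labels)]
w2d = [[0.05, 0.10, 0.05],
       [0.10, 0.40, 0.10],
       [0.05, 0.10, 0.05]]

w4 = factorized_filter(mu, w2d)

# Free values: the factorized form stores mu and w2d separately, whereas a
# general 4D tensor has one value per (l_j, l_i, dy, dx) combination.
factorized_count = n_labels * n_labels + k * k
general_count = n_labels * n_labels * k * k
print(f"factorized: {factorized_count} values, general 4D tensor: {general_count} values")
```

In the factorized case every label pair shares the same spatial pattern up to a scalar; learning the full tensor removes that restriction, so the compatibility between two labels can depend on the relative position of the two pixels.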
Mean-field inference derivation.
We start from the mean-field update equation for general pairwise CRFs, Eq. 11. A detailed derivation can be found in Koller and Friedman [23, Chapter 11.5].
$Q_{i}(l)=\frac{1}{Z_{i}}\exp\Big\{-\psi_{u}(l)-\sum_{j\in\Omega(i)}\mathbf{E}_{l_{j}\sim Q_{j}}\psi_{p}(l,l_{j})\Big\}$  (11)
Considering that we use multiple neighborhoods (with different dilation factors) in parallel, the update equation becomes
$Q_{i}(l)=\frac{1}{Z_{i}}\exp\Big\{-\psi_{u}(l)-\sum_{k}\sum_{j\in\Omega^{k}(i)}\mathbf{E}_{l_{j}\sim Q_{j}}\psi_{p}^{k}(l,l_{j})\Big\}$  (12)
Substituting the pairwise potential with
$\psi_{p}^{k}(l_{i},l_{j})=K^{k}(\mathbf{f}_{i},\mathbf{f}_{j})\,\mathbf{W}^{k}_{l_{j}l_{i}}[\mathbf{p}_{j}-\mathbf{p}_{i}]$  (13)
the update rule becomes
$\begin{aligned}Q_{i}(l)&=\frac{1}{Z_{i}}\exp\Big\{-\psi_{u}(l)-\sum_{k}\sum_{j\in\Omega^{k}(i)}\mathbf{E}_{l_{j}\sim Q_{j}}\big\{K^{k}(\mathbf{f}_{i},\mathbf{f}_{j})\,\mathbf{W}^{k}_{l_{j}l}[\mathbf{p}_{j}-\mathbf{p}_{i}]\big\}\Big\}\\&=\frac{1}{Z_{i}}\exp\Big\{-\psi_{u}(l)-\sum_{k}\sum_{l'\in\mathcal{L}}\sum_{j\in\Omega^{k}(i)}K^{k}(\mathbf{f}_{i},\mathbf{f}_{j})\,\mathbf{W}^{k}_{l'l}[\mathbf{p}_{j}-\mathbf{p}_{i}]\,Q_{j}(l')\Big\}\end{aligned}$  (14)
Using Eq. 14 in an iterative fashion leads to the final update rule of mean-field inference:
$Q_{i}^{(t+1)}(l)\leftarrow\frac{1}{Z_{i}}\exp\Big\{-\psi_{u}(l)-\sum_{k}\underbrace{\sum_{l'\in\mathcal{L}}\sum_{j\in\Omega^{k}(i)}K^{k}(\mathbf{f}_{i},\mathbf{f}_{j})\,\mathbf{W}^{k}_{l'l}[\mathbf{p}_{j}-\mathbf{p}_{i}]\,Q_{j}^{(t)}(l')}_{\text{PAC}}\Big\}$  (15)
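The final update rule can be sketched in plain Python on a 1D pixel grid. This is a toy illustration rather than the authors' implementation: it assumes scalar guiding features, a Gaussian kernel, unary and pairwise terms entering the exponent with negative signs, and it skips the center tap $j=i$ (our assumption). All names and numeric values are hypothetical.

```python
import math


def softmax_neg(energies):
    """Normalize exp(-energy) over labels (this plays the role of 1/Z_i)."""
    m = min(energies)
    exps = [math.exp(-(e - m)) for e in energies]
    z = sum(exps)
    return [x / z for x in exps]


def mean_field_step(Q, unary, feats, filters):
    """One mean-field update on a 1D grid.

    Q[i][l]     current marginals; unary[i][l] holds the unary potentials.
    feats[i]    scalar guiding feature of pixel i.
    filters     list of (dilation, W) pairs with W[tap][l_prime][l].
    """
    n, n_labels = len(Q), len(Q[0])
    new_Q = []
    for i in range(n):
        energy = list(unary[i])
        for dilation, W in filters:                 # sum over kernels k
            radius = len(W) // 2
            for o in range(-radius, radius + 1):    # neighbors j in the window
                j = i + o * dilation
                if o == 0 or not 0 <= j < n:        # assumption: no self-connection
                    continue
                K = math.exp(-0.5 * (feats[i] - feats[j]) ** 2)
                for l in range(n_labels):
                    for lp in range(n_labels):      # sum over l'
                        energy[l] += K * W[o + radius][lp][l] * Q[j][lp]
        new_Q.append(softmax_neg(energy))
    return new_Q


# Toy problem: 4 pixels, 2 labels, one 3-tap filter with Potts-style weights.
potts = [[0.0 if a == b else 1.0 for b in range(2)] for a in range(2)]
W = [potts, potts, potts]
unary = [[0.2, 1.5], [0.9, 1.0], [1.6, 0.1], [1.4, 0.3]]
feats = [0.0, 0.1, 3.0, 3.1]          # feature edge between pixels 1 and 2

Q0 = [softmax_neg(u) for u in unary]  # initialize from the unaries
Q = Q0
for _ in range(5):                    # 5 steps, as used in the experiments
    Q = mean_field_step(Q, unary, feats, [(1, W)])
print("pixel 1, label 0:", round(Q0[1][0], 3), "->", round(Q[1][0], 3))
```

Pixel 1 has nearly ambiguous unaries, but its feature is close to that of pixel 0 and far from that of pixel 2, so the update pulls it toward the label preferred by pixel 0 while the feature edge blocks influence from the other side.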
Mean-field inference steps.
Tab. 6 shows how mIoU changes with the number of mean-field steps. We use 5 steps for all other experiments in the paper.

Table 6

| Mean-field steps | 1 | 3 | 5 | 7 |
| --- | --- | --- | --- | --- |
| mIoU | 68.38 | 68.72 | 68.90 | 68.90 |
| time | 19 ms | 49 ms | 78 ms | 109 ms |
On the contribution of dilation.
Just like standard convolution, PAC supports dilation to increase the receptive field without increasing the number of parameters. PAC-CRF leverages this capability to allow long-range connections. For a similar purpose, Conv-CRF applies Gaussian blur to the pairwise potentials to increase the receptive field. To quantify the improvements due to dilation, we try another baseline in which we add dilation to Conv-CRF. The improved performance (+2.13/+1.57 $\to$ +2.50/+1.91) validates that dilation is indeed an important ingredient, while the remaining gap shows that the PAC formulation is essential to the full gain.