On Single Source Robustness in Deep Fusion Models

  • 2019-06-11 16:47:56
  • Taewan Kim, Joydeep Ghosh
  • 2

Abstract

Algorithms that fuse multiple input sources benefit from both complementary and shared information. Shared information may provide robustness to faulty or noisy inputs, which is indispensable for safety-critical applications like self-driving cars. We investigate learning fusion algorithms that are robust against noise added to a single source. We first demonstrate that robustness against single source noise is not guaranteed in a linear fusion model. Motivated by this discovery, two possible approaches are proposed to increase robustness: a carefully designed loss with corresponding training algorithms for deep fusion models, and a simple convolutional fusion layer that has a structural advantage in dealing with noise. Experimental results show that both training algorithms and our fusion layer make a deep fusion-based 3D object detector robust against noise applied to a single source, while preserving the original performance on clean data.

 


Taewan Kim
The University of Texas at Austin
Austin, TX 78712

Joydeep Ghosh
The University of Texas at Austin
Austin, TX 78712

Preprint. Under review.

1 Introduction

Deep learning models have accomplished superior performance in several machine learning problems (LeCun et al., 2015) including object recognition (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017), object detection (Ren et al., 2015; He et al., 2017; Dai et al., 2016; Redmon et al., 2016; Liu et al., 2016; Redmon and Farhadi, 2017) and speech recognition (Hinton et al., 2012; Graves et al., 2013; Sainath et al., 2013; Chorowski et al., 2015; Chan et al., 2016; Chiu et al., 2018), which use either visual or audio sources. One natural way of improving a model’s performance is to make use of multiple input sources relevant to a given task so that enough information can be extracted to build strong features. Therefore, deep fusion models have recently attracted considerable attention for autonomous driving (Kim and Ghosh, 2016; Chen et al., 2017; Qi et al., 2018; Ku et al., 2018), medical imaging (Kiros et al., 2014; Wu et al., 2013; Simonovsky et al., 2016; Liu et al., 2015), and audio-visual speech recognition (Huang and Kingsbury, 2013; Mroueh et al., 2015; Sui et al., 2015; Chung et al., 2017).

Two benefits are expected when fusion-based learning models are selected for a given problem. First, given adequate data, more information from multiple sources can enrich the model’s feature space to achieve higher prediction performance, especially when different input sources provide complementary information to the model. This expectation coincides with a simple information-theoretic fact: if we have multiple input sources $X_1, \ldots, X_m$ and a target variable $Y$, the mutual information $I(\cdot\,;\cdot)$ obeys $I(Y; X_1, \ldots, X_m) \geq I(Y; X_i)$ for all $i \in [m]$.

The second expected advantage is increased robustness against single source faults, which is the primary concern of our work. An underlying intuition comes from the fact that different sources may have shared information so one sensor can partially compensate for others. This type of robustness is critical in real-world fusion models, because each source may be exposed to different types of corruption but not at the same time. For example, LIDARs used in autonomous vehicles work fine at night whereas RGB cameras do not. Also, each source used in the model may have its own sensing device, and hence not necessarily be corrupted by some physical attack simultaneously with others. It would be ideal if the structure of machine learning based fusion models and shared information could compensate for the corruption and automatically guarantee robustness without additional steps.

This paper shows that a fusion model needs a supplementary strategy and a specialized structure to avoid vulnerability to noise or corruption on a single source. Our contributions are as follows:

  • We show that a fusion model learned with a standard robust training approach is not guaranteed to provide robustness against noise on a single source. Inspired by the analysis, a novel loss is proposed to achieve the desired robustness (Section 3).

  • Two efficient training algorithms for minimizing our loss in deep fusion models are devised to ensure robustness without impacting performance on clean data (Section 4.1).

  • We introduce a simple but effective fusion layer that naturally reduces error by applying ensembling to latent convolutional features (Section 4.2).

We apply our loss and the fusion layer to a complex deep fusion-based 3D object detector used in autonomous driving for further investigation in practice. Note that our findings can be easily generalized to other applications exhibiting intermittent defects in a subset of input sources.

2 Related Works

Deep fusion models have been actively studied in object detection for autonomous vehicles. There exist two major streams classified according to their algorithmic structures: two-stage detectors with R-CNN (Region-based Convolutional Neural Networks) technique (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; Dai et al., 2016; He et al., 2017), and single stage detectors for faster inference speed (Redmon et al., 2016; Redmon and Farhadi, 2017; Liu et al., 2016).

Earlier deep fusion models extended Fast R-CNN (Girshick, 2015) to provide better quality of region proposals from multiple sources (Kim and Ghosh, 2016; Braun et al., 2016). With a high-resolution LIDAR, point cloud was used as a major source of the region proposal stage before the fusion step (Du et al., 2017), whereas F-PointNet (Qi et al., 2018) used it for validating 2D proposals from RGB images and predicting 3D shape and location within the visual frustum. MV3D (Chen et al., 2017) extended the idea of region proposal network (RPN) (Ren et al., 2015) by generating proposals from RGB image, and LIDAR’s front view and BEV (bird’s eye view) maps. Recent works tried to remove region proposal stages for faster inference and directly fused LIDAR’s front view depth image (Kim et al., 2018b) or BEV image (Wang et al., 2018) with RGB images. ContFuse (Liang et al., 2018) utilizes both RGB and LIDAR’s BEV images with a new continuous fusion scheme, which is further improved in MMF (Liang et al., 2019) by handling multiple tasks at once. Our experimental results are based on AVOD (Ku et al., 2018), a recent open-sourced 3D object detector that generates region proposals from RPN using RGB and LIDAR’s BEV images.

Compared to the active efforts toward higher performance on clean data, very few works, to the best of our knowledge, have focused on robust learning methods in multi-source settings. Adaptive fusion methods using gating networks weight the importance of each source automatically (Mees et al., 2016; Valada et al., 2017), but these works lack in-depth studies of the robustness against single source faults. A recent work proposed a gated fusion at the feature level and applied data augmentation techniques with randomly chosen corruption methods (Kim et al., 2018a). In contrast, our training algorithms are surrogate minimization schemes for the proposed loss function, which is grounded in our analyses of the underlying weakness of fusion methods. Also, the fusion layer proposed in this paper focuses more on how to mix convolutional feature maps channel-wise with simple trainable procedures. For extensive literature reviews, please refer to the recent survey papers about deep multi-modal learning methods in general (Ramachandram and Taylor, 2017) and for autonomous driving (Feng et al., 2019).

3 Single Source Robustness of Fusion Models

3.1 Regression on linear fusion data

To show the vulnerability of naive fusion models, we introduce a simple data model and a fusion algorithm. Suppose $y$ is a linear function consisting of three different inherent (latent) components $z_i \in \mathbb{R}^{d_i}$ ($i \in \{1,2,3\}$). There are two input sources, $x_1$ and $x_2$. Here the $\psi$’s are unknown functions.

$$y = \sum_{i=1}^{3} \beta_i^T z_i, \quad \text{where } z_1 = \psi_1(x_1),\ z_2 = \psi_2(x_2),\ z_3 = \psi_{3,1}(x_1) = \psi_{3,2}(x_2) \qquad (1)$$

Our simple data model simulates a target variable y relevant to two different sources, where each source has its own special information z1 and z2 and a shared one z3. For example, if two sources are obtained from an RGB camera and a LIDAR sensor, one can imagine that any features related to objectness are captured in z3 whereas colors and depth information may be located in z1 and z2, respectively. Our objective is to build a regression model by effectively incorporating information from the sources (x1,x2) to predict the target variable y.

Now, consider a fairly simple setting $x_1 = [z_1; z_3] \in \mathbb{R}^{d_1+d_3}$ and $x_2 = [z_2; z_3] \in \mathbb{R}^{d_2+d_3}$, where $(\psi_1, \psi_2, \psi_{3,1}, \psi_{3,2})$ can be defined accordingly to satisfy (1). A straightforward fusion approach is to stack the sources, i.e., $x = [x_1; x_2] \in \mathbb{R}^{d_1+d_2+2d_3}$, and learn a linear model. Then, it is easy to show that there exists a feasible error-free model for noise-free data:

$$f_{\text{direct}}(x_1, x_2) = h_1^T x_1 + h_2^T x_2 = (\beta_1^T z_1 + g_1^T z_3) + (\beta_2^T z_2 + g_2^T z_3), \quad \text{s.t. } g_1 + g_2 = \beta_3 \qquad (2)$$

where $h_1 = [\beta_1; g_1]$ and $h_2 = [\beta_2; g_2]$. The parameter vectors responsible for the shared information $z_3$ are denoted by $g_1$ and $g_2$ (see Footnote 1).

Footnote 1: In practice, $Y = [X_1, X_2] \begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$ has to be solved for $X_1 \in \mathbb{R}^{n \times (d_1+d_3)}$, $X_2 \in \mathbb{R}^{n \times (d_2+d_3)}$, and $Y \in \mathbb{R}^{n}$ with a sufficiently large number $n$ of data samples. Then a standard least squares solution using a pseudo-inverse gives $h_1 = [\beta_1; \beta_3/2]$, $h_2 = [\beta_2; \beta_3/2]$. This is equivalent to the solution robust against random noise added to all the sources at once, which is vulnerable to single source faults (Section 3.2).

Suppose the true parameters of the data satisfy $\|\beta_1\|_2 \approx \|\beta_2\|_2$ and $\|\beta_3\|_2 \gg \|\beta_1\|_2$. Assume that the obtained solution’s parameters for $z_3$ are unbalanced, i.e., $g_1 = \Delta$ and $g_2 = \beta_3 - \Delta$ for some weight vector $\Delta$ with a small norm. Then adding noise to the source $x_2$ significantly corrupts the prediction, while the model remains relatively robust to noise on $x_1$, because $|(\beta_3 - \Delta)^T \epsilon_3| \gg |\Delta^T \epsilon_3|$ for any noise $\epsilon_3$ affecting $z_3$. This simple example illustrates that additional training strategies or components are indispensable for a robust fusion model that keeps working even if one of the sources is disturbed. The next section introduces a novel loss for balanced robustness against a fault in a single source.
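The imbalance described above can be checked numerically. The following is a minimal sketch, not taken from our experiments: the dimensions, the values of $\beta_i$ and $\Delta$, the standard normal latent components, and the helper name `noisy_mse` are all assumptions chosen for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def noisy_mse(h1, h2, betas, noisy_source, sigma=1.0, n=5000, seed=0):
    """Monte Carlo estimate of E[(y - f_direct)^2] with noise on one source (0 = clean)."""
    rng = random.Random(seed)
    b1, b2, b3 = betas
    total = 0.0
    for _ in range(n):
        z1 = [rng.gauss(0.0, 1.0) for _ in b1]
        z2 = [rng.gauss(0.0, 1.0) for _ in b2]
        z3 = [rng.gauss(0.0, 1.0) for _ in b3]
        y = dot(b1, z1) + dot(b2, z2) + dot(b3, z3)
        x1, x2 = z1 + z3, z2 + z3  # list concatenation: x_i = [z_i; z_3]
        if noisy_source == 1:
            x1 = [v + rng.gauss(0.0, sigma) for v in x1]
        elif noisy_source == 2:
            x2 = [v + rng.gauss(0.0, sigma) for v in x2]
        total += (y - dot(h1, x1) - dot(h2, x2)) ** 2
    return total / n

# Hypothetical parameters: the shared component dominates, and g1 = Delta is tiny.
b1, b2, b3 = [0.1], [0.1], [2.0, -1.0]
delta = [0.05, 0.0]
g1 = delta
g2 = [b - d for b, d in zip(b3, delta)]  # g1 + g2 = beta_3, so clean error is zero
h1, h2 = b1 + g1, b2 + g2                # h_i = [beta_i; g_i]

err_clean = noisy_mse(h1, h2, (b1, b2, b3), 0)
err_x1 = noisy_mse(h1, h2, (b1, b2, b3), 1)
err_x2 = noisy_mse(h1, h2, (b1, b2, b3), 2)
print(err_clean, err_x1, err_x2)
```

Under these assumed values the model is error-free on clean data, while the error under noise on $x_2$ is orders of magnitude larger than under noise on $x_1$, which is the unbalanced behavior the example is meant to expose.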

3.2 Robust learning for single source noise

Fusion methods are not guaranteed to provide robustness against faults in a single source without additional supervision. Later in this section, we also demonstrate that naive regularization or standard robust learning methods are not sufficient for this robustness. Therefore, training needs a supplementary constraint or strategy that correctly guides the parameters toward the desired robustness.

One essential requirement of fusion models is balanced performance regardless of which source is corrupted. If the model is significantly vulnerable to corruption in one source, it becomes untrustworthy, so we need to balance the degradation caused by faults in different input sources. For example, suppose a model is robust against noise in the RGB channels but degrades heavily under any fault in the LIDAR. Then the overall system should be considered untrustworthy, because certain corruptions or environments can consistently fool the model. Our loss, MaxSSN (Maximum Single Source Noise), is introduced to handle this issue, and further analyses are provided under the linear fusion data model explained in Section 3.1. This loss makes the model focus on corruption of a single source (SSN) rather than on noise added to all the sources at once (ASN).

Definition 1.

For multiple sources $x_1, \ldots, x_{n_s}$ and a target variable $y$, denote a predefined loss function by $\mathcal{L}$. If each source $x_i$ is perturbed with some additive noise $\epsilon_i$ for $i \in [n_s]$, the MaxSSN loss for a model $f$ is defined as follows:

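$$\mathcal{L}_{\text{MaxSSN}}(f, \epsilon) \triangleq \max_i \left\{ \mathcal{L}\big(y, f(x_1, \ldots, x_{i-1}, x_i + \epsilon_i, x_{i+1}, \ldots, x_{n_s})\big) \right\}_{i=1}^{n_s}$$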
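Definition 1 translates almost directly into code. The sketch below is illustrative only: the function name `max_ssn_loss`, the list-based representation of sources, and the toy model and loss in the usage lines are assumptions, not part of our implementation.

```python
def max_ssn_loss(f, loss, y, sources, noises):
    """Return (MaxSSN value, index of the worst source).

    sources[i] and noises[i] are equal-length lists of floats; exactly one
    source is perturbed per evaluation, the others stay clean.
    """
    per_source = []
    for i, eps in enumerate(noises):
        perturbed = list(sources)
        perturbed[i] = [v + e for v, e in zip(sources[i], eps)]
        per_source.append(loss(y, f(perturbed)))
    worst = max(range(len(per_source)), key=per_source.__getitem__)
    return per_source[worst], worst

# Toy usage: the "model" sums every entry of every source, with squared loss.
toy_model = lambda srcs: sum(sum(s) for s in srcs)
squared = lambda y, pred: (y - pred) ** 2
value, worst = max_ssn_loss(toy_model, squared, 6.0,
                            [[1.0, 2.0], [3.0]],
                            [[0.1, 0.0], [-0.5]])
print(value, worst)
```

Returning the index of the worst source alongside the value is a design choice made here so that a training loop can revisit the corresponding corrupted sample.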

Another key principle in our robust training is to retain the model’s performance on clean data. Although techniques like data augmentation generally help improve a model’s generalization error, learning a model robust against certain types of perturbation, including adversarial attacks, may harm the model’s accuracy on non-corrupt data (Tsipras et al., 2019). Deterioration of the model’s performance on normal data is an unwanted side effect, and hence our approach aims to avoid it.

Random noise

To investigate the importance of our MaxSSN loss, we revisit the linear fusion data model with the optimal direct fusion model $f_{\text{direct}}$ of the regression problem introduced in Section 3.1. Suppose the objective is to find a model with robustness against single source noise, while preserving error-free performance, i.e., unchanged loss under clean data. For the noise model, consider $\epsilon = [\delta_1; \delta_2]$ where $\delta_1 = [\epsilon_1; \epsilon_3]$ and $\delta_2 = [\epsilon_2; \epsilon_4]$, which satisfy $\mathbb{E}[\epsilon_i] = 0$, $\mathrm{Var}(\epsilon_i) = \sigma^2 I$, and $\mathbb{E}[\epsilon_i \epsilon_j^T] = 0$ for $i \neq j$. Note that the noises added to the shared information, $\epsilon_3$ and $\epsilon_4$, are not identical, which resembles direct perturbation to the input sources in practice. For example, noise directly affecting a camera lens does not need to perturb other sources.

Optimal fusion model for MaxSSN

The robust linear fusion model $f(x_1, x_2) = (w_1^T z_1 + g_1^T z_3) + (w_2^T z_2 + g_2^T z_3)$ is found by minimizing $\mathcal{L}_{\text{MaxSSN}}(f, \epsilon)$ over the parameters $w_1$, $w_2$, $g_1$ and $g_2$. As shown in the previous section, any $f_{\text{direct}}$ satisfying $w_1 = \beta_1$, $w_2 = \beta_2$ and $g_1 + g_2 = \beta_3$ achieves zero error on clean data. Therefore, the overall optimization problem can be reduced to the following one:

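$$\min_{g_1, g_2} \max \left\{ \mathcal{L}\big(y, f_{\text{direct}}(x_1 + \delta_1, x_2)\big),\ \mathcal{L}\big(y, f_{\text{direct}}(x_1, x_2 + \delta_2)\big) \right\} \quad \text{s.t. } g_1 + g_2 = \beta_3 \qquad (3)$$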

If we use a standard expected squared loss $\mathcal{L}(y, f(x_1, x_2)) = \mathbb{E}[(y - f(x_1, x_2))^2]$ and solve the optimization problem, we obtain the optimal value $\mathcal{L}^*_{\text{MaxSSN}}$ with corresponding parameters $g_1^*, g_2^*$. There are three cases based on the relative sizes of the $\|\beta_i\|_2$’s.

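$$(\mathcal{L}^*_{\text{MaxSSN}},\, g_1^*,\, g_2^*) = \begin{cases} \left( \sigma^2 \|\beta_2\|_2^2,\ \beta_3,\ 0 \right) & \text{if } \|\beta_1\|_2^2 + \|\beta_3\|_2^2 \leq \|\beta_2\|_2^2 \\ \left( \sigma^2 \|\beta_1\|_2^2,\ 0,\ \beta_3 \right) & \text{if } \|\beta_2\|_2^2 + \|\beta_3\|_2^2 \leq \|\beta_1\|_2^2 \\ \left( \sigma^2 \left( \dfrac{\|\beta_1\|_2^2 + \|\beta_2\|_2^2}{2} + \dfrac{\|\beta_3\|_2^2}{4} + \dfrac{(\|\beta_2\|_2^2 - \|\beta_1\|_2^2)^2}{4 \|\beta_3\|_2^2} \right),\ \dfrac{1}{2}\left( 1 + \dfrac{\|\beta_2\|_2^2 - \|\beta_1\|_2^2}{\|\beta_3\|_2^2} \right) \beta_3,\ \dfrac{1}{2}\left( 1 - \dfrac{\|\beta_2\|_2^2 - \|\beta_1\|_2^2}{\|\beta_3\|_2^2} \right) \beta_3 \right) & \text{otherwise} \end{cases} \qquad (4)$$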

The three cases reflect the relative influence of each weight vector for $z_i$. For instance, if $z_2$ has larger importance than the rest in generating $y$, the optimal way of balancing the effect of noise over $z_3$ is to remove all the influence of $z_3$ in $x_2$ by setting $g_2 = 0$. When neither $z_1$ nor $z_2$ dominates, i.e., $\left| \frac{\|\beta_2\|_2^2 - \|\beta_1\|_2^2}{\|\beta_3\|_2^2} \right| < 1$, the optimal solution equalizes the two single source losses, $\mathcal{L}(y, f_{\text{direct}}(x_1 + \delta_1, x_2)) = \mathcal{L}(y, f_{\text{direct}}(x_1, x_2 + \delta_2))$.
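The balanced case can be traced with a short derivation. This is a sketch under the noise model above, assuming the noise is independent of the data, the clean error is zero ($g_1 + g_2 = \beta_3$), and $g_1$, $g_2$ are parametrized along $\beta_3$; the scalar $a$ is notation introduced here for illustration only.

```latex
\begin{aligned}
\mathcal{L}\big(y, f_{\text{direct}}(x_1+\delta_1, x_2)\big)
  &= \mathbb{E}\big[(h_1^T \delta_1)^2\big]
   = \sigma^2\big(\|\beta_1\|_2^2 + \|g_1\|_2^2\big), \\
\mathcal{L}\big(y, f_{\text{direct}}(x_1, x_2+\delta_2)\big)
  &= \mathbb{E}\big[(h_2^T \delta_2)^2\big]
   = \sigma^2\big(\|\beta_2\|_2^2 + \|g_2\|_2^2\big), \\
g_1 = a\,\beta_3,\quad g_2 = (1-a)\,\beta_3
  \ &\Rightarrow\
  \|\beta_1\|_2^2 + a^2 \|\beta_3\|_2^2
   = \|\beta_2\|_2^2 + (1-a)^2 \|\beta_3\|_2^2 \\
  &\Rightarrow\
  a = \frac{1}{2}\left(1 + \frac{\|\beta_2\|_2^2 - \|\beta_1\|_2^2}{\|\beta_3\|_2^2}\right).
\end{aligned}
```

When this equalizing value of $a$ falls outside $[0, 1]$, one of the two losses can no longer be reduced by shifting weight, and the optimum sits at $a = 1$ or $a = 0$, which corresponds to the first two cases.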

Comparison with the standard robust fusion model

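Minimizing the loss with noise added to a model’s input is a standard process in robust learning. The same strategy can be applied to learn fusion models by considering all sources as a single combined source and adding noise to all the sources at once. However, this simple strategy cannot achieve low error in terms of single source robustness. The optimal solution to $\min_{g_1, g_2} \mathbb{E}[(y - f_{\text{direct}}(x_1 + \delta_1, x_2 + \delta_2))^2]$, a least squares solution, is achieved when $g_1 = g_2 = \beta_3 / 2$. The corresponding MaxSSN loss, denoted here by $\mathcal{L}'_{\text{MaxSSN}}$, evaluates to $\mathcal{L}'_{\text{MaxSSN}} = \sigma^2 \max\left\{ \|\beta_1\|_2^2 + \frac{1}{4}\|\beta_3\|_2^2,\ \|\beta_2\|_2^2 + \frac{1}{4}\|\beta_3\|_2^2 \right\}$. A nontrivial gap exists between $\mathcal{L}'_{\text{MaxSSN}}$ and $\mathcal{L}^*_{\text{MaxSSN}}$, which is directly proportional to the data model’s inherent characteristics:

$$\mathcal{L}'_{\text{MaxSSN}} - \mathcal{L}^*_{\text{MaxSSN}} \geq \sigma^2 \cdot \begin{cases} \dfrac{1}{4} \|\beta_3\|_2^2 & \text{if } \left| \dfrac{\|\beta_2\|_2^2 - \|\beta_1\|_2^2}{\|\beta_3\|_2^2} \right| \geq 1 \\ \dfrac{1}{4} \left| \|\beta_2\|_2^2 - \|\beta_1\|_2^2 \right| & \text{otherwise} \end{cases} \qquad (5)$$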

If either $z_1$ or $z_2$ has more influence on the target value $y$ than the other components, the single source robustness of the model trained with the MaxSSN loss is better than that of the fusion model trained for general noise robustness, by an amount proportional to the influence of the shared feature $z_3$. Otherwise, the gap’s lower bound is proportional to the difference in complementary information, $\left| \|\beta_2\|_2^2 - \|\beta_1\|_2^2 \right| / 4$.
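The closed form and the gap can be sanity-checked numerically. The sketch below works directly with squared norms and assumes $g_1 = a\beta_3$, $g_2 = (1-a)\beta_3$; the function names and the example values are hypothetical and chosen only for illustration.

```python
def optimal_max_ssn(n1, n2, n3, sigma2=1.0):
    """Closed-form optimum; n_i = ||beta_i||_2^2. Returns (loss, a) with g1 = a * beta_3."""
    if n1 + n3 <= n2:
        return sigma2 * n2, 1.0
    if n2 + n3 <= n1:
        return sigma2 * n1, 0.0
    a = 0.5 * (1.0 + (n2 - n1) / n3)
    loss = sigma2 * ((n1 + n2) / 2 + n3 / 4 + (n2 - n1) ** 2 / (4 * n3))
    return loss, a

def equal_split_max_ssn(n1, n2, n3, sigma2=1.0):
    """MaxSSN of the least squares solution g1 = g2 = beta_3 / 2."""
    return sigma2 * (max(n1, n2) + n3 / 4)

def brute_force_max_ssn(n1, n2, n3, sigma2=1.0, steps=1000):
    """Grid search over a in [0, 1] as an independent check of the closed form."""
    best = float("inf")
    for k in range(steps + 1):
        a = k / steps
        best = min(best, sigma2 * max(n1 + a * a * n3, n2 + (1 - a) ** 2 * n3))
    return best

# Balanced case: neither source-specific component dominates.
opt_bal, a_bal = optimal_max_ssn(1.0, 2.0, 4.0)
gap_bal = equal_split_max_ssn(1.0, 2.0, 4.0) - opt_bal

# Dominated case: ||beta_1||^2 + ||beta_3||^2 <= ||beta_2||^2.
opt_dom, a_dom = optimal_max_ssn(1.0, 6.0, 4.0)
gap_dom = equal_split_max_ssn(1.0, 6.0, 4.0) - opt_dom
print(opt_bal, a_bal, gap_bal, opt_dom, a_dom, gap_dom)
```

In the dominated example the gap equals $\sigma^2 \|\beta_3\|_2^2 / 4$, while in the balanced example it stays above $\sigma^2 \left| \|\beta_2\|_2^2 - \|\beta_1\|_2^2 \right| / 4$.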

Remark 1.

In linear systems such as the one studied above, having redundant information in the feature space is similar to multicollinearity in statistics. In this case, feature selection methods usually try to remove such redundancy. However, this redundant or shared information helps prevent degradation of the fusion model when a subset of the input sources is corrupted.

Remark 2.

Similar analyses and a loss definition against adversarial attacks (Goodfellow et al., 2015) are provided in appendix A.2.

4 Robust Deep Fusion Models

In simple linear settings, our analyses illustrate that using the MaxSSN loss can effectively minimize the degradation of a fusion model’s performance. This suggests a training strategy that equips complex deep fusion models with robustness against single source faults. A principal factor in designing a common framework for our algorithms is preserving the model’s performance on clean data while minimizing a loss for defending against corruption. Therefore, our training algorithms use data augmentation so that the model encounters both clean and corrupted data. The second way of achieving robustness is to take advantage of the fusion method’s structure. A simple but effective method of mixing convolutional features coming from different input sources is introduced later in this section.

4.1 Robust training algorithms for single source noise

Our common training framework alternately provides clean samples and corrupted samples per iteration to preserve the original performance of the model on uncontaminated data (see Footnote 2). On top of this strategy, one standard robust training scheme and two algorithms for minimizing the MaxSSN loss are introduced for handling robustness against noise in different sources.

Footnote 2: We also tried fine-tuning only a subset of the model’s parameters, $\boldsymbol{\theta}_{\text{fusion}} \subset \boldsymbol{\theta}_f$, to preserve essential parts for extracting features from normal data. However, training the whole network from the beginning shows better performance in practice. See Appendix B for a detailed comparison.

Standard robust training method

A standard robust training algorithm can be developed by considering all $n_s$ sources as a single combined source. Given noise generating functions $\varphi_i(\cdot)$ ($i \in [n_s]$), the algorithm generates and adds corruption to all the sensors at once. Then the corresponding loss can be computed to update parameters using back-propagation. This algorithm is denoted by TrainASN and tested in experiments to investigate whether the procedure is also able to cover robustness against single source noise.

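for $i_{iter} = 1$ to $m$ do
    Sample $(y, \{x_i\}_{i=1}^{n_s})$
    if $i_{iter} \equiv 1 \pmod{2}$ then
        for $j = 1$ to $n_s$ do
            Generate noise $\epsilon_j = \varphi_j(x_j)$
            $\hat{\mathcal{L}}_j^{(i_{iter})} \leftarrow \mathcal{L}(y, f(\{x_j + \epsilon_j, x_{-j}\}))$
        end for
        $\mathcal{L}^{(i_{iter})} \leftarrow \max_j \hat{\mathcal{L}}_j^{(i_{iter})}$ (evaluated again on the maximizing corrupted sample, $\mathcal{L}(y, f(\{x_j + \epsilon_j, x_{-j}\}))$)
    else
        $\mathcal{L}^{(i_{iter})} \leftarrow \mathcal{L}(y, f(\{x_i\}_{i=1}^{n_s}))$
    end if
    Update $f$ using $\mathcal{L}^{(i_{iter})}$
end for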
Figure 1: TrainSSN
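for $i_{iter} = 1$ to $m$ do
    Sample $(y, \{x_i\}_{i=1}^{n_s})$
    if $i_{iter} \equiv 1 \pmod{2}$ then
        $j \leftarrow (\lfloor i_{iter} / 2 \rfloor \bmod n_s) + 1$
        Generate noise $\epsilon_j = \varphi_j(x_j)$
        $\mathcal{L}^{(i_{iter})} \leftarrow \mathcal{L}(y, f(\{x_j + \epsilon_j, x_{-j}\}))$
    else
        $\mathcal{L}^{(i_{iter})} \leftarrow \mathcal{L}(y, f(\{x_i\}_{i=1}^{n_s}))$
    end if
    Update $f$ using $\mathcal{L}^{(i_{iter})}$
end for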
Figure 2: TrainSSNAlt

Minimization of MaxSSN loss

Minimization of the MaxSSN loss requires $n_s$ (the number of input sources) forward propagations within one iteration. Each propagation needs a different set of corrupted samples generated by adding single source noise to the fixed clean mini-batch of data. There are two possible approaches to compute gradients properly from these multiple passes. First, we can run back-propagation $n_s$ times and save the gradients temporarily without updating any parameters; the saved gradients corresponding to the maximum loss are then used for updating parameters. However, this process requires not only $n_s$ forward and backward passes but also memory proportional to $n_s$ for saving the gradients. Another reasonable approach is to run $n_s$ forward passes to find the maximum loss and compute gradients by going back to the corresponding set of corrupted samples. TrainSSN adopts this idea for its efficiency: $n_s + 1$ forward passes and one back-propagation. A faster version of the algorithm, TrainSSNAlt, is also considered, since multiple forward passes may take longer as the number of sources increases. This algorithm ignores the maximum loss and cycles through the sources, corrupting one of them at a time. By a slight abuse of notation, symbols used in our algorithms also represent iteration steps with mini-batches of size greater than one. Also, $f(x_1, \ldots, x_{j-1}, x_j + \epsilon_j, x_{j+1}, \ldots, x_{n_s})$ is shortened to $f(\{x_j + \epsilon_j, x_{-j}\})$ in the algorithms.
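The two training schemes can be sketched on a toy problem. Everything below is an illustrative assumption rather than our TensorFlow implementation: the model is a two-weight linear fusion of scalar sources, both sources observe only a shared latent value, updates are plain SGD on a squared loss, and the names `train_ssn`, `train_ssn_alt`, `corrupt`, and `sgd_update` are hypothetical.

```python
import random

def predict(w, xs):
    # Toy linear fusion model: one scalar weight per scalar source.
    return sum(wi * xi for wi, xi in zip(w, xs))

def loss(w, y, xs):
    return (y - predict(w, xs)) ** 2

def sgd_update(w, y, xs, lr):
    # One forward pass plus one gradient step on the squared error.
    err = predict(w, xs) - y
    return [wi - lr * 2.0 * err * xi for wi, xi in zip(w, xs)]

def corrupt(xs, j, noise_fns):
    # Add noise to source j only; all other sources stay clean.
    out = list(xs)
    out[j] = xs[j] + noise_fns[j](xs[j])
    return out

def train_ssn(w, sample, noise_fns, m, lr):
    n_s = len(noise_fns)
    for i_iter in range(1, m + 1):
        y, xs = sample()
        if i_iter % 2 == 1:
            # n_s forward passes to find the worst single source corruption.
            candidates = [corrupt(xs, j, noise_fns) for j in range(n_s)]
            worst = max(candidates, key=lambda c: loss(w, y, c))
            w = sgd_update(w, y, worst, lr)  # revisit that sample and update
        else:
            w = sgd_update(w, y, xs, lr)     # clean iteration
    return w

def train_ssn_alt(w, sample, noise_fns, m, lr):
    n_s = len(noise_fns)
    for i_iter in range(1, m + 1):
        y, xs = sample()
        if i_iter % 2 == 1:
            j = (i_iter // 2) % n_s          # 0-based source index, cycling
            w = sgd_update(w, y, corrupt(xs, j, noise_fns), lr)
        else:
            w = sgd_update(w, y, xs, lr)
    return w

rng = random.Random(0)

def sample():
    z = rng.gauss(0.0, 1.0)  # shared latent component; the target is y = z
    return z, [z, z]         # both sources observe only the shared part

noise_fns = [lambda x: rng.gauss(0.0, 1.0)] * 2
# Start from a fully unbalanced, clean-optimal model: all weight on source 1.
w_ssn = train_ssn([1.0, 0.0], sample, noise_fns, m=20000, lr=0.002)
w_alt = train_ssn_alt([1.0, 0.0], sample, noise_fns, m=20000, lr=0.002)
print(w_ssn, w_alt)
```

In this toy setting both schemes move the weights away from the unbalanced starting point toward a roughly even split across the two sources, which mirrors the balancing effect the MaxSSN loss is designed to produce. The alternating clean iterations keep the sum of the weights close to the clean-optimal value.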

4.2 Feature fusion methods

Fusion of features extracted from multiple input sources can be done in various ways (Chen et al., 2017). One popular method is to fuse via an element-wise mean operation (Ku et al., 2018), but this requires every feature to have the same shape, i.e., width, height, and number of channels for a 3D feature. An element-wise mean can also be viewed as averaging channels from different 3D features, and it carries the underlying assumption that the channels of each feature share the same information regardless of the originating input source. Therefore, the risk of becoming vulnerable to single source corruption may increase with this simple mean fusion method.

Figure 3: Latent ensemble layer (LEL)

Our fusion method, latent ensemble layer (LEL), is devised for three objectives: (i) maintaining the known advantage—error reduction—of ensemble methods (Tumer and Ghosh, 1996b, a), (ii) admitting source-specific features to survive even after the fusion procedure, and (iii) allowing each source to provide a different number of channels. The proposed layer learns parameters so that channels of the 3D features from the different sources can be selectively mixed. Sparse constraints are introduced to let the training procedure find good subsets of channels to be fused across the ns feature maps. For example, mixing the ith channel of the convolutional feature from an RGB image with the jth and kth channels of the LIDAR’s latent feature is possible in our LEL, whereas in an element-wise mean layer the ith latent channel from RGB is only mixed with the other sources’ ith channels.

We also apply an activation function to add semi-adaptive behavior to the fusion procedure. Definition 2 explains the details of our LEL, and Figure 3 visualizes the overall process. In practice, this layer can be easily constructed by using $1 \times 1$ convolutions with the ReLU activation and $\ell_1$ constraints. The output channel depth is set to $\hat{d} = \max_i \{d_i\}$ in the experiments.

Definition 2 (Latent ensemble layer).

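Suppose we have $n_s$ convolutional features $z_i \in \mathbb{R}^{a \times b \times d_i}$ from different input sources ($i \in [n_s]$), which can be stacked as $z = (z_1, \ldots, z_{n_s}) \in \mathbb{R}^{a \times b \times d_{sum}}$ with $d_{sum} = \sum_{i=1}^{n_s} d_i$. The $k$th channel of the stacked feature is denoted by $[z]_k \in \mathbb{R}^{a \times b}$. Let $\mathbf{w}_j = (w_1^{(j)}, \ldots, w_{d_{sum}}^{(j)})$ be a $d_{sum}$-dimensional weight vector that mixes the $z_i$’s channel-wise. Then LEL outputs $\hat{z} \in \mathbb{R}^{a \times b \times \hat{d}}$, where each channel is computed as $[\hat{z}]_j = \phi(\mathbf{w}_j \cdot z) \triangleq \phi\left( \sum_{k=1}^{d_{sum}} w_k^{(j)} [z]_k \right)$, with some activation function $\phi$ and sparsity constraints $\|\mathbf{w}_j\|_0 \leq t$ for all $j \in \{1, \ldots, \hat{d}\}$.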
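The forward pass of Definition 2 can be written in a few lines. The following is a minimal sketch, not our TensorFlow layer: features are stored channel-first as nested lists, the function names are hypothetical, and the $\ell_1$ penalty stands in for the sparsity constraint as a regularization term that would be added to the training loss.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def latent_ensemble_layer(features, weights, activation=relu):
    """Channel-wise mixing of stacked features.

    features: one entry per source; each is a list of d_i channels,
              and each channel is an a x b grid (list of rows).
    weights:  d_hat weight vectors, each of length d_sum.
    """
    stacked = [ch for feat in features for ch in feat]  # d_sum channels
    a, b = len(stacked[0]), len(stacked[0][0])
    out = []
    for w_j in weights:
        assert len(w_j) == len(stacked)
        out.append([[activation(sum(w * ch[r][c] for w, ch in zip(w_j, stacked)))
                     for c in range(b)] for r in range(a)])
    return out

def l1_penalty(weights):
    # Relaxation of the sparsity constraint, usable as a regularizer.
    return sum(abs(w) for w_j in weights for w in w_j)

# Toy usage: two RGB channels and one LIDAR channel on a 1 x 2 grid.
rgb = [[[1.0, 2.0]], [[3.0, 4.0]]]
lidar = [[[5.0, 6.0]]]
mix = [[0.5, 0.0, 0.5],   # averages RGB channel 1 with the LIDAR channel
       [0.0, 1.0, -1.0]]  # RGB channel 2 minus LIDAR, clipped by ReLU
fused = latent_ensemble_layer([rgb, lidar], mix)
penalty = l1_penalty(mix)
print(fused, penalty)
```

Unlike an element-wise mean, the sources here contribute different numbers of channels, and a zero weight lets an output channel ignore a source entirely.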

5 Experimental Results

We test our algorithms and the LEL fusion method on 3D and BEV object detection tasks using the car class of the KITTI dataset (Geiger et al., 2012). As our experiments include random generation of corruption, each task is evaluated 5 times to compare average scores (reported with 95% confidence intervals), and thus a validation set is used for ease of manipulating data and repetitive evaluation. We follow the split of Ku et al. (2018), with 3712 and 3769 frames for the training and validation sets, respectively. Results are reported for the three difficulty levels defined by KITTI (easy, moderate, hard), and the standard object detection metric, Average Precision (AP), is used. A recent open-sourced 3D object detector, AVOD (Ku et al., 2018), with a feature pyramid network is selected as the baseline algorithm. Four different algorithms are compared: AVOD trained on (i) clean data, (ii) data augmented with ASN samples (TrainASN), (iii) SSN augmented data with direct MaxSSN loss minimization (TrainSSN), and (iv) SSN augmented data (TrainSSNAlt). The AVOD architecture is varied to use either element-wise mean fusion layers or our LELs. We follow the original training setup of AVOD, e.g., 120k iterations using an ADAM optimizer with an initial learning rate of 0.0001 (see Footnote 3).

Footnote 3: Our methods are implemented with TensorFlow on top of the official AVOD code. The computing machine has an Intel Xeon E5-1660v3 CPU with Nvidia Titan X Pascal GPUs. The source code is available at: https://github.com/twankim/avod_ssn

Figure 4: Visualization of corrupted samples, (top) RGB images (bottom) LIDAR point clouds. Panels: (a) Original, (b) Gaussian noise, (c) Downsampling. The point clouds are projected onto the 2D image plane for easier visual comparison.

Corruption methods

Gaussian noise generated i.i.d. from $\mathcal{N}(0, \sigma_{\text{Gaussian}}^2)$ is directly added to the pixel values of an image $(r, g, b)$ and the coordinate values of a LIDAR point $(x, y, z)$. $\sigma_{\text{Gaussian}}$ is set to $0.75\tau$ experimentally, with $\tau_{\text{RGB}} = 255$ and $\tau_{\text{LIDAR}} = 0.2$. The second method, downsampling, selects only 16 out of the 64 lasers of the LIDAR data. To match this effect, 3 out of every 4 horizontal lines of an RGB image are deleted. Effects of corruption on each input source are visualized in Figure 4, where the color of a 2D LIDAR image represents the distance from the sensor. Although our analyses in Section 3.2 assume the noise variances to be identical, it is nontrivial to set equal noise levels for different modalities in practice, e.g., RGB pixels vs. points in a 3D space. Nevertheless, the underlying objective of our MaxSSN loss, balancing the degradation rates of different input sources’ faults, does not depend on the choice of noise types or levels.
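The two corruption methods can be sketched as follows. This is an illustration under stated assumptions rather than our data pipeline: which lasers and which image rows are kept (every fourth one here), the choice to zero out deleted rows instead of removing them, the absence of clipping after adding noise, and all function names are hypothetical.

```python
import random

def gaussian_sigma(tau, scale=0.75):
    # sigma_Gaussian = 0.75 * tau, with tau = 255 for RGB and 0.2 for LIDAR.
    return scale * tau

def add_gaussian_noise(values, tau, rng):
    sigma = gaussian_sigma(tau)
    return [v + rng.gauss(0.0, sigma) for v in values]

def downsample_lidar(points, total_lasers=64, kept_lasers=16):
    """points: (x, y, z, laser_id) tuples. Assumption: keep every 4th laser."""
    step = total_lasers // kept_lasers
    return [p for p in points if p[3] % step == 0]

def delete_image_rows(image, keep_every=4):
    """image: list of pixel rows. Assumption: deleted rows are zeroed out."""
    width = len(image[0])
    return [row if r % keep_every == 0 else [0] * width
            for r, row in enumerate(image)]

rng = random.Random(0)
noisy_pixels = add_gaussian_noise([10.0, 128.0, 250.0], tau=255, rng=rng)
points = [(float(i), 0.0, 0.0, i % 64) for i in range(128)]
sparse_points = downsample_lidar(points)
image = [[r + 1] * 3 for r in range(8)]
sparse_image = delete_image_rows(image)
print(len(sparse_points), sparse_image)
```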

Evaluation metrics for single source robustness

To assess the robustness against single source noise, a new metric, minAP, is introduced. The AP score is evaluated on the dataset with a single corrupted input source; after going over all $n_s$ sources, minAP reports the lowest of the $n_s$ AP scores. Our second metric, maxDiffAP, computes the maximum absolute difference among the scores, which measures how balanced the single source robustness is across input sources; a low maxDiffAP value indicates well-balanced robustness.
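Both metrics reduce to simple operations over the per-source AP scores. The sketch below uses hypothetical function names and made-up AP values purely for illustration; they are not results from our experiments.

```python
def min_ap(ap_per_source):
    """Lowest AP over the n_s single source corruption settings."""
    return min(ap_per_source)

def max_diff_ap(ap_per_source):
    """Largest absolute difference between any two per-source AP scores."""
    return max(ap_per_source) - min(ap_per_source)

# Hypothetical AP scores: [RGB corrupted, LIDAR corrupted].
scores = [70.0, 55.5]
worst_case = min_ap(scores)
imbalance = max_diff_ap(scores)
print(worst_case, imbalance)
```

The largest pairwise absolute difference always equals the maximum minus the minimum, so no explicit loop over pairs is needed.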

Table 1: Car detection (3D/BEV) performance of AVOD with element-wise mean fusion layers and latent ensemble layers (LEL) against Gaussian SSN on the KITTI validation set.

| Fusion method | Data: metric (%) | Train algo. | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard |
|---|---|---|---|---|---|---|---|---|
| Mean | Clean data: AP | AVOD (Ku et al., 2018) | 76.41 | 72.74 | 66.86 | 89.33 | 86.49 | 79.44 |
| Mean | Clean data: AP | +TrainASN | 75.96 | 66.68 | 65.97 | 88.63 | 79.45 | 78.79 |
| Mean | Clean data: AP | +TrainSSN | 76.28 | 67.10 | 66.51 | 88.86 | 79.60 | 79.11 |
| Mean | Clean data: AP | +TrainSSNAlt | 77.46 | 67.61 | 66.06 | 89.68 | 86.71 | 79.41 |
| Mean | Gaussian SSN: minAP | AVOD (Ku et al., 2018) | 47.41±0.28 | 41.84±0.17 | 36.47±0.16 | 65.63±0.28 | 58.02±0.23 | 50.43±0.14 |
| Mean | Gaussian SSN: minAP | +TrainASN | 61.53±0.57 | 52.72±0.08 | 47.25±0.13 | 87.71±0.14 | 78.37±0.06 | 77.85±0.08 |
| Mean | Gaussian SSN: minAP | +TrainSSN | 71.65±0.31 | 62.14±0.08 | 56.78±0.12 | 88.21±0.08 | 78.90±0.09 | 77.92±0.11 |
| Mean | Gaussian SSN: minAP | +TrainSSNAlt | 71.66±0.48 | 57.61±0.12 | 55.90±0.11 | 89.42±0.04 | 79.56±0.06 | 77.92±0.05 |
| Mean | Gaussian SSN: maxDiffAP | AVOD (Ku et al., 2018) | 26.70±0.52 | 22.42±0.29 | 20.92±0.25 | 22.27±0.41 | 20.76±0.33 | 20.09±0.20 |
| Mean | Gaussian SSN: maxDiffAP | +TrainASN | 14.48±0.82 | 12.72±0.33 | 11.18±0.27 | 0.88±0.22 | 0.48±0.13 | 0.28±0.12 |
| Mean | Gaussian SSN: maxDiffAP | +TrainSSN | 3.71±0.46 | 3.42±0.25 | 7.50±0.25 | 0.36±0.17 | 0.04±0.15 | 0.71±0.17 |
| Mean | Gaussian SSN: maxDiffAP | +TrainSSNAlt | 5.55±0.81 | 8.73±0.32 | 2.91±0.22 | 0.09±0.14 | 0.13±0.11 | 0.18±0.11 |
| LEL | Clean data: AP | AVOD (Ku et al., 2018) | 77.79 | 67.69 | 66.31 | 88.90 | 85.64 | 78.86 |
| LEL | Clean data: AP | +TrainASN | 75.00 | 64.75 | 58.28 | 88.30 | 78.60 | 77.23 |
| LEL | Clean data: AP | +TrainSSN | 74.25 | 65.00 | 63.83 | 87.88 | 78.84 | 77.66 |
| LEL | Clean data: AP | +TrainSSNAlt | 76.04 | 66.42 | 64.41 | 88.80 | 79.53 | 78.53 |
| LEL | Gaussian SSN: minAP | AVOD (Ku et al., 2018) | 61.97±0.55 | 53.95±0.42 | 47.24±0.27 | 79.44±0.09 | 72.46±3.14 | 68.25±0.06 |
| LEL | Gaussian SSN: minAP | +TrainASN | 74.24±0.38 | 58.25±0.16 | 56.13±0.10 | 88.10±0.26 | 78.19±0.13 | 70.42±0.07 |
| LEL | Gaussian SSN: minAP | +TrainSSN | 68.16±0.88 | 60.39±0.38 | 56.04±0.28 | 88.12±0.16 | 78.17±0.06 | 70.21±0.05 |
| LEL | Gaussian SSN: minAP | +TrainSSNAlt | 68.63±0.40 | 55.48±0.16 | 54.42±0.17 | 86.51±0.46 | 76.85±0.11 | 71.95±2.72 |
Table 2: Car detection (3D/BEV) performance of AVOD with latent ensemble layers (LEL) against downsampling SSN on the KITTI validation set.

| Data: metric (%) | Train algo. | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard |
|---|---|---|---|---|---|---|---|
| Clean data: AP | AVOD (Ku et al., 2018) | 77.79 | 67.69 | 66.31 | 88.90 | 85.64 | 78.86 |
| Clean data: AP | +TrainASN | 71.74 | 61.78 | 60.26 | 87.29 | 77.08 | 75.89 |
| Clean data: AP | +TrainSSN | 75.54 | 66.26 | 63.72 | 88.07 | 79.18 | 78.03 |
| Clean data: AP | +TrainSSNAlt | 76.22 | 66.05 | 63.87 | 89.00 | 79.65 | 78.03 |
| Downsample SSN: minAP | AVOD (Ku et al., 2018) | 61.70 | 51.66 | 46.17 | 86.08 | 69.99 | 61.55 |
| Downsample SSN: minAP | +TrainASN | 65.74 | 53.49 | 51.35 | 82.27 | 67.88 | 65.79 |
| Downsample SSN: minAP | +TrainSSN | 73.33 | 57.85 | 54.91 | 86.61 | 76.07 | 68.59 |
| Downsample SSN: minAP | +TrainSSNAlt | 64.77 | 53.34 | 48.29 | 85.27 | 69.87 | 67.77 |

Results

When the fusion model uses element-wise mean fusion (Table 1), the TrainSSN algorithm shows the best single source robustness against Gaussian SSN while preserving the original performance on clean data (with only a small decrease in moderate BEV detection). Also, the imbalance between the two input sources’ performance, measured by maxDiffAP, is dramatically reduced compared to the model trained without robust learning and to the naive TrainASN method.

Encouragingly, the AVOD model constructed with our LEL method already achieves relatively high robustness without any robust learning strategy, compared to the mean fusion layers. For all the tasks, minAP scores are dramatically increased, e.g., 61.97 vs. 47.41 minAP for the easy 3D detection task, and the maxDiffAP scores are decreased (maxDiffAP scores for AVOD with LEL are reported in Appendix B). The robustness is then further improved by minimizing our MaxSSN loss. As our LEL’s structure inherently handles corruption on a single source well, even the TrainASN algorithm can successfully guide the model toward the desired robustness. A corruption method with a different style, downsampling, is also tested with our LEL. Table 2 shows that the model achieves the best performance among the four algorithms when trained with our TrainSSN.

Remark 3.

The simple TrainSSNAlt achieves fairly robust models with both fusion methods against Gaussian noise, and two reasons may explain this phenomenon. First, all parameters are updated instead of fine-tuning only the fusion-related parts. Therefore, unlike in our analyses of the linear model, the latent representation can be transformed to meet the objective function. In fact, TrainSSNAlt performs poorly when we fine-tune the model with concatenation fusion layers, as shown in the supplement. Second, the loss function inside our MaxSSN is usually non-convex, so an indirect approach may be enough for a small number of sources, $n_s = 2$.

6 Conclusion

We study two strategies to improve the robustness of fusion models against single source corruption. Motivated by analyses of linear fusion models, a loss function is introduced to balance the performance degradation of deep fusion models caused by corruption in different sources. We also demonstrate the importance of a fusion method’s structure by proposing a simple ensemble layer that achieves such robustness inherently. Our experimental results show that deep fusion models can effectively use complementary and shared information of different input sources by training with our loss and fusion layer to obtain both robustness and high accuracy. We hope our results motivate further work to improve the single source robustness of more complex fusion models with either a large number of input sources or adaptive networks. Another interesting direction is to investigate the single source robustness against adversarial attacks in deep fusion models, which can be compared with our analyses in the supplementary material.

References

  • Braun et al. [2016] Markus Braun, Qing Rao, Yikang Wang, and Fabian Flohr. Pose-rcnn: Joint object detection and pose estimation using 3d object proposals. In IEEE 19th international conference on intelligent transportation systems (ITSC), pages 1546–1551, 2016.
  • Chan et al. [2016] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4960–4964, 2016.
  • Chen et al. [2017] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In IEEE conference on computer vision and pattern recognition (CVPR), pages 1907–1915, 2017.
  • Chiu et al. [2018] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4774–4778, 2018.
  • Chorowski et al. [2015] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in neural information processing systems (NeurIPS), pages 577–585, 2015.
  • Chung et al. [2017] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In IEEE conference on computer vision and pattern recognition (CVPR), pages 3444–3453, 2017.
  • Dai et al. [2016] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (NeurIPS), pages 379–387, 2016.
  • Du et al. [2017] Xinxin Du, Marcelo H Ang, and Daniela Rus. Car detection for autonomous vehicle: Lidar and vision fusion approach through deep learning framework. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 749–754, 2017.
  • Feng et al. [2019] Di Feng, Christian Haase-Schuetz, Lars Rosenbaum, Heinz Hertlein, Fabian Duffhauss, Claudius Glaeser, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. arXiv preprint arXiv:1902.07830, 2019.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE conference on computer vision and pattern recognition (CVPR), pages 3354–3361, 2012.
  • Girshick [2015] Ross Girshick. Fast r-cnn. In IEEE international conference on computer vision (ICCV), pages 1440–1448, 2015.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014.
  • Goodfellow et al. [2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International conference on learning representations (ICLR), 2015.
  • Graves et al. [2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6645–6649, 2013.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE international conference on computer vision (ICCV), pages 2961–2969, 2017.
  • Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE signal processing magazine, 29, 2012.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In IEEE conference on computer vision and pattern recognition (CVPR), pages 4700–4708, 2017.
  • Huang and Kingsbury [2013] Jing Huang and Brian Kingsbury. Audio-visual deep learning for noise robust speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 7596–7599, 2013.
  • Kim et al. [2018a] Jaekyum Kim, Junho Koh, Yecheol Kim, Jaehyung Choi, Youngbae Hwang, and Jun Won Choi. Robust deep multi-modal learning based on gated information fusion network. In Asian conference on computer vision (ACCV), 2018a.
  • Kim and Ghosh [2016] Taewan Kim and Joydeep Ghosh. Robust detection of non-motorized road users using deep learning on optical and lidar data. In IEEE 19th international conference on intelligent transportation systems (ITSC), pages 271–276, 2016.
  • Kim et al. [2018b] Taewan Kim, Michael Motro, Patrícia Lavieri, Saharsh Samir Oza, Joydeep Ghosh, and Chandra Bhat. Pedestrian detection with simplified depth prediction. In IEEE 21st international conference on intelligent transportation systems (ITSC), pages 2712–2717, 2018b.
  • Kiros et al. [2014] Ryan Kiros, Karteek Popuri, Dana Cobzas, and Martin Jagersand. Stacked multiscale feature learning for domain independent medical image segmentation. In International workshop on machine learning in medical imaging, pages 25–32. Springer, 2014.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NeurIPS), pages 1097–1105, 2012.
  • Ku et al. [2018] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 1–8, 2018.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
  • Liang et al. [2018] Ming Liang, Bin Yang, Shenlong Wang, and Raquel Urtasun. Deep continuous fusion for multi-sensor 3d object detection. In European conference on computer vision (ECCV), pages 641–656, 2018.
  • Liang et al. [2019] Ming Liang, Bin Yang, Yun Chen, Rui Hui, and Raquel Urtasun. Multi-task multi-sensor fusion for 3d object detection. In IEEE conference on computer vision and pattern recognition (CVPR), 2019.
  • Liu et al. [2015] Siqi Liu, Sidong Liu, Weidong Cai, Hangyu Che, Sonia Pujol, Ron Kikinis, Dagan Feng, Michael J Fulham, et al. Multimodal neuroimaging feature learning for multiclass diagnosis of alzheimer’s disease. IEEE transactions on biomedical engineering, 62(4):1132–1140, 2015.
  • Liu et al. [2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision (ECCV), pages 21–37. Springer, 2016.
  • Mees et al. [2016] Oier Mees, Andreas Eitel, and Wolfram Burgard. Choosing smartly: Adaptive multimodal fusion for object detection in changing environments. In IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 151–156, 2016.
  • Mroueh et al. [2015] Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. Deep multimodal learning for audio-visual speech recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2130–2134, 2015.
  • Qi et al. [2018] Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In IEEE conference on computer vision and pattern recognition (CVPR), pages 918–927, 2018.
  • Ramachandram and Taylor [2017] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE signal processing magazine, 34(6):96–108, 2017.
  • Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In IEEE conference on computer vision and pattern recognition (CVPR), pages 7263–7271, 2017.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE conference on computer vision and pattern recognition (CVPR), pages 779–788, 2016.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (NeurIPS), pages 91–99, 2015.
  • Sainath et al. [2013] Tara N Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana Ramabhadran. Deep convolutional neural networks for lvcsr. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 8614–8618, 2013.
  • Simonovsky et al. [2016] Martin Simonovsky, Benjamín Gutiérrez-Becker, Diana Mateus, Nassir Navab, and Nikos Komodakis. A deep metric for multimodal registration. In International conference on medical image computing and computer-assisted intervention, pages 10–18. Springer, 2016.
  • Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR), 2015.
  • Sui et al. [2015] Chao Sui, Mohammed Bennamoun, and Roberto Togneri. Listening with your eyes: Towards a practical visual speech recognition system using deep boltzmann machines. In IEEE international conference on computer vision (ICCV), pages 154–162, 2015.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition (CVPR), pages 1–9, 2015.
  • Tsipras et al. [2019] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International conference on learning representations (ICLR), 2019.
  • Tumer and Ghosh [1996a] Kagan Tumer and Joydeep Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341–348, 1996a.
  • Tumer and Ghosh [1996b] Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection science, 8(3-4):385–404, 1996b.
  • Valada et al. [2017] Abhinav Valada, Johan Vertens, Ankit Dhall, and Wolfram Burgard. Adapnet: Adaptive semantic segmentation in adverse environmental conditions. In IEEE international conference on robotics and automation (ICRA), pages 4644–4651, 2017.
  • Wang et al. [2018] Zining Wang, Wei Zhan, and Masayoshi Tomizuka. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection. In IEEE intelligent vehicles symposium (IV), pages 1–6, 2018.
  • Wu et al. [2013] Pengcheng Wu, Steven CH Hoi, Hao Xia, Peilin Zhao, Dayong Wang, and Chunyan Miao. Online multimodal deep similarity learning with application to image retrieval. In 21st ACM international conference on multimedia, pages 153–162. ACM, 2013.

Appendix A Proofs and Supplementary Analyses

A.1 Proofs and analyses for Section 3.2

Proof.

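The original MaxSSN loss minimization problem, with the additional constraint of preserving the loss on clean data, can be transformed into the problem stated in (3) thanks to the flexibility of $g_1$ and $g_2$ under the constraint $g_1+g_2=\beta_3$:

$$\min_{g_1,g_2}\max\{\mathcal{L}(y,f_{\text{direct}}(x_1+\delta_1,x_2)),\ \mathcal{L}(y,f_{\text{direct}}(x_1,x_2+\delta_2))\}\quad\text{s.t.}\quad g_1+g_2=\beta_3$$

Under the expected squared loss with the $f_{\text{direct}}$ function, the loss can be evaluated as

$$\begin{aligned}
\mathcal{L}(y,f_{\text{direct}}(x_1+\delta_1,x_2)) &= \mathbb{E}\left[\left(y-\left(\beta_1^T(z_1+\epsilon_1)+g_1^T(z_3+\epsilon_3)+\beta_2^Tz_2+g_2^Tz_3\right)\right)^2\right]\\
&= \mathbb{E}\left[\left(\beta_1^T\epsilon_1+g_1^T\epsilon_3\right)^2\right] && \left(\because\ y=\textstyle\sum_{i=1}^{3}\beta_i^Tz_i\right)\\
&= \sigma^2\left(\|\beta_1\|_2^2+\|g_1\|_2^2\right) && (\because\ \text{statistical assumption on } \epsilon_i)
\end{aligned}$$

Hence the equivalent problem (6) is obtained.

$$\sigma^2\min_{g_1,g_2}\max\{\|\beta_1\|_2^2+\|g_1\|_2^2,\ \|\beta_2\|_2^2+\|g_2\|_2^2\}\quad\text{s.t.}\quad g_1+g_2=\beta_3 \tag{6}$$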
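The step from the expectation to $\sigma^2(\|\beta_1\|_2^2+\|g_1\|_2^2)$ can be checked numerically. The sketch below is only an illustration: it assumes the noise components are independent, zero-mean Gaussian with variance $\sigma^2$ (our reading of the statistical assumption on $\epsilon_i$), and the vectors, sample size and function names are arbitrary choices that do not come from the paper.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_norm(a):
    return dot(a, a)


def empirical_single_source_loss(beta1, g1, sigma, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E[(beta1^T eps1 + g1^T eps3)^2].

    Assumption: eps1 and eps3 have i.i.d. N(0, sigma^2) entries.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps1 = [rng.gauss(0.0, sigma) for _ in beta1]
        eps3 = [rng.gauss(0.0, sigma) for _ in g1]
        total += (dot(beta1, eps1) + dot(g1, eps3)) ** 2
    return total / n_samples


# Illustrative values only.
beta1 = [1.0, -2.0]
g1 = [0.5, 0.5, 1.0]
sigma = 0.3

closed_form = sigma ** 2 * (sq_norm(beta1) + sq_norm(g1))
estimate = empirical_single_source_loss(beta1, g1, sigma)
```

With these values the closed form and the Monte Carlo estimate should agree up to sampling error, which shrinks as `n_samples` grows.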

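To simplify notation, substitute $g=g_1$, $v=\beta_3$, $c_1=\|\beta_1\|_2^2$, $c_2=\|\beta_2\|_2^2$, and solve the following convex optimization problem.

$$\min_{g}\max\{\|g\|_2^2+c_1,\ \|g-v\|_2^2+c_2\}$$

This problem can be solved by introducing a variable $\gamma$ as an upper bound on the inner maximum value:

$$\min_{g,\gamma}\ \gamma\quad\text{s.t.}\quad c_1+\|g\|_2^2-\gamma\le 0,\quad c_2+\|g-v\|_2^2-\gamma\le 0$$

The KKT conditions give:

$$\begin{aligned}
&\text{(Primal feasibility)} && c_1+\|g\|_2^2-\gamma\le 0,\quad c_2+\|g-v\|_2^2-\gamma\le 0\\
&\text{(Dual feasibility)} && \lambda_1\ge 0,\quad \lambda_2\ge 0\\
&\text{(Complementary slackness)} && \lambda_1\left(c_1+\|g\|_2^2-\gamma\right)=0,\quad \lambda_2\left(c_2+\|g-v\|_2^2-\gamma\right)=0\\
&\text{(Stationarity)} && \lambda_1+\lambda_2=1,\quad g=\frac{\lambda_2}{\lambda_1+\lambda_2}v
\end{aligned}$$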

Considering $\lambda_1+\lambda_2=1$ and $\lambda_1,\lambda_2\ge 0$, we first analyze the case $\lambda_1=0$. This gives $g=v$, and the complementary slackness condition yields $\gamma=c_2+\|g-v\|_2^2=c_2$. The case $\lambda_2=0$ can be analyzed with similar steps. If both $\lambda_1$ and $\lambda_2$ are positive, the complementary slackness condition gives $\gamma=c_1+\|g\|_2^2=c_2+\|g-v\|_2^2$, which balances the two terms of the original problem's maximum $\max\{c_1+\|g\|_2^2,\ c_2+\|g-v\|_2^2\}$. This case gives $\gamma=\frac{c_1+c_2}{2}+\frac{\|v\|_2^2}{4}+\frac{(c_2-c_1)^2}{4\|v\|_2^2}$ with $g=\frac{1}{2}\left(1+\frac{c_2-c_1}{\|v\|_2^2}\right)v$. Therefore, we obtain the result (4), which provides the fusion model robust against single source corruptions from random noise. ∎
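The three KKT cases above can be turned into a small closed-form solver. The following sketch is an illustration rather than the authors' code: the function names and test vectors are hypothetical, and the random search is only a sanity check that no sampled $g_1$ attains a smaller objective than the closed-form minimizer.

```python
import random


def sq_norm(a):
    return sum(x * x for x in a)


def maxssn_solution(beta1, beta2, beta3, sigma=1.0):
    """Minimise sigma^2 * max{|b1|^2 + |g1|^2, |b2|^2 + |g2|^2} s.t. g1 + g2 = beta3."""
    c1, c2, v2 = sq_norm(beta1), sq_norm(beta2), sq_norm(beta3)
    if c1 + v2 <= c2:      # lambda_1 = 0: g1 = beta3, optimal value c2
        a = 1.0
    elif c2 + v2 <= c1:    # lambda_2 = 0: g1 = 0, optimal value c1
        a = 0.0
    else:                  # both multipliers positive: the two losses are balanced
        a = 0.5 * (1.0 + (c2 - c1) / v2)
    g1 = [a * x for x in beta3]
    g2 = [(1.0 - a) * x for x in beta3]
    loss = sigma ** 2 * max(c1 + sq_norm(g1), c2 + sq_norm(g2))
    return g1, g2, loss


def maxssn_objective(g1, beta1, beta2, beta3, sigma=1.0):
    g2 = [v - g for v, g in zip(beta3, g1)]
    return sigma ** 2 * max(sq_norm(beta1) + sq_norm(g1), sq_norm(beta2) + sq_norm(g2))


# Illustrative, balanced example (neither source dominates).
beta1, beta2, beta3 = [1.0, 0.0], [0.0, 1.5], [1.0, 1.0, 1.0]
g1_star, g2_star, loss_star = maxssn_solution(beta1, beta2, beta3)

# Sanity check: random candidates for g1 should not beat the closed form.
rng = random.Random(0)
candidates = [[rng.uniform(-2.0, 2.0) for _ in beta3] for _ in range(2000)]
best_random = min(maxssn_objective(g, beta1, beta2, beta3) for g in candidates)
```

In the balanced case the returned loss coincides with the expression for $\gamma$ given in the proof, and `g1_star + g2_star` reproduces $\beta_3$ element-wise.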

Comparison to the model not considering MaxSSN loss

If random noise is added to $x_1$ and $x_2$ simultaneously, the objective becomes $\min_{g_1,g_2}\mathbb{E}[(y-f_{\text{direct}}(x_1+\delta_1,x_2+\delta_2))^2]$ instead of the MaxSSN loss. This is equivalent to minimizing $\sigma^2(\|\beta_1\|_2^2+\|\beta_2\|_2^2+\|g_1\|_2^2+\|g_2\|_2^2)$ subject to $g_1+g_2=\beta_3$, a simple convex problem whose solution is $g_1=g_2=\frac{\beta_3}{2}$. If we denote this model as $f'_{\text{direct}}$, then its MaxSSN loss is:

$$\mathcal{L}_{\text{MaxSSN}}(f'_{\text{direct}},\epsilon)=\mathcal{L}'_{\text{MaxSSN}}=\sigma^2\max\left\{\|\beta_1\|_2^2+\tfrac{1}{4}\|\beta_3\|_2^2,\ \|\beta_2\|_2^2+\tfrac{1}{4}\|\beta_3\|_2^2\right\}$$

Now, let's compute the difference $\mathcal{L}'_{\text{MaxSSN}}-\mathcal{L}^*_{\text{MaxSSN}}$.

Proof.

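As both terms include $\sigma^2$, let's assume $\sigma^2=1$ for ease of notation. Among the three cases in (4), consider the first case $\|\beta_1\|_2^2+\|\beta_3\|_2^2\le\|\beta_2\|_2^2$.

$$\mathcal{L}'_{\text{MaxSSN}}-\mathcal{L}^*_{\text{MaxSSN}}=\|\beta_2\|_2^2+\tfrac{1}{4}\|\beta_3\|_2^2-\|\beta_2\|_2^2=\tfrac{1}{4}\|\beta_3\|_2^2\qquad\left(\because\ \|\beta_2\|_2^2\ge\|\beta_1\|_2^2\right)$$

The second case can be shown similarly. Now assume that $\left|\frac{\|\beta_2\|_2^2-\|\beta_1\|_2^2}{\|\beta_3\|_2^2}\right|<1$ holds, and let $\|\beta_2\|_2^2\ge\|\beta_1\|_2^2$ without loss of generality. Then we can show that

$$\begin{aligned}
\mathcal{L}'_{\text{MaxSSN}}-\mathcal{L}^*_{\text{MaxSSN}} &= \|\beta_2\|_2^2+\tfrac{1}{4}\|\beta_3\|_2^2-\left(\frac{\|\beta_1\|_2^2+\|\beta_2\|_2^2}{2}+\frac{\|\beta_3\|_2^2}{4}+\frac{\left(\|\beta_2\|_2^2-\|\beta_1\|_2^2\right)^2}{4\|\beta_3\|_2^2}\right)\\
&= \frac{1}{2}\left(\|\beta_2\|_2^2-\|\beta_1\|_2^2\right)\left(1-\frac{\|\beta_2\|_2^2-\|\beta_1\|_2^2}{2\|\beta_3\|_2^2}\right)\\
&\ge \frac{1}{4}\left(\|\beta_2\|_2^2-\|\beta_1\|_2^2\right)\qquad\left(\because\ \|\beta_2\|_2^2\ge\|\beta_1\|_2^2\ \text{and}\ \left|\tfrac{\|\beta_2\|_2^2-\|\beta_1\|_2^2}{\|\beta_3\|_2^2}\right|<1\right)
\end{aligned}$$

Therefore, in our linear fusion model, simply optimizing under noise added to all the input sources at the same time cannot do better than minimizing the MaxSSN loss: the gap between the two is always nonnegative.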
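The gap derived above can be evaluated directly from the three squared norms. The sketch below is a hedged illustration with $\sigma^2=1$; the function names are hypothetical, and the three-case optimal value is our reading of result (4) as reconstructed from the preceding proof.

```python
def maxssn_optimal(c1, c2, v2):
    """Optimal MaxSSN value with sigma^2 = 1.

    c1 = ||beta_1||_2^2, c2 = ||beta_2||_2^2, v2 = ||beta_3||_2^2.
    """
    if c1 + v2 <= c2:   # source 2 dominates
        return c2
    if c2 + v2 <= c1:   # source 1 dominates
        return c1
    return 0.5 * (c1 + c2) + 0.25 * v2 + (c2 - c1) ** 2 / (4.0 * v2)


def maxssn_equal_split(c1, c2, v2):
    """MaxSSN value of the model with g1 = g2 = beta_3 / 2 (noise on all sources)."""
    return max(c1, c2) + 0.25 * v2


def gap(c1, c2, v2):
    return maxssn_equal_split(c1, c2, v2) - maxssn_optimal(c1, c2, v2)


# Imbalanced example: the gap equals ||beta_3||_2^2 / 4.
gap_imbalanced = gap(1.0, 5.0, 3.0)

# Balanced example: the gap is at least (c2 - c1) / 4.
gap_balanced = gap(1.0, 2.25, 3.0)
lower_bound = 0.25 * (2.25 - 1.0)
```

Scanning a grid of `(c1, c2, v2)` values with `gap` is a quick way to see that the difference never becomes negative in this linear model.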

A.2 Single Source Robustness against Adversarial attacks

Another important type of perturbation is an adversarial attack. Unlike the previously studied random noise, the perturbation to the input sources is itself optimized to maximize the loss, so that the worst case is considered. The adversarial version of the MaxSSN loss is defined as follows:

Definition 3.

For multiple sources $x_1,\dots,x_{n_s}$ and a target variable $y$, denote a predefined loss function by $\mathcal{L}$. If each input source $x_i$ is maximally perturbed with some additive noise $\eta_i\in\mathcal{S}_i$ for $i\in[n_s]$, the AdvMaxSSN loss for a model $f$ is defined as follows:

$$\mathcal{L}_{\text{AdvMaxSSN}}(f,\eta)\triangleq\max_i\left\{\max_{\eta_i\in\mathcal{S}_i}\mathcal{L}(y,f(x_i+\eta_i,x_{-i}))\right\}_{i=1}^{n_s}$$

As a simple model analysis, let's consider a binary classification problem using logistic regression. Again, two input sources $x_1=[z_1;z_3]$ and $x_2=[z_2;z_3]$ have a common feature vector $z_3$ as in the linear fusion data model. A binary classifier $\text{sgn}(f(x_1,x_2))$ is trained to predict the label $y\in\{-1,1\}$, where $f(x_1,x_2)=(w_1^Tz_1+g_1^Tz_3)+(w_2^Tz_2+g_2^Tz_3)$ and the training loss is $\mathbb{E}_{x,y}[\ell(yf(x_1,x_2))]$ with the logistic function $\ell(x)=\log(1+\exp(-x))$. Here, we apply one of the most popular attacks, the fast gradient sign (FGS) method, which was also motivated by linear models without a fusion framework (Goodfellow et al., 2015). The adversarial attack $\eta_i$ on each source $x_i$ under the norm constraint $\|\eta_i\|_\infty\le\varepsilon$ can be derived similarly:

$$\eta_1=[-\varepsilon y\,\text{sgn}(w_1);\ -\varepsilon y\,\text{sgn}(g_1)],\qquad \eta_2=[-\varepsilon y\,\text{sgn}(w_2);\ -\varepsilon y\,\text{sgn}(g_2)] \tag{7}$$
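To make the single-source FGS attack in (7) concrete, the following sketch evaluates the logistic loss of the linear fusion classifier on one sample when only one source is attacked, and takes the maximum over the two sources. It is a per-sample illustration of the AdvMaxSSN idea, not the expectation used in the analysis; all names and numbers are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgn(x):
    return (x > 0) - (x < 0)


def l1(a):
    return sum(abs(x) for x in a)


def logistic_loss(margin):
    """l(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-margin))


def fgs_attack(y, w, g, eps):
    """Perturbation of one source, eta = [-eps*y*sgn(w); -eps*y*sgn(g)]."""
    return [-eps * y * sgn(a) for a in w + g]


def score(w1, g1, w2, g2, x1, x2):
    """f(x1, x2) with x1 = [z1; z3] and x2 = [z2; z3]."""
    return dot(w1 + g1, x1) + dot(w2 + g2, x2)


def adv_max_ssn(y, w1, g1, w2, g2, x1, x2, eps):
    """Worst single-source loss on one sample under the FGS attack."""
    x1_adv = [a + b for a, b in zip(x1, fgs_attack(y, w1, g1, eps))]
    x2_adv = [a + b for a, b in zip(x2, fgs_attack(y, w2, g2, eps))]
    loss1 = logistic_loss(y * score(w1, g1, w2, g2, x1_adv, x2))
    loss2 = logistic_loss(y * score(w1, g1, w2, g2, x1, x2_adv))
    return max(loss1, loss2), loss1, loss2


# Illustrative sample and weights.
z1, z2, z3 = [0.5, -1.0], [1.0], [0.2, 0.3]
x1, x2 = z1 + z3, z2 + z3
w1, g1, w2, g2 = [1.0, -0.5], [0.4, 0.0], [2.0], [0.6, 1.0]
y, eps = 1, 0.1

clean_margin = y * score(w1, g1, w2, g2, x1, x2)
worst, loss1, loss2 = adv_max_ssn(y, w1, g1, w2, g2, x1, x2, eps)
```

Because $y\in\{-1,1\}$, attacking source $i$ lowers the margin by exactly $\varepsilon(\|w_i\|_1+\|g_i\|_1)$, which is why the source with the larger $\ell_1$ norm of weights gives the larger loss here.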

As a substitute for the linear fusion data model, let's assume the true classes are generated by the hidden relationship $y=\text{sgn}\left(\sum_{i=1}^{3}\beta_i^Tz_i\right)$. Then the optimal fusion binary classifier becomes $\text{sgn}(f_{\text{direct}}(x_1,x_2))$. Similar to the previous section, suppose the objective is to find a model that is robust against single source adversarial attacks while preserving the performance on clean data. Then the overall optimization problem can be reduced to the following one:

$$\min_{g_1,g_2}\max\{\mathcal{L}(y,f_{\text{direct}}(x_1+\eta_1,x_2)),\ \mathcal{L}(y,f_{\text{direct}}(x_1,x_2+\eta_2))\}\quad\text{s.t.}\quad g_1+g_2=\beta_3 \tag{8}$$

As $\ell$ is a decreasing function, the optimal $g_1$ and $g_2$ of the original problem are equivalent to the minimizers of the following one:

$$\varepsilon\min_{g_1,g_2}\max\{\|w_1\|_1+\|g_1\|_1,\ \|w_2\|_1+\|g_2\|_1\}\quad\text{s.t.}\quad g_1+g_2=\beta_3 \tag{9}$$

By solving this convex optimization problem, we obtain the optimal value $\mathcal{L}^*_{\text{AdvMaxSSN}}$ and the optimizers $g_1^*,g_2^*$. We can also find $\mathcal{L}'_{\text{AdvMaxSSN}}$, the AdvMaxSSN value evaluated using the model that is optimal against adversarial attacks added to all the sources at once. Interestingly, we can show that $\mathcal{L}'_{\text{AdvMaxSSN}}\ge\mathcal{L}^*_{\text{AdvMaxSSN}}$ if $\frac{\|\beta_2\|_1-\|\beta_1\|_1}{\|\beta_3\|_1}>1$, but $\mathcal{L}'_{\text{AdvMaxSSN}}=\mathcal{L}^*_{\text{AdvMaxSSN}}$ otherwise. In other words, if the inherent influences of $z_1$ and $z_2$ are well balanced compared to the common feature $z_3$ in the sense of the $\ell_1$ norm, adversarial attacks applied only to a single source can be defended equally well by a traditional adversarial training strategy that learns a model robust against attacks added to all the sources at once.

Proof.

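The original AdvMaxSSN loss minimization problem, with the additional constraint of preserving the loss on clean data, can be transformed into the problem stated in (8) thanks to the flexibility of $g_1$ and $g_2$:

$$\min_{g_1,g_2}\max\{\mathcal{L}(y,f_{\text{direct}}(x_1+\eta_1,x_2)),\ \mathcal{L}(y,f_{\text{direct}}(x_1,x_2+\eta_2))\}\quad\text{s.t.}\quad g_1+g_2=\beta_3$$

As the $\eta_i$'s are assumed to be generated with the FGS method, the adversarial attacks under the norm constraints are as follows:

$$\eta_1=[-\varepsilon y\,\text{sgn}(w_1);\ -\varepsilon y\,\text{sgn}(g_1)],\qquad \eta_2=[-\varepsilon y\,\text{sgn}(w_2);\ -\varepsilon y\,\text{sgn}(g_2)]$$

Therefore, minimizing $\mathcal{L}_{\text{AdvMaxSSN}}(f_{\text{direct}},\eta)$ over $g_1,g_2$ becomes:

$$\begin{aligned}
\min_{g_1,g_2}\max\{\ &\mathbb{E}[\ell(yf_{\text{direct}}(x_1,x_2)-\varepsilon(\|w_1\|_1+\|g_1\|_1))],\\
&\mathbb{E}[\ell(yf_{\text{direct}}(x_1,x_2)-\varepsilon(\|w_2\|_1+\|g_2\|_1))]\ \}\quad\text{s.t.}\quad g_1+g_2=\beta_3
\end{aligned}$$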

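We can solve the following problem to find the minimizers $g_1^*$ and $g_2^*$.

$$\min_{g_1,g_2}\max\{\|w_1\|_1+\|g_1\|_1,\ \|w_2\|_1+\|g_2\|_1\}\quad\text{s.t.}\quad g_1+g_2=\beta_3$$

Similar to the random noise case, substitute $g=g_1$, $v=\beta_3$, $c_1=\|\beta_1\|_1$, $c_2=\|\beta_2\|_1$, and solve the following convex optimization problem:

$$\min_{g}\max\{\|g\|_1+c_1,\ \|g-v\|_1+c_2\}$$

which can be solved by introducing $\gamma$,

$$\min_{g,\gamma}\ \gamma\quad\text{s.t.}\quad c_1+\|g\|_1-\gamma\le 0,\quad c_2+\|g-v\|_1-\gamma\le 0$$

The KKT conditions give:

$$\begin{aligned}
&\text{(Primal feasibility)} && c_1+\|g\|_1-\gamma\le 0,\quad c_2+\|g-v\|_1-\gamma\le 0\\
&\text{(Dual feasibility)} && \lambda_1\ge 0,\quad \lambda_2\ge 0\\
&\text{(Complementary slackness)} && \lambda_1\left(c_1+\|g\|_1-\gamma\right)=0,\quad \lambda_2\left(c_2+\|g-v\|_1-\gamma\right)=0\\
&\text{(Stationarity)} && \lambda_1+\lambda_2=1,\quad 0\in\lambda_1\partial\|g\|_1+\lambda_2\partial\|g-v\|_1
\end{aligned}$$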

The cases $\lambda_1=0$ and $\lambda_2=0$ arise when the three components $z_1$, $z_2$ and $z_3$ are inherently imbalanced. Consider $\lambda_2=0$, which gives $\|g\|_1+c_1-\gamma=0$ from the complementary slackness condition. The stationarity condition becomes $0\in\partial\|g\|_1$. As a subgradient of $\|g\|_1$ can be zero if and only if $g(i)=0$ for every $i$th component, the solution is $g=0$ with $\gamma=c_1$, and the necessary condition is $\|v\|_1+c_2\le c_1$. A similar solution can be found for the case $\lambda_1=0$, namely $g=v$, $\gamma=c_2$ if $\|v\|_1+c_1\le c_2$. Therefore, we have $\gamma^*=\min_{g_1}\max\{\|w_1\|_1+\|g_1\|_1,\ \|w_2\|_1+\|\beta_3-g_1\|_1\}$ and the corresponding parameters as:

$$(\gamma^*,g_1^*,g_2^*)=\begin{cases}(\|\beta_2\|_1,\ \beta_3,\ 0) & \text{if } \|\beta_1\|_1+\|\beta_3\|_1\le\|\beta_2\|_1\\ (\|\beta_1\|_1,\ 0,\ \beta_3) & \text{if } \|\beta_2\|_1+\|\beta_3\|_1\le\|\beta_1\|_1\end{cases}$$

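Now let's consider $\lambda_1\ne 0,\lambda_2\ne 0$. Denote by $q\in\lambda_1\partial\|g\|_1+\lambda_2\partial\|g-v\|_1$ an element of the subdifferential of the Lagrangian. We need to find the cases in which $q(i)=0$ holds.

(i) If $v(i)=0$, then $\text{sgn}(g(i))=\text{sgn}(g(i)-v(i))$ holds. Therefore, if $g(i)\ne 0$, the subgradient becomes $q(i)=\lambda_1\text{sgn}(g(i))+\lambda_2\text{sgn}(g(i))=\text{sgn}(g(i))$, which cannot be zero. Hence $g(i)=v(i)=0$.

(ii) If $v(i)\ne 0$, we need to consider three different sub-cases. First, if $g(i)\ne 0$ and $g(i)\ne v(i)$, then $q(i)=\lambda_1(\text{sgn}(g(i))-\text{sgn}(g(i)-v(i)))+\text{sgn}(g(i)-v(i))$. For $q(i)=0$ to hold, $\text{sgn}(g(i))=-\text{sgn}(g(i)-v(i))$ must be true with $\lambda_1=\frac{1}{2}$. This gives a solution $g(i)=\alpha_iv(i)$ with $\alpha_i\in(0,1)$.

Second, if $g(i)=0$ but $g(i)\ne v(i)$, then the subgradient is $q(i)=\lambda_1\alpha_i+(1-\lambda_1)\text{sgn}(-v(i))$ for any $\alpha_i\in[-1,1]$. Therefore, if $\alpha_i=\frac{1-\lambda_1}{\lambda_1}\text{sgn}(v(i))$ with some $\lambda_1\in[\frac{1}{2},1)$, the stationarity condition holds.

Finally, if $g(i)\ne 0$ and $g(i)=v(i)$, then $q(i)=\lambda_1\text{sgn}(g(i))+(1-\lambda_1)\alpha_i$ for any $\alpha_i\in[-1,1]$. Therefore, if $\alpha_i=-\frac{\lambda_1}{1-\lambda_1}\text{sgn}(v(i))$ with $\lambda_1\in(0,\frac{1}{2}]$, $q(i)=0$ holds and the stationarity condition is satisfied.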

All the above cases in (i) and (ii) can be restated as a combined solution $g(i)=\alpha_iv(i)$, $\alpha_i\in[0,1]$. It is easy to show that $|g(i)|+|g(i)-v(i)|=|v(i)|$ holds for any $i$. Also, $\lambda_1\ne 0,\lambda_2\ne 0$ with the complementary slackness condition gives a new constraint $\gamma=\|g\|_1+c_1=\|g-v\|_1+c_2$. Hence, we can calculate $\gamma$ by averaging the two equivalent values:

$$\gamma=\frac{1}{2}\left(c_1+c_2+\|g\|_1+\|g-v\|_1\right)=\frac{1}{2}\left(c_1+c_2+\|v\|_1\right)$$

Therefore, $(\gamma^*,g_1^*,g_2^*)=\left(\frac{1}{2}(\|\beta_1\|_1+\|\beta_2\|_1+\|\beta_3\|_1),\ \alpha\odot\beta_3,\ \beta_3-\alpha\odot\beta_3\right)$, where $\odot$ is an element-wise product and each element of $\alpha$ can have any value in $[0,1]$, i.e. $\alpha(i)\in[0,1]$.

Now, let's consider a model robust against adversarial attacks added to both sources $x_1$ and $x_2$ at the same time. This becomes the problem of minimizing $\|\beta_1\|_1+\|\beta_2\|_1+\|g_1\|_1+\|\beta_3-g_1\|_1$, and the optimal solution is achieved by $(g_1,g_2)=(\alpha\odot\beta_3,\ \beta_3-\alpha\odot\beta_3)$ for any $\alpha$ satisfying $\alpha(i)\in[0,1]$. Therefore, we can conclude that our AdvMaxSSN loss is necessary to obtain a binary classifier that is more robust against single source adversarial attacks, i.e. $\mathcal{L}^*_{\text{AdvMaxSSN}}\le\mathcal{L}'_{\text{AdvMaxSSN}}$, if $\frac{\|\beta_2\|_1-\|\beta_1\|_1}{\|\beta_3\|_1}>1$ holds. Surprisingly, if $\frac{\|\beta_2\|_1-\|\beta_1\|_1}{\|\beta_3\|_1}\le 1$ holds, so that the inherent components of the different input sources have balanced influence, then $\mathcal{L}^*_{\text{AdvMaxSSN}}=\mathcal{L}'_{\text{AdvMaxSSN}}$. In other words, if the different input sources contribute to the target variable with a certain balance, the traditional way of generating adversarial samples by considering all the sources at once can train a model that is robust against single source attacks as well. ∎
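The $\ell_1$ problem solved in this proof also admits a short closed-form sketch. The code below is an illustration with hypothetical names and values. In the balanced case it uses one uniform $\alpha$ for all components, which is just one admissible choice: it is picked so that the complementary slackness balance $c_1+\|g_1\|_1=c_2+\|g_2\|_1$ also holds.

```python
import random


def l1(a):
    return sum(abs(x) for x in a)


def adv_maxssn_solution(beta1, beta2, beta3):
    """Minimise max{|b1|_1 + |g1|_1, |b2|_1 + |g2|_1} subject to g1 + g2 = beta3."""
    c1, c2, v = l1(beta1), l1(beta2), l1(beta3)
    if c1 + v <= c2:       # lambda_1 = 0: g1 = beta3, g2 = 0
        alpha, gamma = 1.0, c2
    elif c2 + v <= c1:     # lambda_2 = 0: g1 = 0, g2 = beta3
        alpha, gamma = 0.0, c1
    else:                  # balanced case: a uniform alpha equalises both terms
        alpha = 0.5 * (1.0 + (c2 - c1) / v)
        gamma = 0.5 * (c1 + c2 + v)
    g1 = [alpha * b for b in beta3]
    g2 = [(1.0 - alpha) * b for b in beta3]
    return gamma, g1, g2


def adv_objective(g1, beta1, beta2, beta3):
    g2 = [b - g for b, g in zip(beta3, g1)]
    return max(l1(beta1) + l1(g1), l1(beta2) + l1(g2))


# Illustrative, balanced example.
beta1, beta2, beta3 = [1.0, -1.0], [0.5, 2.0], [1.0, -2.0, 0.5]
gamma_star, g1_star, g2_star = adv_maxssn_solution(beta1, beta2, beta3)

# Sanity check: random candidates for g1 should not beat gamma_star.
rng = random.Random(0)
candidates = [[rng.uniform(-3.0, 3.0) for _ in beta3] for _ in range(2000)]
best_random = min(adv_objective(g, beta1, beta2, beta3) for g in candidates)
```

The optimal value in (9) is then $\varepsilon$ times `gamma_star`; the random search only confirms that sampled alternatives do no better.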

Appendix B Additional Experimental Results

Fine-tuning

We also consider another algorithmic framework using fine-tuning. The algorithm starts with normal training on clean data for $m_{\text{clean}}$ iterations, which may include general data augmentation methods such as random cropping and flipping. Then $m_{\text{tune}}$ steps of fine-tuning are run to update only a subset of the model's parameters, $\theta_{\text{fusion}}\subset\theta_f$, so that the parts essential for extracting features from normal data are not affected. Convolutional layers that extract features from the different sources before the fusion stages are fixed, and the other layers, which fuse the features and make predictions, are updated in the fine-tuning stage. The experimental results using this method are provided in Tables 4 and 6 for the Gaussian noise case. The overall performance of the fusion model trained from scratch is better than that obtained with fine-tuning, which shows the importance of the feature extraction parts in deep learning models.
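The two-stage schedule can be sketched on a toy model. Everything below is a hypothetical stand-in, not the AVOD training code: scalar "feature extractors" and "fusion" weights replace the convolutional and fusion layers, plain SGD on a squared loss replaces the detector's training, and single-source Gaussian noise is used as an example corruption in the fine-tuning stage. The point is only to show which parameters are allowed to change in each stage.

```python
import random


def predict(params, x1, x2):
    h1 = params["extract_1"] * x1   # stand-in for the source-1 feature extractor
    h2 = params["extract_2"] * x2   # stand-in for the source-2 feature extractor
    return params["fuse_1"] * h1 + params["fuse_2"] * h2


def gradients(params, x1, x2, y):
    """Gradients of the squared error with respect to every parameter."""
    err = predict(params, x1, x2) - y
    return {
        "extract_1": 2 * err * params["fuse_1"] * x1,
        "extract_2": 2 * err * params["fuse_2"] * x2,
        "fuse_1": 2 * err * params["extract_1"] * x1,
        "fuse_2": 2 * err * params["extract_2"] * x2,
    }


def sgd(params, samples, trainable, lr=0.05):
    """One pass of SGD that updates only the parameters named in `trainable`."""
    for x1, x2, y in samples:
        grads = gradients(params, x1, x2, y)
        for name in trainable:
            params[name] -= lr * grads[name]


rng = random.Random(0)


def clean_sample():
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    return x1, x2, x1 + x2


def single_source_noisy_sample(sigma=0.5):
    x1, x2, y = clean_sample()
    if rng.random() < 0.5:          # corrupt exactly one source
        x1 += rng.gauss(0.0, sigma)
    else:
        x2 += rng.gauss(0.0, sigma)
    return x1, x2, y


params = {"extract_1": 0.5, "extract_2": 0.5, "fuse_1": 0.5, "fuse_2": 0.5}
ALL_PARAMS = list(params)
FUSION_PARAMS = ["fuse_1", "fuse_2"]   # plays the role of theta_fusion

# Stage 1: m_clean = 500 steps on clean data, all parameters updated.
sgd(params, [clean_sample() for _ in range(500)], trainable=ALL_PARAMS)
after_clean = dict(params)

# Stage 2: m_tune = 200 fine-tuning steps, only the fusion parameters updated.
sgd(params, [single_source_noisy_sample() for _ in range(200)], trainable=FUSION_PARAMS)
```

After the second stage the extractor weights are identical to their values in `after_clean`, while the fusion weights have moved. In a deep learning framework the same effect is usually obtained by excluding the frozen layers from the optimizer.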

Evaluation on ASN data

Although our main focus is corruption on a single source, a model may encounter a case where all the sources are corrupted. If the level of corruption is severe, then extracting any meaningful information from the input sources is impossible, e.g. occlusion on every sensor. However, we want our model to remain robust against reasonably corrupted input sources even though our training objective leans toward single source robustness. Therefore, we also report the model's performance on data corrupted with ASN. In most cases, the AVOD model trained with the TrainASN method, which is designed for this setting, achieves the best robustness against ASN. However, a model using element-wise mean fusion layers trained with TrainASN shows lower robustness scores than the SSN-oriented approaches. We believe that this phenomenon is caused by corrupted feature extraction combined with the structural limitation of the mean fusion layer.

Table 3: Car detection (3D/BEV) performance of AVOD with element-wise mean fusion layers against Gaussian SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 76.41 | 72.74 | 66.86 | 89.33 | 86.49 | 79.44 |
| +TrainASN | 75.96 | 66.68 | 65.97 | 88.63 | 79.45 | 78.79 |
| +TrainSSN | 76.28 | 67.10 | 66.51 | 88.86 | 79.60 | 79.11 |
| +TrainSSNAlt | 77.46 | 67.61 | 66.06 | 89.68 | 86.71 | 79.41 |
| (Gaussian ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 28.08±0.91 | 26.35±2.18 | 21.81±0.63 | 42.01±0.23 | 33.68±0.17 | 33.60±0.13 |
| +TrainASN | 61.26±0.45 | 47.71±0.24 | 45.60±0.19 | 87.40±0.07 | 72.07±2.89 | 70.13±0.05 |
| +TrainSSN | 69.33±0.43 | 55.41±0.21 | 52.90±2.12 | 88.39±0.13 | 78.37±0.10 | 70.75±0.05 |
| +TrainSSNAlt | 71.63±0.04 | 56.24±0.16 | 49.14±0.10 | 87.95±0.08 | 77.88±0.17 | 69.96±0.08 |
| (Gaussian SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 47.41±0.28 | 41.84±0.17 | 36.47±0.16 | 65.63±0.28 | 58.02±0.23 | 50.43±0.14 |
| +TrainASN | 61.53±0.57 | 52.72±0.08 | 47.25±0.13 | 87.71±0.14 | 78.37±0.06 | 77.85±0.08 |
| +TrainSSN | 71.65±0.31 | 62.14±0.08 | 56.78±0.12 | 88.21±0.08 | 78.90±0.09 | 77.92±0.11 |
| +TrainSSNAlt | 71.66±0.48 | 57.61±0.12 | 55.90±0.11 | 89.42±0.04 | 79.56±0.06 | 77.92±0.05 |
| (Gaussian SSN) | $maxDiffAP_{3D}$ (%) | | | $maxDiffAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 26.70±0.52 | 22.42±0.29 | 20.92±0.25 | 22.27±0.41 | 20.76±0.33 | 20.09±0.20 |
| +TrainASN | 14.48±0.82 | 12.72±0.33 | 11.18±0.27 | 0.88±0.22 | 0.48±0.13 | 0.28±0.12 |
| +TrainSSN | 3.71±0.46 | 3.42±0.25 | 7.50±0.25 | 0.36±0.17 | 0.04±0.15 | 0.71±0.17 |
| +TrainSSNAlt | 5.55±0.81 | 8.73±0.32 | 2.91±0.22 | 0.09±0.14 | 0.13±0.11 | 0.18±0.11 |

Table 4: Car detection (3D/BEV) performance of AVOD with element-wise mean fusion layers (trained with fine-tuning) against Gaussian SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 76.41 | 72.74 | 66.86 | 89.33 | 86.49 | 79.44 |
| +TrainASN | 62.55 | 55.81 | 55.34 | 79.08 | 69.90 | 69.83 |
| +TrainSSN | 73.50 | 65.66 | 64.74 | 88.27 | 85.65 | 78.98 |
| +TrainSSNAlt | 75.76 | 71.99 | 66.31 | 88.76 | 85.73 | 79.14 |
| (Gaussian ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 28.08±0.91 | 26.35±2.18 | 21.81±0.63 | 42.01±0.23 | 33.68±0.17 | 33.60±0.13 |
| +TrainASN | 68.58±1.93 | 54.76±0.30 | 48.00±0.29 | 83.15±3.01 | 76.10±0.069 | 68.49±0.08 |
| +TrainSSN | 60.73±0.32 | 45.52±0.19 | 44.42±0.11 | 78.24±0.10 | 68.41±0.10 | 60.45±0.07 |
| +TrainSSNAlt | 53.25±0.27 | 44.96±0.14 | 38.64±0.10 | 68.69±0.18 | 59.41±0.14 | 51.37±0.07 |
| (Gaussian SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 47.41±0.28 | 41.84±0.17 | 36.47±0.16 | 65.63±0.28 | 58.02±0.23 | 50.43±0.14 |
| +TrainASN | 52.72±0.34 | 45.66±0.24 | 39.29±0.22 | 69.33±0.21 | 60.19±0.15 | 59.66±0.15 |
| +TrainSSN | 62.46±0.48 | 53.85±0.22 | 47.62±0.14 | 77.77±0.16 | 68.71±0.09 | 67.89±0.09 |
| +TrainSSNAlt | 70.09±0.46 | 56.20±0.21 | 54.46±0.13 | 84.46±2.66 | 76.32±0.06 | 68.74±0.08 |

Table 5: Car detection (3D/BEV) performance of AVOD with latent ensemble layers (LEL) against Gaussian SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 77.79 | 67.69 | 66.31 | 88.90 | 85.64 | 78.86 |
| +TrainASN | 75.00 | 64.75 | 58.28 | 88.30 | 78.60 | 77.23 |
| +TrainSSN | 74.25 | 65.00 | 63.83 | 87.88 | 78.84 | 77.66 |
| +TrainSSNAlt | 76.04 | 66.42 | 64.41 | 88.80 | 79.53 | 78.53 |
| (Gaussian ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 46.79±0.37 | 41.46±0.27 | 36.31±0.20 | 77.40±0.34 | 67.46±0.11 | 59.53±0.11 |
| +TrainASN | 74.24±0.29 | 63.47±0.18 | 57.25±0.19 | 87.72±0.12 | 77.89±0.09 | 70.36±0.05 |
| +TrainSSN | 67.69±0.28 | 55.74±0.30 | 53.16±0.32 | 87.73±0.16 | 77.80±0.15 | 70.00±0.10 |
| +TrainSSNAlt | 63.72±0.40 | 53.15±0.29 | 48.17±0.22 | 85.36±0.08 | 75.60±0.08 | 69.17±0.03 |
| (Gaussian SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 61.97±0.55 | 53.95±0.42 | 47.24±0.27 | 79.44±0.09 | 72.46±3.14 | 68.25±0.06 |
| +TrainASN | 74.24±0.38 | 58.25±0.16 | 56.13±0.10 | 88.10±0.26 | 78.19±0.13 | 70.42±0.07 |
| +TrainSSN | 68.16±0.88 | 60.39±0.38 | 56.04±0.28 | 88.12±0.16 | 78.17±0.06 | 70.21±0.05 |
| +TrainSSNAlt | 68.63±0.40 | 55.48±0.16 | 54.42±0.17 | 86.51±0.46 | 76.85±0.11 | 71.95±2.72 |
| (Gaussian SSN) | $maxDiffAP_{3D}$ (%) | | | $maxDiffAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 3.75±2.05 | 0.98±0.55 | 5.95±0.40 | 7.28±0.37 | 4.46±3.25 | 1.25±0.13 |
| +TrainASN | 1.54±0.40 | 0.85±0.24 | 0.83±0.25 | 0.92±0.17 | 1.09±0.14 | 7.44±0.08 |
| +TrainSSN | 4.61±1.16 | 2.51±0.50 | 0.74±0.46 | 0.16±0.32 | 0.72±0.14 | 7.10±0.14 |
| +TrainSSNAlt | 4.65±1.04 | 7.88±0.46 | 2.90±0.45 | 1.12±0.71 | 1.83±0.17 | 3.42±2.84 |

Table 6: Car detection (3D/BEV) performance of AVOD with latent ensemble layers (LEL) (trained with fine-tuning) against Gaussian SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 77.79 | 67.69 | 66.31 | 88.90 | 85.64 | 78.86 |
| +TrainASN | 74.65 | 65.40 | 63.40 | 88.18 | 79.21 | 78.42 |
| +TrainSSN | 76.95 | 67.22 | 65.66 | 88.77 | 79.74 | 78.96 |
| +TrainSSNAlt | 76.81 | 67.46 | 66.12 | 88.47 | 79.62 | 78.86 |
| (Gaussian ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 46.79±0.37 | 41.46±0.27 | 36.31±0.20 | 77.40±0.34 | 67.46±0.11 | 59.53±0.11 |
| +TrainASN | 63.73±0.24 | 53.16±0.16 | 47.79±0.17 | 80.18±0.07 | 76.26±0.03 | 69.12±0.04 |
| +TrainSSN | 60.80±0.48 | 47.73±0.13 | 45.67±0.15 | 79.82±0.22 | 69.66±0.10 | 68.38±0.10 |
| +TrainSSNAlt | 52.25±1.47 | 43.77±0.62 | 37.91±0.48 | 77.51±0.12 | 67.32±0.09 | 59.65±0.10 |
| (Gaussian SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 61.97±0.55 | 53.95±0.42 | 47.24±0.27 | 79.44±0.09 | 72.46±3.14 | 68.25±0.06 |
| +TrainASN | 68.08±0.44 | 57.28±0.18 | 55.27±0.20 | 86.45±0.08 | 77.19±0.08 | 69.57±0.08 |
| +TrainSSN | 67.98±1.31 | 55.61±0.23 | 53.76±0.20 | 86.87±0.12 | 77.56±0.05 | 69.81±0.08 |
| +TrainSSNAlt | 62.76±0.41 | 52.14±0.26 | 46.55±0.13 | 85.34±2.36 | 75.72±0.04 | 68.60±0.02 |

Results on downsampling corruption

Downsampling the LIDAR sensor is important because it is not clear whether a model trained with a high-resolution sensor will still work with a low-resolution one. In fact, reducing the number of lasers of a LIDAR is directly related to its price, which is an important practical issue in deploying an actual autonomous vehicle. As the rotating LIDAR sensor used in the KITTI dataset outputs point clouds with a horizontal structure, horizontal lines of the RGB image are also set to black to match the information loss ratio of 1/4. Table 7 fully reports the performance of AVOD using our LEL when downsampling is considered as a corruption method.
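A minimal sketch of this kind of downsampling corruption is given below. It is not the authors' preprocessing code. It assumes that each LIDAR point carries a pre-computed laser (ring) index, and it reads the 1/4 ratio as keeping one out of every four lasers or image rows; the text does not specify this, so `keep_every` is left as a parameter. The function names and the toy data are hypothetical.

```python
def downsample_lidar(points, keep_every=4):
    """Keep only points from every `keep_every`-th laser.

    points: list of (x, y, z, ring) tuples, where `ring` is the index of the
    laser that produced the point (assumed to be available).
    """
    return [p for p in points if p[3] % keep_every == 0]


def blackout_image_rows(image, keep_every=4, black=(0, 0, 0)):
    """Set every image row to black except each `keep_every`-th one.

    image: list of rows, each row a list of (r, g, b) tuples.
    """
    return [
        list(row) if r % keep_every == 0 else [black] * len(row)
        for r, row in enumerate(image)
    ]


# Toy data: 64 hypothetical lasers with 3 points each, and an 8x5 white image.
points = [(float(i), 0.0, 0.0, ring) for ring in range(64) for i in range(3)]
image = [[(255, 255, 255)] * 5 for _ in range(8)]

sparse_points = downsample_lidar(points)
striped_image = blackout_image_rows(image)
```

Applying only one of the two functions corrupts a single source, while applying both corrupts all sources at once.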

Table 7: Car detection (3D/BEV) performance of AVOD with latent ensemble layers (LEL) against downsampling SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 77.79 | 67.69 | 66.31 | 88.90 | 85.64 | 78.86 |
| +TrainASN | 71.74 | 61.78 | 60.26 | 87.29 | 77.08 | 75.89 |
| +TrainSSN | 75.54 | 66.26 | 63.72 | 88.07 | 79.18 | 78.03 |
| +TrainSSNAlt | 76.22 | 66.05 | 63.87 | 89.00 | 79.65 | 78.03 |
| (Downsample ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 36.13 | 27.39 | 26.39 | 77.60 | 59.84 | 51.82 |
| +TrainASN | 71.30 | 56.04 | 49.08 | 85.66 | 70.17 | 68.55 |
| +TrainSSN | 64.88 | 48.92 | 47.06 | 86.21 | 69.26 | 61.48 |
| +TrainSSNAlt | 48.98 | 36.30 | 31.06 | 75.00 | 51.35 | 49.60 |
| (Downsample SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 61.70 | 51.66 | 46.17 | 86.08 | 69.99 | 61.55 |
| +TrainASN | 65.74 | 53.49 | 51.35 | 82.27 | 67.88 | 65.79 |
| +TrainSSN | 73.33 | 57.85 | 54.91 | 86.61 | 76.07 | 68.59 |
| +TrainSSNAlt | 64.77 | 53.34 | 48.29 | 85.27 | 69.87 | 67.77 |
| (Downsample SSN) | $maxDiffAP_{3D}$ (%) | | | $maxDiffAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 11.71 | 5.88 | 3.59 | 1.96 | 7.60 | 8.65 |
| +TrainASN | 10.00 | 11.34 | 11.76 | 6.53 | 11.23 | 12.40 |
| +TrainSSN | 0.94 | 5.71 | 3.11 | 1.74 | 2.36 | 9.00 |
| +TrainSSNAlt | 6.98 | 3.63 | 1.34 | 1.67 | 0.12 | 0.81 |

Concatenation

Our analyses in Section 3 assume a linear fusion model with a simple concatenation strategy. Therefore, we first train the AVOD model with concatenation fusion layers on clean data and then fine-tune it with different training strategies. Interestingly, the simple data augmentation strategy TrainSSNAlt does not work well in this case, and the TrainASN algorithm learns the most robust model. Unlike our simple linear model, deep learning jointly learns both the feature representation and the weights for the fusion layers. Also, concatenated convolutional features have a large number of channels, which are mixed without sparsity constraints. This may lead to a model whose joint feature representation is too complex and therefore needs stronger guidance in the optimization steps.

Table 8: Car detection (3D/BEV) performance of AVOD with concatenation fusion layers (trained with fine-tuning) against Gaussian SSN and ASN on the KITTI validation set.

| (Data) Train Algo. | Easy | Moderate | Hard | Easy | Moderate | Hard |
|---|---|---|---|---|---|---|
| (Clean Data) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 78.40 | 74.88 | 67.78 | 89.74 | 87.76 | 79.83 |
| +TrainASN | 72.89 | 63.47 | 62.22 | 88.44 | 84.97 | 78.88 |
| +TrainSSN | 76.15 | 66.79 | 65.78 | 89.02 | 86.06 | 79.29 |
| +TrainSSNAlt | 76.46 | 72.98 | 66.94 | 89.07 | 86.39 | 79.34 |
| (Gaussian ASN) | $AP_{3D}$ (%) | | | $AP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 16.50±2.27 | 15.12±0.06 | 15.06±0.08 | 25.81±0.23 | 25.38±0.18 | 17.45±0.08 |
| +TrainASN | 69.21±0.24 | 54.85±0.08 | 53.30±0.08 | 86.07±0.11 | 76.42±0.04 | 69.54±0.02 |
| +TrainSSN | 62.05±0.36 | 50.35±2.58 | 46.04±0.25 | 79.21±0.08 | 69.31±0.10 | 61.21±0.06 |
| +TrainSSNAlt | 33.86±2.85 | 27.99±0.64 | 22.59±0.60 | 42.65±0.18 | 41.77±0.18 | 34.13±0.12 |
| (Gaussian SSN) | $minAP_{3D}$ (%) | | | $minAP_{BEV}$ (%) | | |
| AVOD (Ku et al., 2018) | 31.23±0.31 | 30.27±0.13 | 30.49±0.18 | 43.04±0.16 | 42.81±0.10 | 42.96±0.08 |
| +TrainASN | 68.21±0.37 | 54.50±0.26 | 47.91±0.21 | 86.66±0.11 | 76.95±0.11 | 69.70±0.08 |
| +TrainSSN | 64.39±0.23 | 55.12±0.21 | 48.38±0.14 | 79.71±0.07 | 70.05±0.07 | 69.32±0.10 |
| +TrainSSNAlt | 44.25±0.49 | 37.23±0.44 | 37.58±0.34 | 59.06±0.12 | 51.19±0.08 | 51.28±0.06 |