Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects

  • 2018-11-28 13:39:27
  • Michael A. Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, Anh Nguyen

Abstract

Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to OoD poses of well-known objects in ImageNet. For objects that are readily recognized by DNNs in their canonical poses, DNNs incorrectly classify 97% of their pose space. In addition, DNNs are highly sensitive to slight pose perturbations. Importantly, adversarial poses transfer across models and datasets. We find that 99.9% and 99.4% of the poses misclassified by Inception-v3 also transfer to the AlexNet and ResNet-50 image classifiers trained on the same ImageNet dataset, respectively, and 75.5% transfer to the YOLOv3 object detector trained on MS COCO.

 

Affiliations: Auburn University, Adobe Inc.

1 Introduction

For real-world technologies, such as self-driving cars [9], autonomous drones [12], and search-and-rescue robots [32], the test distribution may be non-stationary, and new observations will often be out-of-distribution (OoD), i.e., not from the training distribution [37]. However, machine learning (ML) models frequently assign wrong labels with high confidence to OoD examples, such as adversarial examples [40, 27], which are inputs specially crafted by an adversary to cause a target model to misbehave. But ML models are also vulnerable to natural OoD examples [19, 2, 41, 3]. For example, when a Tesla autopilot car failed to recognize a white truck against a brightly lit sky (an unusual view that might be OoD), it crashed into the truck, killing the driver [3].

Figure 1: The Google Inception-v3 classifier [38] correctly labels the canonical poses of objects (a), but fails to recognize out-of-distribution images of objects in unusual poses (b–d), including real photographs retrieved from the Internet (d). The left 3×3 images (a–c) are found by our framework and rendered via a 3D renderer. Below each image are its top-1 predicted label and confidence score.

To understand such natural Type II classification errors, we searched for 6D poses (i.e., 3D translations and 3D rotations) of 3D objects that caused DNNs to misclassify. Our results reveal that state-of-the-art image classifiers and object detectors trained on large-scale image datasets [31, 20] misclassify most poses for many familiar training-set objects. For example, DNNs predict the front view of a school bus—an object in the ImageNet dataset [31]—extremely well (Fig. 1a) but fail to recognize the same object when it is too close or flipped over, i.e., in poses that are OoD yet exist in the real world (Fig. 1d).

Addressing this type of OoD error is a non-trivial challenge. First, objects on roads may appear in an infinite variety of poses [3, 2]. Second, these OoD poses come from known objects and should be assigned known labels rather than being rejected as unknown objects [15, 33]. Moreover, a self-driving car needs to correctly estimate at least some attributes of an incoming, unknown object (instead of simply rejecting it) to handle the situation gracefully and minimize damage.

In this paper, we propose a framework for finding OoD errors in computer vision models in which iterative optimization in the parameter space of a 3D renderer is used to estimate changes (e.g., in object geometry and appearance, lighting, background, or camera settings) that cause a target DNN to misbehave (Fig. 2). With our framework, we generated unrestricted 6D poses of 3D objects and studied how DNNs respond to 3D translations and 3D rotations of objects. For our study, we built a dataset of 3D objects corresponding to 30 ImageNet classes relevant to the self-driving car application. All code and data for our experiments will be available at https://github.com/airalcorn2/strike-with-a-pose. In addition, we will release a simple GUI tool that allows users to generate their own adversarial poses of an object.

Our main findings are:

  • ImageNet classifiers only correctly label 3.09% of the entire 6D pose space of a 3D object, and misclassify many generated adversarial examples (AXs) that are human-recognizable (Fig. 1b–c). A misclassification can be found via a change as small as 10.31°, 8.02°, and 9.17° to the yaw, pitch, and roll, respectively.

  • 99.9% and 99.4% of AXs generated against Inception-v3 transfer to the AlexNet and ResNet-50 image classifiers, respectively, and 75.5% transfer to the YOLOv3 object detector.

  • Training on adversarial poses generated from the 30 objects (in addition to the original ImageNet data) did not help DNNs generalize well to held-out objects in the same class.

In sum, our work shows that state-of-the-art DNNs perform image classification well but are still far from true object recognition. While it might be possible to improve DNN robustness through adversarial training with many more 3D objects, we hypothesize that future ML models capable of visual reasoning may instead benefit from strong 3D geometry priors.

Figure 2: To test a target DNN, we build a 3D scene (a) that consists of 3D objects (here, a school bus and a pedestrian), lighting, a background scene, and camera parameters. Our 3D renderer renders the scene into a 2D image, which the image classifier labels school bus. We can estimate the pose changes of the school bus that cause the classifier to misclassify by (1) approximating gradients via finite differences; or (2) backpropagating (red dashed line) through a differentiable renderer.

2 Framework

2.1 Problem formulation

Let f be an image classifier that maps an image $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$ onto a softmax probability distribution over 1,000 output classes [38]. Let R be a 3D renderer that takes as input a set of parameters ϕ and outputs a render, i.e., a 2D image $R(\phi) \in \mathbb{R}^{H \times W \times C}$ (see Fig. 2).

Typically, ϕ is factored into mesh vertices V, texture images T, a background image B, camera parameters C, and lighting parameters L, i.e., ϕ={V,T,B,C,L} [17]. To change the 6D pose of a given 3D object, we apply a set of 3D rotations and 3D translations, parameterized by $\mathbf{w} \in \mathbb{R}^6$, to the original vertices V, yielding a new set of vertices V*.

Here, we wish to estimate only the pose transformation parameters 𝐰 (while keeping all parameters in ϕ fixed) such that the rendered image R(𝐰;ϕ) causes the classifier f to assign the highest probability (among all outputs) to an incorrect target output at index t. Formally, we attempt to solve the below optimization problem:

$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \, f_t\big(R(\mathbf{w}; \phi)\big) \qquad (1)$$

In practice, we minimize the cross-entropy loss $\mathcal{L}$ for the target class. Eq. 1 may be solved efficiently via backpropagation if both f and R are differentiable, i.e., we are able to compute $\partial \mathcal{L} / \partial \mathbf{w}$. However, standard 3D renderers, e.g., OpenGL [43], typically include many non-differentiable operations and cannot be inverted [25]. Therefore, we attempted two approaches: (1) harnessing a recently proposed differentiable renderer and performing gradient descent using its analytical gradients; and (2) harnessing a non-differentiable renderer and approximating the gradient via finite differences.
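
The link between Eq. 1 and the cross-entropy loss can be written out explicitly. The short derivation below is a sketch that assumes a one-hot target distribution y at index t (the usual setup for a targeted attack), with $\mathcal{L}$ denoting the cross-entropy loss; the symbol y is introduced here for illustration only.

```latex
\begin{align}
\mathcal{L}(\mathbf{w})
  &= -\sum_{k=1}^{1000} y_k \log f_k\big(R(\mathbf{w};\phi)\big),
  \qquad y_k = \mathbb{1}[k = t] \\
  &= -\log f_t\big(R(\mathbf{w};\phi)\big), \\
\arg\min_{\mathbf{w}} \mathcal{L}(\mathbf{w})
  &= \arg\max_{\mathbf{w}} f_t\big(R(\mathbf{w};\phi)\big).
\end{align}
```

Because the logarithm is monotonic, minimizing this loss and maximizing the target probability in Eq. 1 select the same pose parameters.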

We will next describe the target classifier (Sec. 2.2), the renderers (Sec. 2.3), and our dataset of 3D objects (Sec. 2.4) before discussing the optimization methods (Sec. 3).

2.2 Classification networks

We chose the well-known, pre-trained Google Inception-v3 [39] DNN from the PyTorch model zoo [29] as the main image classifier for our study (the default DNN if not otherwise stated). The DNN has a 77.45% top-1 accuracy on the ImageNet ILSVRC 2012 dataset [31] of 1.2 million images corresponding to 1,000 categories.

2.3 3D renderers

Non-differentiable renderer. We chose ModernGL [1] as our non-differentiable renderer. ModernGL is a simple Python interface for the well-known OpenGL graphics engine. ModernGL supports fast, GPU-accelerated rendering.

Differentiable renderer. To enable backpropagation through the non-differentiable rasterization process, Kato et al. [17] replaced the discrete pixel color sampling step with a linear interpolation sampling scheme that admits non-zero gradients. While the approximation enables gradients to flow from the output image back to the renderer parameters ϕ, the render quality is lower than that of our non-differentiable renderer (see Fig. S1 for a comparison). Hereafter, we refer to the two renderers as NR and DR.

2.4 3D object dataset

Construction. Our main dataset consists of 30 unique 3D object models (purchased from many 3D model marketplaces) corresponding to 30 ImageNet classes relevant to a traffic environment (Fig. S2). The 30 classes include 20 vehicles (e.g., school bus and cab) and 10 street-related items (e.g., traffic light). See Sec. S1 for more details.

Each 3D object is represented as a mesh, i.e., a list of triangular faces, each defined by three vertices [25]. The 30 meshes have on average 9,908 triangles (Table S1). To maximize the realism of the rendered images, we used only 3D models that have high-quality 2D image textures. We did not choose 3D models from public datasets, e.g., ObjectNet3D [44], because most of them do not have high-quality image textures. That is, the renders of such models may be correctly classified by DNNs but still have poor realism.

Evaluation. We recognize that a reality gap will often exist between a render and a real photo. Therefore, we rigorously evaluated our renders to make sure the reality gap was acceptable for our study. From 100 initially-purchased 3D models, we selected the 30 highest-quality models using the evaluation method below.

First, we quantitatively evaluated DNN predictions on the renders. For each object, we sampled 36 unique views (common in ImageNet) evenly divided into three sets. For each set, we set the object at the origin, the up direction to (0,1,0), and the camera position to (0,0,-z) where $z \in \{4, 6, 8\}$. We sampled 12 views per set by starting the object at a 10° yaw and generating a render at every 30° yaw rotation. Across all objects and all renders, the Inception-v3 top-1 accuracy was 83.23% (compared to 77.45% on ImageNet images [38]) with a mean top-1 confidence score of 0.78 (Table S2). See Sec. S1 for more details.
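
The 36 evaluation views described above can be enumerated in a few lines. This is a minimal sketch, not the authors' code: the function name `canonical_views` and the tuple layout are hypothetical, and only the numbers (three camera distances, a 10° starting yaw, 30° steps, 12 views per set) come from the text.

```python
def canonical_views(distances=(4, 6, 8), start_yaw_deg=10, step_deg=30,
                    views_per_set=12):
    """Enumerate (camera_position, object_yaw_in_degrees) pairs.

    Hypothetical helper: one set of views per camera distance z, with the
    camera at (0, 0, -z) and the object rotated in 30-degree yaw steps.
    """
    views = []
    for z in distances:
        camera_position = (0, 0, -z)
        for k in range(views_per_set):
            yaw_deg = (start_yaw_deg + k * step_deg) % 360
            views.append((camera_position, yaw_deg))
    return views


views = canonical_views()
print(len(views), views[0], views[-1])
```

Each set covers a full turn of the object (10°, 40°, ..., 340°), so the three sets differ only in the camera distance.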

Second, we qualitatively evaluated the renders by comparing them to real photos. We produced 56 (real photo, render) pairs via three steps: (1) we retrieved real photos of an object (e.g., a car) from the Internet; (2) we replaced the object with matching background content in Adobe Photoshop; and (3) we manually rendered the 3D object on the background such that its pose closely matched that in the reference photo. Fig. S3 shows example (real photo, render) pairs. While discrepancies can be spotted in our side-by-side comparisons, we found that most of the renders passed our human visual Turing test if presented alone.

2.5 Background images

Previous studies have shown that image classifiers may be able to correctly label an image when foreground objects are removed (i.e., based on only the background content) [46]. Because the purpose of our study was to understand how DNNs recognize an object itself, a non-empty background would have hindered our interpretation of the results. Therefore, we rendered all images against a plain background with RGB values of (0.485,0.456,0.406), i.e., the mean pixel of ImageNet images. Note that the presence of a non-empty background should not alter our main qualitative findings in this paper—adversarial poses can be easily found against real background photos (Fig. 1).

3 Methods

We first describe the common pose transformations (Sec. 3.1) used in the main experiments. We were able to experiment with non-gradient methods because: (1) the pose transformation space $\mathbb{R}^6$ that we optimize in is fairly low-dimensional; and (2) although the NR is non-differentiable, its rendering speed is several orders of magnitude faster than that of DR. In addition, our preliminary results showed that the objective function considered in Eq. 1 is highly non-convex (see Fig. 4). It is therefore interesting to compare (1) random search vs. (2) gradient descent using finite-difference (FD) approximated gradients vs. (3) gradient descent using the DR gradients.

3.1 Pose transformations

We used standard computer graphics transformation matrices to change the pose of 3D objects [25]. Specifically, to rotate an object with geometry defined by a set of vertices $V = \{v_i\}$, we applied the linear transformations in Eq. 2 to each vertex $v_i \in \mathbb{R}^3$:

$$v_i^{R} = R_y R_p R_r v_i \qquad (2)$$

where $R_y$, $R_p$, and $R_r$ are the 3×3 rotation matrices for yaw, pitch, and roll, respectively (the matrices can be found in Sec. S5). We then translated the rotated object by adding a vector $T = [x_\delta, y_\delta, z_\delta]^\top$ to each vertex:

$$v_i^{R,T} = T + v_i^{R} \qquad (3)$$

In all experiments, the center $c \in \mathbb{R}^3$ of the object was constrained to be inside a sub-volume of the camera viewing frustum. That is, the x-, y-, and z-coordinates of c were within [-s,s], [-s,s], and [-28,0], respectively, with s being the maximum value that would keep c within the camera frame. Specifically, s is defined as:

$$s = d \cdot \tan(\theta_v) \qquad (4)$$

where $\theta_v$ is one half the camera’s angle of view (i.e., 8.213° in our experiments) and d is the absolute value of the difference between the camera’s z-coordinate and $z_\delta$.
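
Eqs. 2 to 4 can be illustrated with a small, dependency-free sketch. The rotation matrices themselves are given in Sec. S5, which is not reproduced here, so the axis convention below (yaw about the y/up axis, pitch about x, roll about z) is an assumption, as are all function names and the example camera position.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_yaw(t):    # assumed convention: rotation about the y (up) axis
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_pitch(t):  # assumed convention: rotation about the x axis
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_roll(t):   # assumed convention: rotation about the z axis
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform_vertex(v, yaw, pitch, roll, translation):
    """Apply Eq. 2 (rotate by R_y R_p R_r), then Eq. 3 (add T)."""
    rot = matmul3(matmul3(rot_yaw(yaw), rot_pitch(pitch)), rot_roll(roll))
    rotated = [sum(rot[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [rotated[i] + translation[i] for i in range(3)]


def max_lateral_offset(camera_z, z_delta, half_angle_deg=8.213):
    """Eq. 4: s = d * tan(theta_v), with d = |camera_z - z_delta|."""
    d = abs(camera_z - z_delta)
    return d * math.tan(math.radians(half_angle_deg))


# Quarter-turn yaw of a unit x vector, then a shift of 5 units along -z.
moved = transform_vertex([1.0, 0.0, 0.0], math.pi / 2, 0.0, 0.0,
                         [0.0, 0.0, -5.0])
# camera_z=0.0 is a placeholder; the camera position is not stated here.
s = max_lateral_offset(camera_z=0.0, z_delta=-10.0)
print(moved, s)
```

Because s grows linearly with the distance d, objects placed farther from the camera may move over a wider range of x and y while their centers stay in frame.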

3.2 Random search

In reinforcement learning problems, random search (RS) can be surprisingly effective compared to more sophisticated methods [36]. For our RS procedure, instead of iteratively following some approximated gradient to solve the optimization problem in Eq. 1, we simply randomly selected a new pose in each iteration. The rotation angles for the matrices in Eq. 2 were uniformly sampled from (0,2π). xδ, yδ, and zδ were also uniformly sampled from the ranges defined in Sec. 3.1.
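
A minimal sketch of this RS loop is shown below. It is illustrative only: `target_prob` stands in for the rendered-and-classified objective f_t(R(w;ϕ)), the toy objective is invented for the example, and `camera_z` is a placeholder because the camera position used in the main experiments is not stated in this section.

```python
import math
import random

HALF_ANGLE_RAD = math.radians(8.213)  # half the camera's angle of view (Sec. 3.1)
Z_RANGE = (-28.0, 0.0)                # allowed z offsets (Sec. 3.1)


def sample_pose(rng, camera_z):
    """Draw one pose uniformly from the ranges given in Sec. 3.1 and 3.2."""
    z = rng.uniform(*Z_RANGE)
    s = abs(camera_z - z) * math.tan(HALF_ANGLE_RAD)  # Eq. 4
    return {
        "x": rng.uniform(-s, s),
        "y": rng.uniform(-s, s),
        "z": z,
        "yaw": rng.uniform(0.0, 2.0 * math.pi),
        "pitch": rng.uniform(0.0, 2.0 * math.pi),
        "roll": rng.uniform(0.0, 2.0 * math.pi),
    }


def random_search(target_prob, n_iters, camera_z, seed=0):
    """Keep the sampled pose with the highest target probability."""
    rng = random.Random(seed)
    best_pose, best_prob = None, float("-inf")
    for _ in range(n_iters):
        pose = sample_pose(rng, camera_z)
        prob = target_prob(pose)
        if prob > best_prob:
            best_pose, best_prob = pose, prob
    return best_pose, best_prob


def toy_target_prob(pose):
    """Invented stand-in for the classifier: peaks when z is near -3."""
    return math.exp(-((pose["z"] + 3.0) ** 2))


best_pose, best_prob = random_search(toy_target_prob, n_iters=200, camera_z=2.0)
print(best_pose["z"], best_prob)
```

In the real pipeline, `target_prob` would render the posed object and return the classifier's softmax output for the target class.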

3.3 zδ-constrained random search

Our preliminary RS results suggest that the value of zδ (which is a proxy for the object’s size in the rendered image) has a large influence on a DNN’s predictions. Based on this observation, we used a zδ-constrained random search (ZRS) procedure both as an initializer for our gradient-based methods and as a naive performance baseline (for comparisons in Sec. 4.4). The ZRS procedure consisted of generating 10 random samples of (xδ,yδ,θy,θp,θr) at each of 30 evenly spaced zδ from -28 to 0.

When using ZRS for initialization, the parameter set with the maximum target probability was selected as the starting point. When using the procedure as an attack method, we first gathered the maximum target probabilities for each zδ, and then selected the best two zδ to serve as the new range for RS.
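
The two uses of ZRS described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the 30 zδ values include both endpoints (-28 and 0), uses a placeholder `camera_z`, and takes `target_prob` as a stand-in for rendering plus classification.

```python
import math
import random

HALF_ANGLE_RAD = math.radians(8.213)  # half the camera's angle of view


def zrs(target_prob, camera_z, n_z=30, samples_per_z=10, seed=0):
    """Return (initial pose, narrowed z range) from a z-constrained search."""
    rng = random.Random(seed)
    # Assumption: the grid is evenly spaced and includes both endpoints.
    z_values = [-28.0 + 28.0 * i / (n_z - 1) for i in range(n_z)]
    best_per_z = []  # one (probability, pose) pair per z value
    for z in z_values:
        s = abs(camera_z - z) * math.tan(HALF_ANGLE_RAD)  # Eq. 4
        scored = []
        for _ in range(samples_per_z):
            pose = {
                "x": rng.uniform(-s, s),
                "y": rng.uniform(-s, s),
                "z": z,
                "yaw": rng.uniform(0.0, 2.0 * math.pi),
                "pitch": rng.uniform(0.0, 2.0 * math.pi),
                "roll": rng.uniform(0.0, 2.0 * math.pi),
            }
            scored.append((target_prob(pose), pose))
        best_per_z.append(max(scored, key=lambda item: item[0]))
    # Initialization: the single best pose over all z values.
    init_pose = max(best_per_z, key=lambda item: item[0])[1]
    # Attack mode: the two best z values bound the new range for RS.
    top_two = sorted(best_per_z, key=lambda item: item[0], reverse=True)[:2]
    z_lo, z_hi = sorted(item[1]["z"] for item in top_two)
    return init_pose, (z_lo, z_hi)


def toy_target_prob(pose):
    """Invented stand-in for the classifier: peaks when z is near -3."""
    return math.exp(-((pose["z"] + 3.0) ** 2))


init_pose, z_range = zrs(toy_target_prob, camera_z=2.0)
print(init_pose["z"], z_range)
```

With the toy objective, the search settles on the grid points closest to z = -3, which shows how ZRS narrows the depth range before a finer search.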

3.4 Gradient descent with finite-difference

We calculated the first-order derivatives via finite central differences and performed vanilla gradient descent to iteratively minimize the cross-entropy loss $\mathcal{L}$ for a target class. That is, for each parameter $\mathbf{w}_i$, the partial derivative is approximated by:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_i} = \frac{\mathcal{L}\left(\mathbf{w}_i + \frac{h}{2}\right) - \mathcal{L}\left(\mathbf{w}_i - \frac{h}{2}\right)}{h} \qquad (5)$$

Although we used an h of 0.001 for all parameters, a different step size can be used per parameter. Because radians have a circular topology (i.e., a rotation of 0 radians is the same as a rotation of 2π radians, 4π radians, etc.), we parameterized each rotation angle $\theta_i$ as $(\cos(\theta_i), \sin(\theta_i))$, a technique commonly used for pose estimation [28] and inverse kinematics [10], which maps the Cartesian plane to angles via the atan2 function. Therefore, we optimized in a space of 3+2×3=9 parameters.

The approximate gradient $\nabla \mathcal{L}$ obtained from Eq. 5 served as the gradient in our gradient descent. We used the vanilla gradient descent update rule:

$$\mathbf{w} \leftarrow \mathbf{w} - \gamma \nabla \mathcal{L}(\mathbf{w}) \qquad (6)$$

with a learning rate γ of 0.001 for all parameters and optimized for 100 steps (no other stopping criteria).
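
The finite-difference procedure above (Eqs. 5 and 6 with the nine-parameter cos/sin encoding) can be sketched as below. The loss here is an invented toy function of the decoded pose, standing in for the cross-entropy of the rendered image; the function names and the ordering of the nine parameters are assumptions made for illustration.

```python
import math


def to_pose(w):
    """Decode w = [x, y, z, cos_y, sin_y, cos_p, sin_p, cos_r, sin_r]."""
    x, y, z = w[:3]
    angles = [math.atan2(w[4 + 2 * i], w[3 + 2 * i]) for i in range(3)]
    return [x, y, z] + angles


def fd_gradient(loss, w, h=1e-3):
    """Central differences (Eq. 5), one parameter at a time."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h / 2
        down[i] -= h / 2
        grad.append((loss(up) - loss(down)) / h)
    return grad


def gradient_descent(loss, w0, lr=1e-3, steps=100, h=1e-3):
    """Vanilla gradient descent (Eq. 6) with a fixed number of steps."""
    w = list(w0)
    for _ in range(steps):
        grad = fd_gradient(loss, w, h)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


TARGET_POSE = [0.0, 0.0, -3.0, 0.5, 0.0, 0.0]  # invented target for the toy loss


def toy_loss(w):
    """Invented stand-in for the cross-entropy loss of the rendered image."""
    return sum((p - t) ** 2 for p, t in zip(to_pose(w), TARGET_POSE))


start_angles = (0.3, 1.0, -0.4)
w0 = [0.5, -0.2, -10.0]
for a in start_angles:
    w0 += [math.cos(a), math.sin(a)]

w_final = gradient_descent(toy_loss, w0)
print(toy_loss(w0), toy_loss(w_final))
```

Each step costs two loss evaluations per parameter (18 renders for nine parameters), which is affordable here only because the search space is low-dimensional and the non-differentiable renderer is fast.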

4 Experiments and results

4.1 Neural networks are easily confused by object rotations and translations

Figure 3: The distributions of individual pose parameters for (a) high-confidence (p ≥ 0.7) incorrect classifications and (b) correct classifications obtained from the random sampling procedure described in Sec. 3.2. xδ and yδ have been normalized w.r.t. their corresponding s from Eq. 4.

Experiment. To test DNN robustness to object rotations and translations, we used RS to generate samples for every 3D object in our dataset. In addition, to explore the impact of lighting on DNN performance, we considered three different lighting settings: bright, medium, and dark (example renders in Fig. S10). In all three settings, both the directional light and the ambient light were white in color, i.e., had RGB values of (1.0,1.0,1.0), and the directional light was oriented at (0,-1,0) (i.e., pointing straight down). The directional light intensities and ambient light intensities were (1.2,1.6), (0.4,1.0), and (0.2,0.5) for the bright, medium, and dark settings, respectively. All other experiments used the medium lighting setting.

Misclassifications uniformly cover the pose space. For each object, we calculated the DNN accuracy (i.e., percent of correctly classified samples) across all three lighting settings (Table S5). The DNN was wrong for the vast majority of samples, i.e., the median percent of correct classifications for all 30 objects was only 3.09%. Moreover, high-confidence misclassifications (p ≥ 0.7) are largely uniformly distributed across every pose parameter (Fig. 3a), i.e., AXs can be found throughout the parameter landscape (see Fig. S15 for examples). In contrast, correctly classified examples are highly multimodal w.r.t. the rotation axis angles and heavily biased towards zδ values that are closer to the camera (Fig. 3b).

An object can be misclassified as many different labels. Previous research has shown that it is relatively easy to produce AXs corresponding to many different classes when optimizing input images [40] or 3D object textures [6], which are very high-dimensional. When finding adversarial poses, all renderer parameters, including the original object geometry and textures, are held constant, so one might expect the success rate to depend largely on the similarities between a given 3D object and examples of the target in ImageNet. Interestingly, across our 30 objects, RS discovered 990/1000 different ImageNet classes (132 of which were shared between all objects). When only considering high-confidence (p ≥ 0.7) misclassifications, our 30 objects were still misclassified into 797 different classes with a median number of 240 incorrect labels found per object (see Fig. S16 and Fig. S6 for examples). Across all adversarial poses and objects, DNNs tend to be more confident when correct than when wrong (the medians of the median probabilities were 0.41 vs. 0.21, respectively).

Figure 4: Inception-v3’s ability to correctly classify images is highly localized in the rotation and translation parameter space. (a) The classification landscape for a fire truck object when altering θr and θp and holding (xδ,yδ,zδ,θy) at (0,0,-3,π/4). Light regions correspond to correct classifications while dark regions correspond to incorrect classifications. Green and red circles indicate correct and incorrect classifications, respectively, corresponding to the fire truck object poses found in (b).

4.2 Common object classifications are shared across different lighting settings

Here, we analyze how our results generalize across different lighting conditions. From the data produced in Sec. 4.1, for each object, we calculated the DNN accuracy under each lighting setting. Then, for each object, we took the absolute difference of the accuracies for all three lighting combinations (i.e., bright vs. medium, bright vs. dark, and medium vs. dark) and recorded the maximum of those values. The median “maximum absolute difference” of accuracies for all objects was 2.29% (compared to the median accuracy of 3.09% across all lighting settings). That is, DNN accuracy is consistently low across all lighting conditions. Lighting changes would not alter the fact that DNNs are vulnerable to adversarial poses.

We also recorded the 50 most frequent classes for each object under the different lighting settings (Sb, Sm, and Sd). Then, for each object, we computed the intersection over union score oS for these sets:

$$o_S = 100 \cdot \frac{|S_b \cap S_m \cap S_d|}{|S_b \cup S_m \cup S_d|} \qquad (7)$$

The median oS for all objects was 47.10%. That is, for half of the 30 objects, the sets of 50 most frequent classes overlapped across lighting settings with an intersection-over-union score of at least 47.10%. While lighting does have an impact on DNN misclassifications (as expected), the large number of shared labels across lighting settings suggests ImageNet classes are strongly associated with certain adversarial poses regardless of lighting.
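
The intersection-over-union score in Eq. 7 maps directly onto set operations. The sketch below uses tiny invented label sets in place of the 50 most frequent classes per lighting setting, and the function name is hypothetical.

```python
def shared_label_score(s_bright, s_medium, s_dark):
    """Eq. 7: 100 * |intersection| / |union| of the three label sets."""
    intersection = s_bright & s_medium & s_dark
    union = s_bright | s_medium | s_dark
    return 100.0 * len(intersection) / len(union)


# Invented label sets for illustration only.
s_b = {"canoe", "plow", "racer"}
s_m = {"plow", "racer", "tank"}
s_d = {"racer", "tank", "jeep"}
score = shared_label_score(s_b, s_m, s_d)
print(score)
```

A score of 100 would mean the three settings produce exactly the same frequent labels, while 0 would mean no label is common to all three.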

4.3 Correct classifications are highly localized in the rotation and translation landscape

To gain some intuition for how Inception-v3 responds to rotations and translations of an object, we plotted the probability and classification landscapes for paired parameters (e.g., Fig. 4; pitch vs. roll) while holding the other parameters constant. We qualitatively observed that the DNN’s ability to recognize an object (e.g., a fire truck) in an image varies radically as the object is rotated in the world (Fig. 4).

Experiment. To quantitatively evaluate the DNN’s sensitivity to rotations and translations, we tested how it responded to single parameter disturbances. For each object, we randomly selected 100 distinct starting poses that the DNN had correctly classified in our random sampling runs. Then, for each parameter (e.g., yaw rotation angle), we randomly sampled 100 new values using the random sampling procedure described in Sec. 3.2 while holding the others constant. For each sample, we recorded whether or not the object remained correctly classified, and then computed the failure (i.e., misclassification) rate for a given (object, parameter) pair. Plots of the failure rates for all (object, parameter) combinations can be found in Fig. S18.

Additionally, for each parameter, we calculated the median of the median failure rates. That is, for each parameter, we first calculated the median failure rate for all objects, and then calculated the median of those medians for each parameter. Further, for each (object, starting pose, parameter) triple, we recorded the magnitude of the smallest parameter change that resulted in a misclassification. Then, for each (object, parameter) pair, we recorded the median of these minimum values. Finally, we again calculated the median of these medians across objects (Table 1).
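
The minimum-disturbance aggregation for one parameter can be sketched as follows. The helper names, the toy classifier, and the numbers are invented for illustration; only the aggregation order (smallest failing change per starting pose, median per object, median across objects) follows the description above.

```python
from statistics import median


def smallest_failing_change(start_value, sampled_values, still_correct):
    """Smallest |change| among sampled values that flip the prediction."""
    changes = [abs(v - start_value) for v in sampled_values
               if not still_correct(v)]
    return min(changes) if changes else None


def median_of_medians(minima_per_object):
    """Median over objects of each object's median minimum change."""
    per_object = [median(values) for values in minima_per_object.values()
                  if values]
    return median(per_object)


# Toy classifier: "correct" only while the parameter stays within 0.2 of 0.
change = smallest_failing_change(0.0, [0.1, -0.3, 0.5],
                                 lambda v: abs(v) < 0.2)

# Invented per-object lists of minimum changes (one entry per starting pose).
summary = median_of_medians({
    "bus": [0.1, 0.2, 0.3],
    "cab": [0.4, 0.5, 0.6],
    "jeep": [0.2, 0.2, 0.2],
})
print(change, summary)
```

Using medians at both levels keeps a few unusually robust or unusually fragile starting poses from dominating the reported value.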

Results. As can be seen in Table 1, the DNN is highly sensitive to all single parameter disturbances, but it is especially sensitive to disturbances along the depth (zδ), pitch (θp), and roll (θr). Note that a change in rotation as small as 8.02° can cause an object to be misclassified (see Table 1). We also observed that correctly classified poses are highly similar while misclassified poses are diverse by comparing two t-SNE plots of these two sets of poses (Fig. S4 vs. Fig. S6).

Parameter    Fail Rate (%)    Min. Δ
xδ           42               0.11
yδ           49               0.09
zδ           81               0.69
θy           69               0.18 rad (10.31°)
θp           83               0.14 rad (8.02°)
θr           81               0.16 rad (9.17°)

Table 1: The median of the median failure rates and the median of the median minimum disturbances (Min. Δ) for the single parameter sensitivity tests described in Sec. 4.3. See main text and Fig. S18 for additional information.
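
As a quick arithmetic check, the rotation entries in Table 1 are consistent with radian values converted to degrees. The dictionary name below is arbitrary; the three radian values are taken from the table.

```python
import math

# Minimum rotation disturbances from Table 1, in radians.
min_delta_rad = {"yaw": 0.18, "pitch": 0.14, "roll": 0.16}

# Convert to degrees and round to two decimals, as in the table.
min_delta_deg = {name: round(math.degrees(value), 2)
                 for name, value in min_delta_rad.items()}
print(min_delta_deg)
```

The converted values match the 10.31°, 8.02°, and 9.17° quoted for yaw, pitch, and roll.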

4.4 Optimization methods can effectively generate targeted adversarial poses

Given a challenging, highly non-convex objective landscape (Fig. 4), we wish to evaluate the effectiveness of two different types of approximate gradients at targeted attacks, i.e., finding adversarial examples misclassified as a target class [40]. Here, we compare (1) random search; (2) gradient descent with finite-difference gradients (FD-G); and (3) gradient descent with analytical, approximate gradients provided by a differentiable renderer (DR-G) [17].

Experiment. Because our adversarial pose attacks are inherently constrained by the fixed geometry and appearances of a given 3D object (see Sec. 4.1), we defined the targets to be the 50 most frequent incorrect classes found by our RS procedure for each object. For each (object, target) pair, we ran 50 optimization trials using ZRS, FD-G, and DR-G. All treatments were initialized with a pose found by the ZRS procedure and then allowed to optimize for 100 iterations.

Results. For each of the 50 optimization trials, we recorded both whether or not the target was hit and the maximum target probability obtained during the run. For each (object, target) pair, we calculated the percent of target hits and the median maximum confidence score of the target labels (see Table 2). As shown in Table 2, FD-G is substantially more effective than ZRS at generating targeted adversarial poses, having both higher median hit rates and confidence scores. In addition, we found the approximate gradients from DR to be surprisingly noisy, and DR-G largely underperformed even non-gradient methods (ZRS) (see Sec. S4).

Method    Type              Hit Rate (%)    Target Prob.
ZRS       random search     78              0.29
FD-G      gradient-based    92              0.41
DR-G      gradient-based    32              0.22

Table 2: The median percent of target hits and the median of the median target probabilities for random search (ZRS), gradient descent with finite difference gradients (FD-G), and DR gradients (DR-G). All attacks are targeted and initialized with zδ-constrained random search. DR-G is not directly comparable to FD-G and ZRS (details in Sec. S3).

4.5 Adversarial poses transfer to different image classifiers and object detectors

The most important property of previously documented AXs is that they transfer across ML models, enabling black-box attacks [45]. Here, we investigate the transferability of our adversarial poses to (a) two different image classifiers, AlexNet [18] and ResNet-50 [14], trained on the same ImageNet dataset; and (b) an object detector YOLOv3 [30] trained on the MS COCO dataset [20].

For each object, we randomly selected 1,350 AXs that were misclassified by Inception-v3 with high confidence (p ≥ 0.9) from our untargeted RS experiments in Sec. 4.1. We exposed the AXs to AlexNet and ResNet-50 and calculated their misclassification rates. We found that almost all AXs transfer with median misclassification rates of 99.9% and 99.4% for AlexNet and ResNet-50, respectively. In addition, 10.1% of AlexNet misclassifications and 27.7% of ResNet-50 misclassifications were identical to the Inception-v3 predicted labels.
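
The two transfer statistics reported above can be computed from paired predictions as in the sketch below. The function name and the example labels are invented; the real inputs would be the Inception-v3 labels and the second model's labels on the same adversarial poses of one object.

```python
def transfer_stats(true_label, source_preds, target_preds):
    """Return (misclassification rate, identical-label rate) in percent.

    source_preds: labels from the attacked model (all assumed incorrect).
    target_preds: labels from the model the poses are transferred to.
    """
    pairs = list(zip(source_preds, target_preds))
    wrong = [(s, t) for s, t in pairs if t != true_label]
    misclassification_rate = 100.0 * len(wrong) / len(pairs)
    if wrong:
        same_label_rate = 100.0 * sum(s == t for s, t in wrong) / len(wrong)
    else:
        same_label_rate = 0.0
    return misclassification_rate, same_label_rate


# Invented predictions for four adversarial poses of a school bus.
source = ["canoe", "plow", "canoe", "snowplow"]
target = ["canoe", "school bus", "plow", "snowplow"]
rate, same = transfer_stats("school bus", source, target)
print(rate, same)
```

The second number is computed only over the poses that the second model also gets wrong, which mirrors how the shared-label percentages are phrased in the text.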

There are two orthogonal hypotheses for this result. First, the ImageNet training-set images themselves may contain a strong bias towards common poses, omitting uncommon poses (Sec. S6 shows supporting evidence from a nearest-neighbor test). Second, the models themselves may not be robust to even slight disturbances of the known, in-distribution poses.

Object detectors. Previous research has shown that object detectors can be more robust to adversarial attacks than image classifiers [23]. Here, we investigate how well our AXs transfer to a state-of-the-art object detector—YOLOv3. YOLOv3 was trained on MS COCO, a dataset of bounding boxes corresponding to 80 different object classes. We only considered the 13 objects that belong to classes present in both the ImageNet and MS COCO datasets. We found that 75.5% of adversarial poses generated for Inception-v3 are also misclassified by YOLOv3 (see Sec. S2 for more details). These results suggest the adversarial pose problem transfers across datasets, models, and tasks.

4.6 Adversarial training

One of the most effective methods for defending against OoD examples has been adversarial training [13], i.e., augmenting the training set with AXs, which is also a common approach in anomaly detection [8]. Here, we test whether adversarial training can improve DNN robustness to new poses generated for (1) our 30 training-set 3D objects; and (2) seven held-out 3D objects.

Training. We augmented the original 1,000-class ImageNet dataset with an additional 30 AX classes. Each AX class included 1,350 randomly selected high-confidence (p ≥ 0.9) misclassified images split 1,300/50 into training/validation sets. Our AlexNet trained on the augmented dataset (AT) achieved a top-1 accuracy of 0.565 for the original ImageNet validation set and a top-1 accuracy of 0.967 for the AX validation set. (In the latter case, a classification was “correct” if it matched either the original ImageNet positive label or the negative, object label.)

                             PT       AT
Error (T)                    99.67    6.7
Error (H)                    99.81    89.2
High-confidence Error (T)    87.8     1.9
High-confidence Error (H)    48.2     33.3

Table 3: The median percent of misclassifications (Error) and high-confidence (i.e., p>0.7) misclassifications by the pre-trained AlexNet (PT) and our AlexNet trained with adversarial examples (AT) on random poses of training-set objects (T) and held-out objects (H).

Evaluation. To evaluate our AT model vs. a pre-trained AlexNet (PT), we used RS to generate $10^6$ samples for each of our 3D training objects. In addition, we collected seven held-out 3D objects not included in the training set that belong to the same classes as seven training-set objects (example renders in Fig. S14). We followed the same sampling procedure for the held-out objects to evaluate whether our AT generalizes to unseen objects.

For each of these 30+7=37 objects and for both the PT and our AT, we recorded two statistics: (1) the percent of misclassifications, i.e., errors; and (2) the percent of high-confidence (i.e., p ≥ 0.7) misclassifications (Table 3). Following adversarial training, the accuracy of the DNN substantially increased for known objects (Table 3; the median error fell from 99.67% to 6.7%). However, our AT still misclassified the adversarial poses of held-out objects at an 89.2% error rate.

We hypothesize that augmenting the dataset with many more 3D objects may improve DNN generalization on held-out objects. Here, AT might have used (1) the grey background to separate the 1,000 original ImageNet classes from the 30 AX classes; and (2) some non-geometric features sufficient to discriminate among only 30 objects. However, as suggested by our work (Sec. 2.4), acquiring a large-scale, high-quality 3D object dataset is costly and labor-intensive. Currently, no such public dataset exists, and thus we could not test this hypothesis.

5 Related work

Out-of-distribution detection. OoD classes, i.e., classes not found in the training set, present a significant challenge for computer vision technologies in real-world settings [33]. Here, we study an orthogonal problem: correctly classifying OoD poses of objects from known classes. While refusing to classify is a common approach for handling OoD examples [15, 33], the OoD poses in our work come from known classes and thus should be assigned correct labels.

2D adversarial examples. Numerous techniques for crafting AXs that fool image classifiers have been discovered [45]. However, previous work has typically optimized in the 2D input space [45], e.g., by synthesizing an entire image [27], a small patch [16, 11], a few pixels [7], or only a single pixel [35]. But pixel-wise changes are uncorrelated [26], so pixel-based attacks may not transfer well to the real world [22, 24] because there is an infinitesimal chance that such specifically crafted, uncorrelated pixels will be encountered in the vast physical space of camera, lighting, traffic, and weather configurations.

3D adversarial examples. Athalye et al. [6] used a 3D renderer to synthesize textures for a 3D object such that, under a wide range of camera views, the object was still rendered into an effective AX. We also used 3D renderers, but instead of optimizing textures, we optimized the poses of known objects to cause DNNs to misclassify (i.e., we kept the textures, lighting, camera settings, and background image constant).

Concurrent work. We describe below two concurrent attempts that are closely related but orthogonal to our work. First, Liu et al. [21] proposed a differentiable 3D renderer and used it to perturb both an object’s geometry and the scene’s lighting to cause a DNN to misbehave. However, their geometry perturbations were constrained to be infinitesimal so that the visibility of the vertices would not change. Therefore, their result of minutely perturbing the geometry is effectively similar to that of perturbing textures [6]. In contrast, we performed 3D rotations and 3D translations to move an object inside a 3D space (i.e., the viewing frustum of the camera).

Second, an anonymous ICLR 2019 submission [5] showed how simple rotations and translations of an image can cause DNNs to misclassify. However, these manipulations were still applied to the entire 2D image and thus do not reveal the type of adversarial poses discovered by rotating 3D objects (e.g., a flipped-over school bus; Fig. 1d).

To the best of our knowledge, our work is the first attempt to harness 3D objects to study the OoD poses of well-known training-set objects that cause state-of-the-art ImageNet classifiers and MS COCO detectors to misclassify.

6 Discussion and conclusion

In this paper, we revealed how DNNs’ understanding of objects like “school bus” and “fire truck” is quite naive—they can correctly label only a small subset of the entire pose space for 3D objects. Note that we can also find real-world OoD poses by simply taking photos of real objects (Fig. S17). We believe classifying an arbitrary pose into one of the object classes is an ill-posed task, and that the adversarial pose problem might be alleviated via multiple orthogonal approaches. The first is addressing biased data [42]. Because ImageNet and MS COCO datasets are constructed from photographs taken by people, the datasets reflect the aesthetic tendencies of their captors. Such biases can be somewhat alleviated through data augmentation, specifically, by harnessing images generated from 3D renderers [34, 4]. From the modeling view, we believe DNNs would also benefit from strong 3D geometric priors [4].

Finally, our work introduced a new promising method (Fig. 2) for testing computer vision DNNs by harnessing 3D renderers and 3D models. While we only optimize a single object here, the framework could be extended to jointly optimize lighting, background image, and multiple objects, all in one “adversarial world”. Not only does our framework enable us to enumerate test cases for DNNs, but it also serves as an interpretability tool for extracting useful insights about these black-box models’ inner functions.

Acknowledgement

AN is supported by multiple funds from Auburn University, a donation from Adobe Inc., and computing credits from Amazon AWS.

References

  • [1] Moderngl — moderngl 5.4.1 documentation. https://moderngl.readthedocs.io/en/stable/index.html. (Accessed on 11/14/2018).
  • [2] The self-driving uber that killed a pedestrian didn’t brake. here’s why. https://slate.com/technology/2018/05/uber-car-in-fatal-arizona-crash-perceived-pedestrian-1-3-seconds-before-impact.html. (Accessed on 07/13/2018).
  • [3] Tesla car on autopilot crashes, killing driver, united states news & top stories - the straits times. https://www.straitstimes.com/world/united-states/tesla-car-on-autopilot-crashes-killing-driver. (Accessed on 06/14/2018).
  • [4] H. A. Alhaija, S. K. Mustikovela, A. Geiger, and C. Rother. Geometric image synthesis. arXiv preprint arXiv:1809.04696, 2018.
  • [5] Anonymous. A rotation and a translation suffice: Fooling cnns with simple transformations. In Submitted to International Conference on Learning Representations, 2019. under review.
  • [6] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In 2018 Proceedings of the 35th International Conference on Machine Learning (ICML), pages 284–293, 2018.
  • [7] N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. In 2017 IEEE Symposium on Security and Privacy (SP), 2017.
  • [8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.
  • [9] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
  • [10] B. B. Choi and C. Lawrence. Inverse Kinematics Problem in Robotics Using Neural Networks. NASA Technical Memorandum, 105869:1–23, 1992.
  • [11] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.
  • [12] D. Gandhi, L. Pinto, and A. Gupta. Learning to fly by crashing. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3948–3955. IEEE, 2017.
  • [13] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [15] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations, 2017.
  • [16] D. Karmon, D. Zoran, and Y. Goldberg. Lavan: Localized and visible adversarial noise. arXiv preprint arXiv:1801.02608, 2018.
  • [17] H. Kato, Y. Ushiku, and T. Harada. Neural 3D Mesh Renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS 2012), pages 1097–1105, 2012.
  • [19] F. Lambert. Understanding the fatal tesla accident on autopilot and the nhtsa probe. Electrek, July, 2016.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [21] H.-T. D. Liu, M. Tao, C.-L. Li, D. Nowrouzezahrai, and A. Jacobson. Adversarial Geometry and Lighting using a Differentiable Renderer. arXiv preprint, 8 2018.
  • [22] J. Lu, H. Sibai, E. Fabry, and D. Forsyth. NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles. arXiv preprint, 7 2017.
  • [23] J. Lu, H. Sibai, E. Fabry, and D. A. Forsyth. Standard detectors aren’t (currently) fooled by physical adversarial stop signs. CoRR, abs/1710.03337, 2017.
  • [24] Y. Luo, X. Boix, G. Roig, T. Poggio, and Q. Zhao. Foveation-based Mechanisms Alleviate Adversarial Examples. arXiv preprint, 11 2015.
  • [25] S. Marschner and P. Shirley. Fundamentals of computer graphics. CRC Press, 2015.
  • [26] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, volume 2, page 7, 2017.
  • [27] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
  • [28] M. Osadchy, M. L. Miller, and Y. LeCun. Synergistic Face Detection and Pose Estimation with Energy-Based Models. In Advances in Neural Information Processing Systems, pages 1017–1024, 2005.
  • [29] PyTorch. torchvision.models — pytorch master documentation. https://pytorch.org/docs/stable/torchvision/models.html. (Accessed on 11/14/2018).
  • [30] J. Redmon and A. Farhadi. YOLOv3: An Incremental Improvement. 2018.
  • [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [32] C. Sampedro, A. Rodriguez-Ramos, H. Bavle, A. Carrio, P. de la Puente, and P. Campoy. A fully-autonomous aerial robot for search and rescue applications in indoor environments using learning-based techniques. Journal of Intelligent & Robotic Systems, pages 1–27, 2018.
  • [33] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2013.
  • [34] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, volume 2, page 5, 2017.
  • [35] J. Su, D. V. Vargas, and S. Kouichi. One Pixel Attack for Fooling Deep Neural Networks. arXiv preprint, 2017.
  • [36] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
  • [37] M. Sugiyama, N. D. Lawrence, A. Schwaighofer, et al. Dataset shift in machine learning. The MIT Press, 2017.
  • [38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 12 2016.
  • [40] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • [41] Y. Tian, K. Pei, S. Jana, and B. Ray. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314. ACM, 2018.
  • [42] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
  • [43] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL programming guide: the official guide to learning OpenGL, version 1.2. Addison-Wesley Longman Publishing Co., Inc., 1999.
  • [44] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference on Computer Vision, pages 160–176. Springer, 2016.
  • [45] X. Yuan, P. He, Q. Zhu, and X. Li. Adversarial Examples: Attacks and Defenses for Deep Learning. arXiv preprint, 2017.
  • [46] Z. Zhu, L. Xie, and A. L. Yuille. Object recognition with and without objects. arXiv preprint arXiv:1611.06596, 2016.

Supplementary materials for:

Strike (with) a Pose: Neural Networks Are Easily Fooled

by Strange Poses of Familiar Objects


S1 Extended description of the 3D object dataset and its evaluation

S1.1 Dataset construction

Classes. Our main dataset consists of 30 unique 3D object models corresponding to 30 ImageNet classes relevant to a traffic environment. The 30 classes include 20 vehicles (e.g., school bus and cab) and 10 street-related items (e.g., traffic light). See Fig. S2 for example renders of each object.

Acquisition. We collected 3D objects and constructed our own datasets for the study. 3D models with high-quality image textures were purchased from turbosquid.com, free3d.com, and cgtrader.com.

To make sure the renders were as close to real ImageNet photos as possible, we used only 3D models that had high-quality 2D image textures. We did not choose 3D models from public datasets, e.g., ObjectNet3D [44], because most of them do not have high-quality image textures. While the renders of such models may be correctly classified by DNNs, we excluded them from our study because of their poor realism. We also examined the ImageNet images to ensure they contained real-world examples qualitatively similar to each 3D object in our 3D dataset.

3D objects. Each 3D object is represented as a mesh, i.e., a list of triangular faces, each defined by three vertices [25]. The 30 meshes have on average 9,908 triangles (see Table S1 for specific numbers).

3D object Tessellated NT Original NO
ambulance 70,228 5,348
backpack 48,251 1,689
bald eagle 63,212 2,950
beach wagon 220,956 2,024
cab 53,776 4,743
cellphone 59,910 502
fire engine 93,105 8,996
forklift 130,455 5,223
garbage truck 97,482 5,778
German shepherd 88,496 88,496
golf cart 98,007 5,153
jean 17,920 17,920
jeep 191,144 2,282
minibus 193,772 1,910
minivan 271,178 1,548
motor scooter 96,638 2,356
moving van 83,712 5,055
park bench 134,162 1,972
parking meter 37,246 1,086
pickup 191,580 2,058
police van 243,132 1,984
recreational vehicle 191,532 1,870
school bus 229,584 6,244
sports car 194,406 2,406
street sign 17,458 17,458
tiger cat 107,431 3,954
tow truck 221,272 5,764
traffic light 392,001 13,840
trailer truck 526,002 5,224
umbrella 71,410 71,410
Table S1: The number of triangles for the 30 objects used in our study. N_O is the number of triangles in the original 3D object, and N_T is the number after tessellation. Across the 30 objects, the average triangle count increases 15x, from a mean N_O of 9,908 to a mean N_T of 147,849.
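As a quick check of the averages quoted in the caption, the following sketch recomputes them from the rows of Table S1. The counts are copied from the table; the dictionary layout and variable names are illustrative assumptions.

```python
# (tessellated N_T, original N_O) triangle counts copied from Table S1.
TRIANGLES = {
    "ambulance": (70228, 5348),
    "backpack": (48251, 1689),
    "bald eagle": (63212, 2950),
    "beach wagon": (220956, 2024),
    "cab": (53776, 4743),
    "cellphone": (59910, 502),
    "fire engine": (93105, 8996),
    "forklift": (130455, 5223),
    "garbage truck": (97482, 5778),
    "German shepherd": (88496, 88496),
    "golf cart": (98007, 5153),
    "jean": (17920, 17920),
    "jeep": (191144, 2282),
    "minibus": (193772, 1910),
    "minivan": (271178, 1548),
    "motor scooter": (96638, 2356),
    "moving van": (83712, 5055),
    "park bench": (134162, 1972),
    "parking meter": (37246, 1086),
    "pickup": (191580, 2058),
    "police van": (243132, 1984),
    "recreational vehicle": (191532, 1870),
    "school bus": (229584, 6244),
    "sports car": (194406, 2406),
    "street sign": (17458, 17458),
    "tiger cat": (107431, 3954),
    "tow truck": (221272, 5764),
    "traffic light": (392001, 13840),
    "trailer truck": (526002, 5224),
    "umbrella": (71410, 71410),
}

mean_nt = sum(nt for nt, _ in TRIANGLES.values()) / len(TRIANGLES)
mean_no = sum(no for _, no in TRIANGLES.values()) / len(TRIANGLES)
ratio = mean_nt / mean_no  # roughly the "15x" increase quoted in the caption

print(f"mean N_O = {mean_no:.0f}, mean N_T = {mean_nt:.0f}, ratio = {ratio:.1f}x")
```

Four objects (German shepherd, jean, street sign, umbrella) have identical counts in both columns, so the 15x figure is a ratio of the two means rather than a per-object factor.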

S1.2 Manual object tessellation for experiments using the Differentiable Renderer

In contrast to ModernGL [1], the non-differentiable renderer (NR) in our paper, the differentiable renderer (DR) by Kato et al. [17] does not perform tessellation, a standard process to increase the resolution of renders. Therefore, the render quality of the DR is lower than that of the NR. To minimize this gap and make results from the NR more comparable with those from the DR, we manually tessellated each 3D object as a pre-processing step for rendering with the DR. Using the manually tessellated objects, we then (1) evaluated the render quality of the DR (Sec. S1.3); and (2) performed research experiments with the DR (i.e., the DR-G method in Sec. 4.4).

Tessellation. We used the Quadify Mesh Modifier feature (quad size of 2%) in 3ds Max 2018 to tessellate objects, increasing the average number of faces 15x from 9,908 to 147,849 (see Table S1). The render quality after tessellation is sharper and of a higher resolution (see Fig. S1a vs. b). Note that the NR pipeline already performs tessellation for every input 3D object. Therefore, we did not perform manual tessellation for 3D objects rendered by the NR.

(a) DR without tessellation     (b) DR with tessellation      (c) NR with tessellation

Figure S1: A comparison of 3D object renders (here, ambulance and school bus) before and after tessellation.
(a) Original 3D models rendered by the differentiable renderer (DR) [17] without tessellation.
(b) DR renderings of the same objects after manual tessellation.
(c) The non-differentiable renderer (NR), i.e., ModernGL [1], renderings of the original objects.
After manual tessellation, the render quality of the DR appears to be sharper (a vs. b) and closely matches that of the NR, which also internally tessellates objects (b vs. c).

S1.3 Evaluation

We recognize that a reality gap will often exist between a render and a real photo. Therefore, we rigorously evaluated our renders to make sure the reality gap was acceptable for our study. From 100 initially-purchased 3D object models, we selected the 30 highest-quality objects that both (1) passed a visual human Turing test; and (2) were correctly recognized with high confidence by the Inception-v3 classifier [38].

S1.3.1 Qualitative evaluation

We did not use the 30 objects chosen for the main dataset (Sec. S1.1) to evaluate the general quality of the DR renderings of high-quality objects on realistic background images. Instead, we randomly chose a separate set of 17 high-quality image-textured objects for evaluation. Using the 17 objects, we generated 56 renders that matched 56 reference (real) photos. Then, we qualitatively evaluated the renders both separately and in a side-by-side comparison with real photos. Specifically, we produced 56 (real photo, render) pairs (see Fig. S3) via the following steps:

  1. We retrieved 3 real photos for each 3D object (e.g., a car) from the Internet (using descriptive information, e.g., a car’s make, model, and year).

  2. For each real photo, we replaced the object with matching background content via Adobe Photoshop’s Content-Aware Fill feature to obtain a background-only photo B (i.e., no foreground objects).

  3. We rendered the 3D object with the differentiable renderer on the background B obtained in Step 2. We then manually aligned the pose of the 3D object such that it closely matched that in the reference photo.

  4. We evaluated pairs of (photo, render) in a side-by-side comparison.

While discrepancies can be visually spotted in our side-by-side comparisons, we found that most of the renders passed our human visual Turing test if presented alone. That is, it is not easy for humans to tell whether a render is a real photo or not (if they are not primed with the reference photos). We only show pairs rendered by the DR because the NR qualitatively has a slightly higher rendering quality (Fig. S1b vs. c).

S1.3.2 Quantitative evaluation

In addition to the qualitative evaluation, we also quantitatively evaluated the top-1 accuracy of Google’s Inception-v3 [38] on renders that use either (a) an empty background or (b) real background images.

a. Evaluation of the renders of 30 objects on an empty background

Because the experiments in the main text used our self-assembled 30-object dataset (Sec. S1.1), we describe the process and the results of our quantitative evaluation for only those objects.

We rendered the objects on a white background with RGB values of (1.0, 1.0, 1.0), an ambient light intensity of 0.9, and a directional light intensity of 0.5. For each object, we sampled 36 unique views (common in ImageNet) evenly divided into three sets. For each set, we set the object at the origin, the up direction to (0,1,0), and the camera position to (0,0,-z), where z ∈ {4, 6, 8}. We sampled 12 views per set by starting the object at a 10° yaw and generating a render at every 30° of yaw rotation. Across all objects and all renders, the Inception-v3 top-1 accuracy is 83.23% (comparable to 77.45% on ImageNet images [38]) with a mean top-1 confidence score of 0.78. The top-1 and top-5 average accuracy and confidence scores are shown in Table S2.

Distance 4 6 8 Average
top-1 mean accuracy 84.2% 84.4% 81.1% 83.2%
top-5 mean accuracy 95.3% 98.6% 96.7% 96.9%
top-1 mean confidence score 0.77 0.80 0.76 0.78
Table S2: The top-1 and top-5 average accuracy and confidence scores for Inception-v3 [38] on the renders of the 30 objects in our dataset.
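The view-sampling scheme behind Table S2 (three camera distances, twelve yaw angles each) can be enumerated as in the sketch below. The constant and function names are illustrative assumptions. The yaw values are read as degrees, which fits twelve steps of 30 covering one full turn.

```python
DISTANCES = (4, 6, 8)   # camera is placed at (0, 0, -z) for each z
YAW_START_DEG = 10      # first yaw angle of every set
YAW_STEP_DEG = 30       # yaw increment between consecutive renders
VIEWS_PER_SET = 12      # 12 * 30 degrees = one full turn


def sample_views():
    """Enumerate (camera_position, yaw_in_degrees) pairs for one object."""
    views = []
    for z in DISTANCES:
        camera = (0.0, 0.0, -float(z))
        for k in range(VIEWS_PER_SET):
            yaw = (YAW_START_DEG + k * YAW_STEP_DEG) % 360
            views.append((camera, yaw))
    return views


views = sample_views()
print(len(views), views[0], views[-1])
```

Each of the three distance sets contributes one column of Table S2, and the "Average" column pools all 36 views.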

b. Evaluation of the renders of test objects on real backgrounds

In addition to our qualitative side-by-side (real photo, render) comparisons (Fig. S3), we quantitatively compared Inception-v3’s predictions for our renders to those for real photos. We found a high similarity between real photos and renders for DNN predictions. That is, across all 56 pairs (Sec. S1.3.1), the top-1 predictions match 71.43% of the time. Across all pairs, 76.06% of the top-5 labels for real photos match those for renders.
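One plausible way to compute the two agreement figures above is sketched here. The exact matching rule is not spelled out in the text, so treating the top-5 comparison as a set overlap is an assumption, as are the function names and the toy label lists. For reference, a top-1 match rate of 71.43% corresponds to 40 of the 56 pairs.

```python
def top1_match_rate(photo_preds, render_preds):
    """Percent of (photo, render) pairs whose top-1 labels are identical."""
    matches = sum(p[0] == r[0] for p, r in zip(photo_preds, render_preds))
    return 100.0 * matches / len(photo_preds)


def top5_overlap_rate(photo_preds, render_preds):
    """Percent of photo top-5 labels that also occur in the paired render's top-5."""
    shared = sum(len(set(p) & set(r)) for p, r in zip(photo_preds, render_preds))
    total = sum(len(p) for p in photo_preds)
    return 100.0 * shared / total


# Toy top-5 label lists for two pairs (made up for illustration only).
photo_preds = [
    ["cab", "limousine", "minivan", "police van", "beach wagon"],
    ["school bus", "minibus", "trolleybus", "passenger car", "streetcar"],
]
render_preds = [
    ["cab", "minivan", "beach wagon", "sports car", "convertible"],
    ["minibus", "school bus", "trolleybus", "streetcar", "moving van"],
]

top1 = top1_match_rate(photo_preds, render_preds)
top5 = top5_overlap_rate(photo_preds, render_preds)
print(top1, top5)
```

The second toy pair shows why the two numbers can diverge: the top-1 labels differ, yet four of the five labels are shared.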

Figure S2: We tested Inception-v3’s predictions on the renders generated by the differentiable renderer (DR). We show here the top-5 predictions for one random pose per object. However, in total, we generated 36 poses for each object by (1) varying the object distance to the camera; and (2) rotating the object around the yaw axis. See https://goo.gl/7LG3Cy for all the renders and DNN top-5 predictions. Across all 30 objects, on average, Inception-v3 correctly recognizes 83.2% of the renders. See Sec. S1.3.2 for more details.
Figure S3: 12 random pairs of renders (left) and real photos (right) among 56 pairs produced in total for our 3D object rendering evaluation (Sec. S1.3.1). The renders are produced by the differentiable renderer by Kato et al. [17]. More images are available at https://goo.gl/8z42zL. While discrepancies can be spotted in our side-by-side comparisons, we found that most of the renders passed our human visual Turing test if presented alone.
Figure S4: For each object, we collected 30 high-confidence (p ≥ 0.9) images correctly classified by Inception-v3. The images were generated via the random search procedure. We show here a grid t-SNE of AlexNet [18] fc7 features for all 30 objects × 30 images = 900 images. Correctly classified images for each object tend to be similar and clustered together. The original, high-resolution figure is available at https://goo.gl/TGgPgB.
To better visualize the clusters, we plotted the same t-SNE but used unique colors to denote the different 3D objects in the renders (Fig. S5). Compare and contrast this plot with the t-SNE of only misclassified poses (Figs. S6 and S7).
Figure S5: The same t-SNE found in Fig. S4 but using a unique color to denote the 3D object found in each rendered image. Here, each color also corresponds to a unique Inception-v3 label. Compare and contrast this plot with the t-SNE of only misclassified poses (Fig. S7). The original, high-resolution figure is available at https://goo.gl/TGgPgB.
Figure S6: Following the same process as described in Fig. S4, we show here a grid t-SNE of generated adversarial poses. For each object, we assembled 30 high-confidence (p ≥ 0.9) adversarial examples generated via a random search against Inception-v3 [38]. The t-SNE was generated from the AlexNet [18] fc7 features for 30 objects × 30 images = 900 images. The original, high-resolution figure is available at https://goo.gl/TGgPgB. Adversarial poses were found to be both common across different objects (e.g., the top-right corner) and unique to specific objects (e.g., the traffic sign and umbrella objects in the middle left).
To better understand how similar misclassified poses can be found across many objects, see Fig. S7. Compare and contrast this plot with the t-SNE of correctly classified poses (Figs. S4 and S5).
Figure S7: The same t-SNE as that in Fig. S6 but using a unique color to denote the 3D object used to render the adversarial image (i.e., Inception-v3’s misclassification labels are not shown here). The original, high-resolution figure is available at https://goo.gl/TGgPgB.
Compare and contrast this plot with the t-SNE of correctly classified poses (Fig. S5).

S2 Transferability from the Inception-v3 classifier to the YOLO-v3 detector

Previous research has shown that object detectors can be more robust to adversarial attacks than image classifiers [23]. Here, we investigate how well our AXs generated for an Inception-v3 classifier trained to perform 1,000-way image classification on ImageNet [31] transfer to YOLO-v3, a state-of-the-art object detector trained on MS COCO [20].

Note that while ImageNet has 1,000 classes, MS COCO has bounding boxes classified into only 80 classes. Therefore, among 30 objects, we only selected the 13 objects that (1) belong to classes found in both the ImageNet and MS COCO datasets; and (2) are also well recognized by the YOLO-v3 detector in common poses.

S2.1 Class mappings from ImageNet to MS COCO

See Table S3a for 13 mappings from ImageNet labels to MS COCO labels.

S2.2 Selecting 13 objects for the transferability test

For the transferability test (Sec. S2.3), we identified the 13 objects (out of 30) that are well detected by the YOLO-v3 detector via the two tests described below.

S2.2.1 YOLO-v3 correctly classifies 93.80% of poses generated via yaw-rotation

We rendered 36 unique views for each object by generating a render at every 30° of yaw rotation (see Sec. S1.3.2). Note that, across all objects, these yaw-rotation views have an average accuracy of 83.2% by the Inception-v3 classifier. We tested them against YOLO-v3 to see whether the detector was able to correctly find one single object per image and label it correctly. Among the 30 objects, we removed those on which YOLO-v3 had an accuracy ≤ 70%, leaving 13 for the transferability test. Across the remaining 13 objects, YOLO-v3 has an accuracy of 93.80% on average (with an NMS threshold of 0.4 and a confidence threshold of 0.5). Note that the accuracy was computed as the total number of correct labels over the total number of bounding boxes detected (i.e., we did not measure bounding-box IoU errors). See class-specific statistics in Table S3. This result shows that YOLO-v3 is substantially more accurate than Inception-v3 on the standard object poses generated by yaw-rotation (93.80% vs. 83.2%).
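The accuracy measure and the object filter described above can be sketched as follows. Function names and the toy detections are hypothetical. Dropping objects exactly at the 70% cutoff is an assumption about how the boundary is handled.

```python
ACCURACY_CUTOFF = 70.0  # percent; objects not exceeding this are dropped


def detection_accuracy(box_labels, coco_label):
    """Percent of detected boxes whose label equals the expected MS COCO label.

    Only labels are compared; box localization (IoU) is ignored, as in the text.
    """
    if not box_labels:
        return 0.0
    correct = sum(label == coco_label for label in box_labels)
    return 100.0 * correct / len(box_labels)


def keep_object(box_labels, coco_label, cutoff=ACCURACY_CUTOFF):
    """True if the object is detected reliably enough for the transfer test."""
    return detection_accuracy(box_labels, coco_label) > cutoff


# Toy detections over 36 yaw-rotation views (made up for illustration only).
kept_boxes = ["cat"] * 26 + ["dog"] * 10      # 26 of 36 boxes correct
dropped_boxes = ["cat"] * 25 + ["dog"] * 11   # 25 of 36 boxes correct

kept_acc = detection_accuracy(kept_boxes, "cat")
dropped_acc = detection_accuracy(dropped_boxes, "cat")
print(round(kept_acc, 2), keep_object(kept_boxes, "cat"))
print(round(dropped_acc, 2), keep_object(dropped_boxes, "cat"))
```

Because the denominator is the number of detected boxes rather than the number of images, extra or missing detections change the accuracy even when every image contains exactly one object.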

S2.2.2 YOLO-v3 correctly classifies 81.03% of poses correctly classified by Inception-v3

Additionally, as a sanity check, we tested whether poses correctly classified by Inception-v3 transfer well to YOLO-v3. For each object, we randomly selected 30 poses, all of which were correctly classified by Inception-v3 with high confidence (p ≥ 0.9). The images were generated via the random search procedure in the main text experiment (Sec. 3.2). Across the final 13 objects, YOLO-v3 was able to correctly detect one single object per image and label it correctly with an accuracy of 81.03% (see Table S3c).

S2.3 Transferability test: YOLO-v3 fails on 75.5% of adversarial poses misclassified by Inception-v3

For each object, we collected 1,350 random adversarial poses (i.e., incorrectly classified by Inception-v3) generated via the random search procedure (Sec. 3.2). Across all 13 objects and all adversarial poses, YOLO-v3 obtained an accuracy of only 24.50% (compared to 81.03% when tested on images correctly classified by Inception-v3). In other words, 75.5% of adversarial poses generated for Inception-v3 also escaped the detection of YOLO-v3 (see Table S3d for class-specific statistics). Our result shows adversarial poses transfer well across tasks (image classification → object detection), models (Inception-v3 → YOLO-v3), and datasets (ImageNet → MS COCO). (Footnote 3: We were not able to check how many misclassification labels by YOLO-v3 were the same as those by Inception-v3 because only a small subset of the 80 MS COCO classes overlaps with the 1,000 ImageNet classes.)

Columns: row number; (a) label mapping: ImageNet class, MS COCO class; (b) accuracy on yaw-rotation poses: #/36, acc (%); (c) accuracy on random poses: #/30, acc (%); (d) accuracy on adversarial poses: #/1350, acc (%), Δacc (%)
1 park bench bench 31 86.11 22 73.33 211 15.63 57.70
2 bald eagle bird 34 94.11 24 80.00 597 44.22 35.78
3 school bus bus 36 100.00 18 60.00 4 0.30 69.70
4 beach wagon car 34 94.44 30 100.00 232 17.19 82.81
5 tiger cat cat 26 72.22 25 83.33 181 13.41 69.93
6 German shepherd dog 32 88.89 28 93.33 406 30.07 63.26
7 motor scooter motorcycle 36 100.00 18 60.00 384 28.44 31.56
8 jean person 36 100.00 29 96.67 943 69.85 26.81
9 street sign stop sign 31 86.11 26 86.67 338 25.04 61.15
10 moving van truck 36 100.00 24 80.00 15 1.11 78.89
11 umbrella umbrella 35 97.22 25 83.33 907 67.19 16.15
12 police van car 36 100.00 25 83.33 55 4.07 79.26
13 trailer truck truck 36 100.00 22 73.33 26 1.93 71.41
Average 93.80 81.03 24.50 56.53
Table S3: Adversarial poses generated for a state-of-the-art ImageNet image classifier (here, Inception-v3) transfer well to an MS COCO detector (here, YOLO-v3). The table shows the YOLO-v3 detector’s accuracy on: (b) object poses generated by a standard process of yaw-rotating the object; (c) random poses that are all correctly classified by Inception-v3 with high confidence (p ≥ 0.9); and (d) adversarial poses, i.e., poses that are all misclassified by Inception-v3.

(a) The mappings of 13 ImageNet classes onto 11 distinct MS COCO classes (car and truck each appear twice).
(b) The accuracy (“acc (%)”) of the YOLO-v3 detector on 36 yaw-rotation poses per object.
(c) The accuracy of YOLO-v3 on 30 random poses per object that were correctly classified by Inception-v3.
(d) The accuracy of YOLO-v3 on 1,350 adversarial poses (“acc (%)”) and the differences between c and d (“Δacc (%)”).
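To make the column definitions concrete, the sketch below reproduces the park bench row of Table S3 and the overall transfer rate from the reported counts. The helper name `percent` is an illustrative assumption; the counts and the 24.50% average are taken from the table.

```python
def percent(count, total):
    """Accuracy in percent, rounded to two decimals as in Table S3."""
    return round(100.0 * count / total, 2)


# Park bench row: 22 of 30 random poses and 211 of 1,350 adversarial poses
# were labeled correctly by YOLO-v3.
acc_random = percent(22, 30)
acc_adversarial = percent(211, 1350)

# Delta-acc is the drop from column (c) to column (d).
delta_acc = round(acc_random - acc_adversarial, 2)

# Overall transfer rate: share of adversarial poses on which YOLO-v3 also
# fails, i.e. 100% minus its average accuracy on those poses.
transfer_rate = round(100.0 - 24.50, 2)

print(acc_random, acc_adversarial, delta_acc, transfer_rate)
```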

S3 Experimental setup for the differentiable renderer

For the gradient descent method (DR-G) that uses the approximate gradients provided by the differentiable renderer [17] (DR), we set up the rendering parameters in the DR to closely match those in the NR. However, there were still subtle discrepancies between the DR and the NR that made the results (DR-G vs. FD-G in Sec. 4.4) not directly comparable. Despite these discrepancies (described below), we still believe the FD gradients are more stable and informative than the DR gradients (i.e., FD-G outperformed DR-G). (Footnote 4: In preliminary experiments with only the DR (not the NR), we also empirically found FD-G to be more stable and effective than DR-G (data not shown).)

DR setup. For all experiments with the DR, the camera was centered at (0,0,16) with an up direction (0,1,0). The object’s spatial location was constrained such that the object center was always within the frame. The depth values were constrained to be within [-14,14]. Similar to experiments with the NR, we used the medium lighting setting. The ambient light color was set to white with an intensity of 1.0, while the directional light was set to white with an intensity of 0.4. Fig. S8 shows an example school bus rendered under this medium lighting at different distances.

(a) School bus at (0,0,-14)
(b) School bus at (0,0,0)
(c) School bus at (0,0,14)
Figure S8: School bus rendered by the DR at different distances.

The known discrepancies between the experimental setups of FD-G (with the NR) vs. DR-G (with the DR) are:

  1. The exact medium lighting parameters for the NR described in the main text (Sec. 4.1) did not produce similar lighting effects in the DR. Therefore, the DR lighting parameters described above were the result of manual tuning to qualitatively match the effect produced by the NR medium lighting parameters.

  2. While the NR uses a built-in tessellation procedure that automatically tessellates input objects before rendering, we had to perform an extra pre-processing step of manually tessellating each object for the DR. While small, a discrepancy still exists between the two rendering results (Fig. S1b vs. c).

S4 Gradient descent with the DR gradients

In preliminary experiments (data not shown), we found the DR gradients to be relatively noisy when using gradient descent to find targeted adversarial poses (i.e., DR-G experiments). To mitigate this problem, we experimented with (1) parameter augmentation (Sec. S4.1); and (2) multi-view optimization (Sec. S4.2). In short, we found parameter augmentation helped and used it in DR-G. However, when using the DR, we did not find multiple cameras improved optimization performance and thus only performed regular single-view optimization for DR-G.

S4.1 Parameter augmentation

We performed gradient descent using the DR gradients (DR-G) in an augmented parameter space corresponding to 50 rotations and one translation to be applied to the original object vertices. That is, we backpropagated the DR gradients into the parameters of these pre-defined transformation matrices. Note that DR-G is given the same budget of 100 steps per optimization run as FD-G and ZRS for comparison in Sec. 4.4.

The final transformation matrix is constructed by a series of rotations followed by one translation, i.e.,

$$
M = T \, R_{n-1} \, R_{n-2} \cdots R_{0}
$$

where M is the final transformation matrix, R_i (i = 0, …, n-1) are the rotation matrices, and T is the translation matrix.
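The composition above can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions, not the implementation used with the DR: the helper names (`rotation_z`, `translation`, `compose`), the choice of z-axis rotations, and the numeric values are all hypothetical.

```python
import numpy as np

def rotation_z(theta):
    """4x4 homogeneous rotation by theta radians about the z-axis (illustrative choice of axis)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(4)
    R[:2, :2] = [[c, -s], [s, c]]
    return R

def translation(t):
    """4x4 homogeneous translation by the 3-vector t."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

def compose(rotations, T):
    """Return M = T @ R_{n-1} @ ... @ R_0, where rotations = [R_0, ..., R_{n-1}]."""
    M = np.eye(4)
    for R in rotations:
        M = R @ M  # left-multiply so that R_0 acts on the vertices first
    return T @ M

# n = 50 rotations of 1 degree each, followed by one translation (hypothetical values).
n = 50
rotations = [rotation_z(np.deg2rad(1.0)) for _ in range(n)]
M = compose(rotations, translation([1.0, 2.0, 3.0]))

vertex = np.array([1.0, 0.0, 0.0, 1.0])  # a vertex in homogeneous coordinates
moved = M @ vertex
```

In an optimization setting, the angles of the individual rotations and the translation vector would be the parameters receiving gradients; here they are fixed constants purely to show the order of multiplication.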

We empirically found that increasing the number of rotations per step helped (a) improve the success rate of hitting the target labels; (b) increase the maximum confidence score of the found AXs; and (c) reduce the number of steps, i.e., led to faster convergence (see Fig. S9). Therefore, we empirically chose n=50 for all DR-G experiments reported in the main text.

(a) y-axis: success rate
(b) y-axis: max confidence
(c) y-axis: mean number of steps
Figure S9: We found that increasing the number of rotations (displayed in x-axes) per step helped:
(a) improve the success rate of hitting the target labels;
(b) increase the maximum confidence score of the found adversarial examples;
(c) reduce the average number of steps required to find an AX, i.e., led to faster convergence.

S4.2 Multi-view optimization

Additionally, we attempted to harness multiple views (from multiple cameras) to increase the chance of finding a target adversarial pose. Multi-view optimization did not outperform single-view optimization using the DR in our experiments. Therefore, we only performed regular single-view optimization for DR-G. We briefly document our negative results below.

Instead of backpropagating the DR gradient to a single camera looking at the object in the 3D scene, one may set up multiple cameras, each looking at the object from a different angle. Intuitively, this strategy allows gradients to still be backpropagated into vertices that are occluded in one view but visible in another, so having multiple cameras might improve the chance of hitting the target. We experimented with six cameras, backpropagating to all cameras in each step. However, we only updated the object following the gradient from the view that yielded the lowest loss among all views.
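The update rule described above (evaluate every view, then follow only the gradient of the lowest-loss view) can be sketched as follows. This is a toy illustration, not the DR-based pipeline: `toy_loss_and_grad` is a hypothetical stand-in for rendering a view and computing the classification loss, and the learning rate and per-view targets are arbitrary assumptions.

```python
import numpy as np

def toy_loss_and_grad(params, view_target):
    """Hypothetical stand-in for render + classify: squared distance to a per-view target."""
    diff = params - view_target
    return float(diff @ diff), 2.0 * diff

def multi_view_step(params, views, loss_and_grad, lr=0.1):
    """One gradient-descent step that follows the view with the lowest loss."""
    results = [loss_and_grad(params, v) for v in views]  # evaluate all views
    best = min(range(len(views)), key=lambda i: results[i][0])
    loss, grad = results[best]
    return params - lr * grad, best, loss

# Six toy "views", mirroring the six cameras mentioned in the text.
views = [np.array([float(k), 0.0]) for k in range(1, 7)]
params = np.zeros(2)
new_params, best_view, best_loss = multi_view_step(params, views, toy_loss_and_grad)
```

Selecting a single view per step keeps the update cost equal to the single-view case after the forward and backward passes, at the price of discarding the gradients from the other views.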

In our experiments with the DR using 100 steps per optimization run, multi-view optimization performed worse than single-view optimization in terms of both the success rate and the number of steps to converge. Due to the expensive computational cost, we did not compare all 30 objects and only report the results from optimizing two objects, bald eagle and tiger cat, in Table S4. Multi-view optimization might still outperform single-view optimization given a large enough number of steps.

             bald eagle               tiger cat
             Steps    Success rate    Steps    Success rate
Single-view  71.80    0.44            90.70    0.15
Multi-view   81.28    0.23            96.84    0.04
Table S4: Multi-view optimization performed worse than single-view optimization in both (a) the number of steps to converge and (b) success rates. We show here the results of two runs of optimizing with the bald eagle and tiger cat objects. The results are averaged over 50 target labels × 50 trials = 2,500 trials. Each optimization trial for both single- and multi-view settings is given the budget of 100 steps.

S5 3D Transformation Matrix

A rotation of θ around an arbitrary axis (x,y,z) is given by the following homogeneous transformation matrix.

$$
R = \begin{bmatrix}
x^2(1-c)+c & xy(1-c)-zs & xz(1-c)+ys & 0 \\
xy(1-c)+zs & y^2(1-c)+c & yz(1-c)-xs & 0 \\
xz(1-c)-ys & yz(1-c)+xs & z^2(1-c)+c & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{8}
$$

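where s = sin θ, c = cos θ, and the axis is normalized, i.e., x^2 + y^2 + z^2 = 1. Translation by a vector (x,y,z) is given by the following homogeneous transformation matrix.

$$
T = \begin{bmatrix}
1 & 0 & 0 & x \\
0 & 1 & 0 & y \\
0 & 0 & 1 & z \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{9}
$$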

Note that in the optimization experiments with random search (RS) and finite-difference gradients (FD-G), we dropped the homogeneous component for simplicity, i.e., the rotation matrices of yaw, pitch, and roll are all 3×3. The homogeneous component is only necessary for translation, which can be achieved via simple vector addition. However, in DR-G, we used the homogeneous component because we had some experiments interweaving translation and rotation. The matrix representation was more convenient for the DR-G experiments. As they are mathematically equivalent, this arbitrary implementation choice should not alter our results.
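The equivalence claimed above can be checked numerically. The following NumPy sketch builds the axis-angle rotation of Eq. 8 and the translation of Eq. 9, then compares the homogeneous product with a 3×3 rotation followed by vector addition. The function names and the test values are assumptions made for illustration only.

```python
import numpy as np

def axis_angle_rotation(axis, theta):
    """4x4 homogeneous rotation by theta radians about `axis` (normalized internally)."""
    x, y, z = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    c, s = np.cos(theta), np.sin(theta)
    t = 1.0 - c
    return np.array([
        [x * x * t + c,     x * y * t - z * s, x * z * t + y * s, 0.0],
        [x * y * t + z * s, y * y * t + c,     y * z * t - x * s, 0.0],
        [x * z * t - y * s, y * z * t + x * s, z * z * t + c,     0.0],
        [0.0,               0.0,               0.0,               1.0],
    ])

def translation(v):
    """4x4 homogeneous translation by the 3-vector v."""
    T = np.eye(4)
    T[:3, 3] = v
    return T

offset = np.array([1.0, 2.0, 3.0])           # hypothetical translation
R = axis_angle_rotation([0.0, 0.0, 1.0], np.pi / 2)
T = translation(offset)
p = np.array([1.0, 0.0, 0.0])                # hypothetical vertex

# Homogeneous form: rotate, then translate, in a single 4x4 product.
homogeneous = (T @ R @ np.append(p, 1.0))[:3]
# Non-homogeneous form: 3x3 rotation followed by simple vector addition.
plain = R[:3, :3] @ p + offset
```

Both forms give the same transformed vertex; the 4×4 form is mainly convenient when rotations and translations are interleaved, since every step remains a single matrix product.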

Object Accuracy (%)
ambulance 3.64
backpack 8.63
bald eagle 13.26
beach wagon 0.60
cab 2.64
cell phone 14.97
fire engine 4.31
forklift 5.20
garbage truck 4.88
German shepherd 9.61
golfcart 2.14
jean 2.71
jeep 0.29
minibus 0.83
minivan 0.66
motor scooter 20.49
moving van 0.45
park bench 5.72
parking meter 1.27
pickup 0.86
police van 0.95
recreational vehicle 2.05
school bus 3.48
sports car 2.50
street sign 26.32
tiger cat 7.36
tow truck 0.87
traffic light 14.95
trailer truck 1.27
umbrella 49.88
Table S5: The percent of three million random samples that were correctly classified by Inception-v3 [38] for each object. That is, for each lighting setting in {𝖻𝗋𝗂𝗀𝗁𝗍,𝗆𝖾𝖽𝗂𝗎𝗆,𝖽𝖺𝗋𝗄}, we generated 10^6 samples. See Sec. 3.2 for details on the sampling procedure.
(a) 𝖻𝗋𝗂𝗀𝗁𝗍
(b) 𝗆𝖾𝖽𝗂𝗎𝗆
(c) 𝖽𝖺𝗋𝗄
Figure S10: Renders of the school bus object using the NR [1] at three different lighting settings. The directional light intensities and ambient light intensities were (1.2,1.6), (0.4,1.0), and (0.2,0.5) for the 𝖻𝗋𝗂𝗀𝗁𝗍, 𝗆𝖾𝖽𝗂𝗎𝗆, and 𝖽𝖺𝗋𝗄 settings, respectively.

S6 Adversarial poses were not found in ImageNet classes via a nearest-neighbor search

We performed a nearest-neighbor search to check whether adversarial poses generated (in Sec. 4.1) can be found in the ImageNet dataset.


Retrieving nearest neighbors from a single class corresponding to the 3D object. We retrieved the five nearest training-set images for each adversarial pose (taken from a random selection of adversarial poses) using the 𝖿𝖼𝟩 feature space from a pre-trained AlexNet [18]. The Euclidean distance was used to measure the distance between two 𝖿𝖼𝟩 feature vectors. We did not find qualitatively similar images despite comparing all 1,300 class images corresponding to the 3D object used to generate the adversarial poses (e.g., cellular phone, school bus, and garbage truck in Figs. S11, S12, and S13). This result supports the hypothesis that the generated adversarial poses are out-of-distribution.
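The retrieval step described above amounts to a k-nearest-neighbor search under the Euclidean distance. A minimal sketch is given below. It assumes the feature vectors have already been extracted; random vectors stand in for the 4,096-dimensional 𝖿𝖼𝟩 activations, and the function name `nearest_neighbors` is hypothetical.

```python
import numpy as np

def nearest_neighbors(query, features, k=5):
    """Return indices and Euclidean distances of the k rows of `features` closest to `query`."""
    dists = np.linalg.norm(features - query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
# Stand-in for the features of the 1,300 training images of one class.
class_features = rng.normal(size=(1300, 4096))
# Stand-in query: a slightly perturbed copy of row 42, so the nearest neighbor is known.
query = class_features[42] + 0.01 * rng.normal(size=4096)

indices, distances = nearest_neighbors(query, class_features, k=5)
```

For a single class of 1,300 images, this brute-force search is inexpensive; the same function applies unchanged to a larger pool such as a 50,000-image validation set.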


Searching from the validation set. We also searched the entire 50,000-image validation set of ImageNet. Interestingly, we found the top-5 nearest images were sometimes from the same class as the targeted misclassification label (see Fig. S19).

Figure S11: For each adversarial example (leftmost), we retrieved the five nearest neighbors (five rightmost photos) from all 1,300 images in the cellular phone class. The Euclidean distance between a pair of images was computed in the 𝖿𝖼𝟩 feature space of a pre-trained AlexNet [18]. The nearest photos from the class are mostly different from the adversarial poses. This result supports the hypothesis that the generated adversarial poses are out-of-distribution. The original, high-resolution figure is available at https://goo.gl/X31VXh.
Figure S12: For each adversarial example (leftmost), we retrieved the five nearest neighbors (five rightmost photos) from all 1,300 images in the school bus class. The Euclidean distance between a pair of images was computed in the 𝖿𝖼𝟩 feature space of a pre-trained AlexNet [18]. The nearest photos from the class are mostly different from the adversarial poses. This result supports the hypothesis that the generated adversarial poses are out-of-distribution. The original, high-resolution figure is available at https://goo.gl/X31VXh.
Figure S13: For each adversarial example (leftmost), we retrieved the five nearest neighbors (five rightmost photos) from all 1,300 images in the garbage truck class. The Euclidean distance between a pair of images was computed in the 𝖿𝖼𝟩 feature space of a pre-trained AlexNet [18]. The nearest photos from the class are mostly different from the adversarial poses. This result supports the hypothesis that the generated adversarial poses are out-of-distribution. The original, high-resolution image is available at https://goo.gl/X31VXh.
Figure S14: In Sec. 4.6, we trained an AlexNet classifier on the 1000-class ImageNet dataset augmented with 30 additional classes that contain adversarial poses corresponding to the 30 known objects used in the main experiments. We also tested this model on 7 held-out objects. Here, we show the renders of 7 pairs of (training-set object, held-out object). The 3D objects are rendered by the NR [1] at a distance of (0,0,4). Below each image is its top-5 predictions by Inception-v3 [38]. The original, high-resolution figure is available at https://goo.gl/Li1eKU.
(a) ambulance
(b) school bus
(c) street sign
Figure S15: 30 random adversarial examples misclassified by Inception-v3 [38] with high confidence (p ≥ 0.9) generated from 3 objects: ambulance, school bus, and street sign. Below each image is the top-1 prediction label and confidence score. The original, high-resolution figures for all 30 objects are available at https://goo.gl/rvDzjy.
Figure S16: For each target class (e.g., accordion piano), we show five adversarial poses generated from five unique 3D objects. Adversarial poses are interestingly found to be homogeneous for some classes, e.g., safety pin. However, for most classes, the failure modes are heterogeneous. The original, high-resolution figure is available at https://goo.gl/37HYcE.
(a) cellular phone
(b) jeans
(c) street sign
(d) umbrella
Figure S17: Real-world, high-confidence adversarial poses can be found by taking photos from strange angles of a familiar object, here, cellular phone, jeans, street sign, and umbrella. While Inception-v3 [38] can correctly predict the object in canonical poses (the top-left image in each panel), the model misclassified the same objects in unusual poses. Below each image is its top-1 prediction label and confidence score. We took real-world videos of these four objects and extracted these misclassified poses from the videos. The original, high-resolution figures are available at https://goo.gl/zDWcjG.
Figure S18: Inception-v3 [38] is sensitive to single parameter disturbances of object poses that had originally been correctly classified. For each object, we found 100 correctly classified 6D poses via a random sampling procedure (Sec. 4.3). Given each such pose, we re-sampled one parameter (shown on top of each panel, e.g., yaw) 100 times, yielding 100 classifications, while holding the other five pose parameters constant. In each panel, for each object (e.g., ambulance), we show an error plot for all resultant 100×100=10,000 classifications. Each circle denotes the mean misclassification rate (“Fail Rate”) for each object, while the bars enclose one standard deviation. Across all objects, Inception-v3 is more sensitive to changes in yaw, pitch, roll, and depth (“z_delta”) than spatial changes (“x_delta” and “y_delta”).
Figure S19: For each adversarial example (leftmost), we retrieved the five nearest neighbors (five rightmost photos) from the 50,000-image ImageNet validation set. The Euclidean distance between a pair of images was computed in the 𝖿𝖼𝟩 feature space of a pre-trained AlexNet [18]. Below each adversarial example (AX) is its Inception-v3 [38] top-1 prediction label and confidence score. The associated ground-truth ImageNet label is beneath each retrieved photo. Here, we show an interesting, cherry-picked collection of cases where the nearest photos (in the 𝖿𝖼𝟩 feature space) are also qualitatively similar to the reference AX and sometimes come from the exact same class as the AX’s predicted label. More examples are available at https://goo.gl/8ib2PR.