Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects
Abstract
Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including natural, non-adversarial ones, which are common in real-world settings. In this paper, we present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. That is, we estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to OoD poses of well-known objects in ImageNet. For objects that are readily recognized by DNNs in their canonical poses, DNNs incorrectly classify 97% of their pose space. In addition, DNNs are highly sensitive to slight pose perturbations. Importantly, adversarial poses transfer across models and datasets. We find that 99.9% and 99.4% of the poses misclassified by Inception-v3 also transfer to the AlexNet and ResNet-50 image classifiers trained on the same ImageNet dataset, respectively, and 75.5% transfer to the YOLOv3 object detector trained on MS COCO.
1 Introduction
For real-world technologies, such as self-driving cars [9], autonomous drones [12], and search-and-rescue robots [32], the test distribution may be non-stationary, and new observations will often be out-of-distribution (OoD), i.e., not from the training distribution [37]. However, machine learning (ML) models frequently assign wrong labels with high confidence to OoD examples, such as adversarial examples [40, 27], i.e., inputs specially crafted by an adversary to cause a target model to misbehave. But ML models are also vulnerable to natural OoD examples [19, 2, 41, 3]. For example, when a Tesla autopilot car failed to recognize a white truck against a bright-lit sky (an unusual view that might be OoD), it crashed into the truck, killing the driver [3].
To understand such natural Type II classification errors, we searched for 6D poses (i.e., 3D translations and 3D rotations) of 3D objects that caused DNNs to misclassify. Our results reveal that state-of-the-art image classifiers and object detectors trained on large-scale image datasets [31, 20] misclassify most poses for many familiar training-set objects. For example, DNNs predict the front view of a school bus, an object in the ImageNet dataset [31], extremely well (Fig. 1a) but fail to recognize the same object when it is too close or flipped over, i.e., in poses that are OoD yet exist in the real world (Fig. 1d).
Addressing this type of OoD error is a non-trivial challenge. First, objects on roads may appear in an infinite variety of poses [3, 2]. Second, these OoD poses come from known objects and should be assigned known labels rather than being rejected as unknown objects [15, 33]. Moreover, a self-driving car needs to correctly estimate at least some attributes of an incoming, unknown object (instead of simply rejecting it) to handle the situation gracefully and minimize damage.
In this paper, we propose a framework for finding OoD errors in computer vision models in which iterative optimization in the parameter space of a 3D renderer is used to estimate changes (e.g., in object geometry and appearance, lighting, background, or camera settings) that cause a target DNN to misbehave (Fig. 2). With our framework, we generated unrestricted 6D poses of 3D objects and studied how DNNs respond to 3D translations and 3D rotations of objects. For our study, we built a dataset of 3D objects corresponding to 30 ImageNet classes relevant to the self-driving car application. All code and data for our experiments will be available at https://github.com/airalcorn2/strikewithapose. In addition, we will release a simple GUI tool that allows users to generate their own adversarial poses of an object.
Our main findings are:

• ImageNet classifiers only correctly label $3.09\%$ of the entire 6D pose space of a 3D object, and misclassify many generated adversarial examples (AXs) that are human-recognizable (Fig. 1b–c). A misclassification can be found via a change as small as $10.31^{\circ}$, $8.02^{\circ}$, and $9.17^{\circ}$ to the yaw, pitch, and roll, respectively.
• 99.9% and 99.4% of AXs generated against Inception-v3 transfer to the AlexNet and ResNet-50 image classifiers, respectively, and 75.5% transfer to the YOLOv3 object detector.
• Training on adversarial poses generated by the 30 objects (in addition to the original ImageNet data) did not help DNNs generalize well to held-out objects in the same class.
In sum, our work shows that state-of-the-art DNNs perform image classification well but are still far from true object recognition. While it might be possible to improve DNN robustness through adversarial training with many more 3D objects, we hypothesize that future ML models capable of visual reasoning may instead benefit from strong 3D geometry priors.
2 Framework
2.1 Problem formulation
Let $f$ be an image classifier that maps an image $\mathbf{x}\in {\mathbb{R}}^{H\times W\times C}$ onto a softmax probability distribution over 1,000 output classes [38]. Let $R$ be a 3D renderer that takes as input a set of parameters $\varphi $ and outputs a render, i.e., a 2D image $R(\varphi )\in {\mathbb{R}}^{H\times W\times C}$ (see Fig. 2).
Typically, $\varphi $ is factored into mesh vertices $V$, texture images $T$, a background image $B$, camera parameters $C$, and lighting parameters $L$, i.e., $\varphi =\{V,T,B,C,L\}$ [17]. To change the 6D pose of a given 3D object, we apply a set of 3D rotations and 3D translations, parameterized by $\mathbf{w}\in {\mathbb{R}}^{6}$, to the original vertices $V$, yielding a new set of vertices ${V}^{*}$.
Here, we wish to estimate only the pose transformation parameters $\mathbf{w}$ (while keeping all parameters in $\varphi $ fixed) such that the rendered image $R(\mathbf{w};\varphi )$ causes the classifier $f$ to assign the highest probability (among all outputs) to an incorrect target output at index $t$. Formally, we attempt to solve the below optimization problem:
$${\mathbf{w}}^{*}=\underset{\mathbf{w}}{\mathrm{arg}\mathrm{max}}({f}_{t}(R(\mathbf{w};\varphi )))$$  (1) 
In practice, we minimize the cross-entropy loss $\mathcal{L}$ for the target class. Eq. 1 may be solved efficiently via backpropagation if both $f$ and $R$ are differentiable, i.e., we are able to compute $\partial \mathcal{L}/\partial \mathbf{w}$. However, standard 3D renderers, e.g., OpenGL [43], typically include many non-differentiable operations and cannot be inverted [25]. Therefore, we attempted two approaches: (1) harnessing a recently proposed differentiable renderer and performing gradient descent using its analytical gradients; and (2) harnessing a non-differentiable renderer and approximating the gradient via finite differences.
2.2 Classification networks
We chose the well-known, pre-trained Google Inception-v3 [39] DNN from the PyTorch model zoo [29] as the main image classifier for our study (the default DNN if not otherwise stated). The DNN has a 77.45% top-1 accuracy on the ImageNet ILSVRC 2012 dataset [31] of 1.2 million images corresponding to 1,000 categories.
2.3 3D renderers
Non-differentiable renderer. We chose ModernGL [1] as our non-differentiable renderer. ModernGL is a simple Python interface for the well-known OpenGL graphics engine. ModernGL supports fast, GPU-accelerated rendering.
Differentiable renderer. To enable backpropagation through the non-differentiable rasterization process, Kato et al. [17] replaced the discrete pixel color sampling step with a linear interpolation sampling scheme that admits non-zero gradients. While the approximation enables gradients to flow from the output image back to the renderer parameters $\varphi $, the render quality is lower than that of our non-differentiable renderer (see Fig. S1 for a comparison). Hereafter, we refer to the two renderers as NR and DR.
2.4 3D object dataset
Construction. Our main dataset consists of 30 unique 3D object models (purchased from many 3D model marketplaces) corresponding to 30 ImageNet classes relevant to a traffic environment (Fig. S2). The 30 classes include 20 vehicles (e.g., school bus and cab) and 10 street-related items (e.g., traffic light). See Sec. S1 for more details.
Each 3D object is represented as a mesh, i.e., a list of triangular faces, each defined by three vertices [25]. The 30 meshes have on average 9,908 triangles (Table S1). To maximize the realism of the rendered images, we used only 3D models that have high-quality 2D image textures. We did not choose 3D models from public datasets, e.g., ObjectNet3D [44], because most of them do not have high-quality image textures. That is, the renders of such models may be correctly classified by DNNs but still have poor realism.
Evaluation. We recognize that a reality gap will often exist between a render and a real photo. Therefore, we rigorously evaluated our renders to make sure the reality gap was acceptable for our study. From $\sim $100 initially purchased 3D models, we selected the 30 highest-quality models using the evaluation method below.
First, we quantitatively evaluated DNN predictions on the renders. For each object, we sampled 36 unique views (common in ImageNet) evenly divided into three sets. For each set, we set the object at the origin, the up direction to $(0,1,0)$, and the camera position to $(0,0,z)$ where $z=\{4,6,8\}$. We sampled 12 views per set by starting the object at a ${10}^{\circ}$ yaw and generating a render at every ${30}^{\circ}$ yaw rotation. Across all objects and all renders, the Inception-v3 top-1 accuracy was $83.23\%$ (compared to $77.45\%$ on ImageNet images [38]) with a mean top-1 confidence score of $0.78$ (Table S2). See Sec. S1 for more details.
Second, we qualitatively evaluated the renders by comparing them to real photos. We produced 56 (real photo, render) pairs via three steps: (1) we retrieved real photos of an object (e.g., a car) from the Internet; (2) we replaced the object with matching background content in Adobe Photoshop; and (3) we manually rendered the 3D object on the background such that its pose closely matched that in the reference photo. Fig. S3 shows example (real photo, render) pairs. While discrepancies can be spotted in our side-by-side comparisons, we found that most of the renders passed our human visual Turing test if presented alone.
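The view-sampling scheme described above can be sketched as a small enumeration. This is only an illustration under stated assumptions: the function name `canonical_views` and the dictionary keys are hypothetical, and no rendering is performed.

```python
def canonical_views(distances=(4, 6, 8), start_yaw_deg=10, step_deg=30,
                    views_per_set=12):
    """Enumerate the evaluation views: one set per camera distance z,
    with the object yawed in fixed increments starting from start_yaw_deg."""
    views = []
    for z in distances:
        for k in range(views_per_set):
            views.append({
                "camera_position": (0, 0, z),   # camera placed on the z-axis
                "up": (0, 1, 0),                # up direction from the text
                "yaw_deg": (start_yaw_deg + k * step_deg) % 360,
            })
    return views


views = canonical_views()
```

With the default arguments this yields 3 sets of 12 yaw angles (10°, 40°, ..., 340°), i.e., 36 views per object.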
2.5 Background images
Previous studies have shown that image classifiers may be able to correctly label an image when foreground objects are removed (i.e., based on only the background content) [46]. Because the purpose of our study was to understand how DNNs recognize an object itself, a nonempty background would have hindered our interpretation of the results. Therefore, we rendered all images against a plain background with RGB values of $(0.485,0.456,0.406)$, i.e., the mean pixel of ImageNet images. Note that the presence of a nonempty background should not alter our main qualitative findings in this paper—adversarial poses can be easily found against real background photos (Fig. 1).
3 Methods
We first describe the common pose transformations (Sec. 3.1) used in the main experiments. We were able to experiment with non-gradient methods because: (1) the pose transformation space ${\mathbb{R}}^{6}$ that we optimize in is fairly low-dimensional; and (2) although the NR is non-differentiable, its rendering speed is several orders of magnitude faster than that of DR. In addition, our preliminary results showed that the objective function considered in Eq. 1 is highly non-convex (see Fig. 4). Therefore, it is interesting to compare (1) random search vs. (2) gradient descent using finite-difference (FD) approximated gradients vs. (3) gradient descent using the DR gradients.
3.1 Pose transformations
We used standard computer graphics transformation matrices to change the pose of 3D objects [25]. Specifically, to rotate an object with geometry defined by a set of vertices $V=\{{v}_{i}\}$, we applied the linear transformations in Eq. 2 to each vertex ${v}_{i}\in {\mathbb{R}}^{3}$:
$${v}_{i}^{R}={R}_{y}{R}_{p}{R}_{r}{v}_{i}$$  (2) 
where ${R}_{y}$, ${R}_{p}$, and ${R}_{r}$ are the $3\times 3$ rotation matrices for yaw, pitch, and roll, respectively (the matrices can be found in Sec. S5). We then translated the rotated object by adding a vector $T={\left[{x}_{\delta}\ \ {y}_{\delta}\ \ {z}_{\delta}\right]}^{\top}$ to each vertex:
$${v}_{i}^{R,T}=T+{v}_{i}^{R}$$  (3) 
In all experiments, the center $c\in {\mathbb{R}}^{3}$ of the object was constrained to be inside a sub-volume of the camera viewing frustum. That is, the $x$-, $y$-, and $z$-coordinates of $c$ were within $[-s,s]$, $[-s,s]$, and $[-28,0]$, respectively, with $s$ being the maximum value that would keep $c$ within the camera frame. Specifically, $s$ is defined as:
$$s=d\cdot \mathrm{tan}({\theta}_{v})$$  (4) 
where ${\theta}_{v}$ is one half the camera’s angle of view (i.e., $8.213^{\circ}$ in our experiments) and $d$ is the absolute value of the difference between the camera’s $z$-coordinate and ${z}_{\delta}$.
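As a rough illustration of Eqs. 2–4, the NumPy sketch below rotates and translates a vertex array and computes the bound $s$. The axis conventions (yaw about $y$, pitch about $x$, roll about $z$) are an assumption, since the actual matrices are given in Sec. S5; the function names and the camera $z$-coordinate used in the example are likewise hypothetical.

```python
import numpy as np


def rotation_matrix(yaw, pitch, roll):
    """Compose R_y R_p R_r (Eq. 2). Axis conventions are assumed here."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])   # about y
    R_p = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])   # about x
    R_r = np.array([[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]])   # about z
    return R_y @ R_p @ R_r


def transform_vertices(V, yaw, pitch, roll, translation):
    """Apply Eq. 2 then Eq. 3 to an (N, 3) array of vertices."""
    R = rotation_matrix(yaw, pitch, roll)
    # Row-vector form of v_i^R = R v_i, followed by the translation T.
    return V @ R.T + np.asarray(translation, dtype=float)


def max_offset(camera_z, z_delta, half_angle_deg=8.213):
    """Eq. 4: s = d * tan(theta_v), with d = |camera_z - z_delta|."""
    d = abs(camera_z - z_delta)
    return d * np.tan(np.radians(half_angle_deg))


# Example: a quarter-turn yaw followed by a translation along -z.
V = np.array([[1.0, 0.0, 0.0]])
moved = transform_vertices(V, np.pi / 2, 0.0, 0.0, [0.0, 0.0, -5.0])
```

Because $s$ scales with $d$, objects placed farther from the camera are allowed larger lateral offsets while still staying inside the frame.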
3.2 Random search
In reinforcement learning problems, random search (RS) can be surprisingly effective compared to more sophisticated methods [36]. For our RS procedure, instead of iteratively following some approximated gradient to solve the optimization problem in Eq. 1, we simply randomly selected a new pose in each iteration. The rotation angles for the matrices in Eq. 2 were uniformly sampled from $(0,2\pi )$. ${x}_{\delta}$, ${y}_{\delta}$, and ${z}_{\delta}$ were also uniformly sampled from the ranges defined in Sec. 3.1.
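One RS draw could look like the sketch below. It assumes that ${z}_{\delta}$ is drawn first so that the bound $s$ for ${x}_{\delta}$ and ${y}_{\delta}$ can be computed from it; the sampling order, the function name `sample_pose`, and the default camera $z$-coordinate are assumptions rather than details taken from the text.

```python
import numpy as np

HALF_ANGLE = np.radians(8.213)  # one half of the camera's angle of view


def sample_pose(rng, camera_z=2.0, z_range=(-28.0, 0.0)):
    """Draw one random pose (x, y, z, yaw, pitch, roll).

    camera_z is a placeholder value; the camera position is not
    specified in this section.
    """
    z_delta = rng.uniform(*z_range)
    s = abs(camera_z - z_delta) * np.tan(HALF_ANGLE)   # Eq. 4
    x_delta, y_delta = rng.uniform(-s, s, size=2)
    yaw, pitch, roll = rng.uniform(0.0, 2.0 * np.pi, size=3)
    return np.array([x_delta, y_delta, z_delta, yaw, pitch, roll])


rng = np.random.default_rng(0)
poses = np.array([sample_pose(rng) for _ in range(1000)])
```

Each sampled pose would then be rendered and classified; no gradient information is used at any point.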
3.3 ${z}_{\delta}$-constrained random search
Our preliminary RS results suggest the value of ${z}_{\delta}$ (which is a proxy for the object’s size in the rendered image) has a large influence on a DNN’s predictions. Based on this observation, we used a ${z}_{\delta}$-constrained random search (ZRS) procedure both as an initializer for our gradient-based methods and as a naive performance baseline (for comparisons in Sec. 4.4). The ZRS procedure consisted of generating 10 random samples of $({x}_{\delta},{y}_{\delta},{\theta}_{y},{\theta}_{p},{\theta}_{r})$ at each of 30 evenly spaced ${z}_{\delta}$ from $-28$ to $0$.
When using ZRS for initialization, the parameter set with the maximum target probability was selected as the starting point. When using the procedure as an attack method, we first gathered the maximum target probabilities for each ${z}_{\delta}$, and then selected the best two ${z}_{\delta}$ to serve as the new range for RS.
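A minimal sketch of the ZRS procedure follows. The callables `target_prob` (standing in for rendering a pose and reading off the target-class probability) and `sample_at_depth` (drawing the remaining five parameters at a fixed depth) are hypothetical placeholders, as is the function name itself.

```python
import numpy as np


def z_constrained_random_search(target_prob, sample_at_depth, rng,
                                n_depths=30, n_per_depth=10,
                                z_range=(-28.0, 0.0)):
    """Return the best pose, its target probability, and the z-range
    spanned by the two best depths (used as the new range for RS)."""
    best_pose, best_prob = None, -np.inf
    best_prob_per_depth = []
    for z_delta in np.linspace(z_range[0], z_range[1], n_depths):
        depth_best = -np.inf
        for _ in range(n_per_depth):
            pose = sample_at_depth(rng, z_delta)
            prob = target_prob(pose)
            depth_best = max(depth_best, prob)
            if prob > best_prob:                 # initializer: global best
                best_pose, best_prob = pose, prob
        best_prob_per_depth.append((depth_best, float(z_delta)))
    # Attack mode: keep the two depths with the highest target probability.
    top_two = sorted(best_prob_per_depth, reverse=True)[:2]
    new_z_range = tuple(sorted(z for _, z in top_two))
    return best_pose, best_prob, new_z_range
```

With the default arguments the procedure evaluates $30\times 10=300$ poses, which is cheap relative to a full random search over the whole depth range.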
3.4 Gradient descent with finite-difference
We calculated the first-order derivatives via finite central differences and performed vanilla gradient descent to iteratively minimize the cross-entropy loss $\mathcal{L}$ for a target class. That is, for each parameter ${\mathbf{w}}_{i}$, the partial derivative is approximated by:
$$\frac{\partial \mathcal{L}}{\partial {\mathbf{w}}_{i}}=\frac{\mathcal{L}({\mathbf{w}}_{i}+\frac{h}{2})-\mathcal{L}({\mathbf{w}}_{i}-\frac{h}{2})}{h}$$  (5) 
Although we used an $h$ of 0.001 for all parameters, a different step size can be used per parameter. Because radians have a circular topology (i.e., a rotation of 0 radians is the same as a rotation of $2\pi $ radians, $4\pi $ radians, etc.), we parameterized each rotation angle ${\theta}_{i}$ as $(\mathrm{cos}({\theta}_{i}),\mathrm{sin}({\theta}_{i}))$, a technique commonly used for pose estimation [28] and inverse kinematics [10], which maps the Cartesian plane to angles via the $\mathrm{atan2}$ function. Therefore, we optimized in a space of $3+2\times 3=9$ parameters.
The approximate gradient $\nabla \mathcal{L}$ obtained from Equation (5) served as the gradient in our gradient descent. We used the vanilla gradient descent update rule:
$$\mathbf{w}:=\mathbf{w}-\gamma \nabla \mathcal{L}(\mathbf{w})$$  (6) 
with a learning rate $\gamma $ of 0.001 for all parameters and optimized for $100$ steps (no other stopping criteria).
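The finite-difference update of Eqs. 5 and 6, together with the $(\cos ,\sin )$ angle encoding, can be sketched as follows. The `loss` callable is a stand-in for rendering a pose and computing the cross-entropy loss; the function names and the ordering of the 9 parameters are assumptions.

```python
import numpy as np


def central_difference_gradient(loss, w, h=0.001):
    """Approximate dL/dw_i with Eq. 5, one parameter at a time."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h / 2.0
        grad[i] = (loss(w + step) - loss(w - step)) / h
    return grad


def gradient_descent(loss, w0, lr=0.001, steps=100, h=0.001):
    """Vanilla gradient descent (Eq. 6) with a fixed number of steps."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w = w - lr * central_difference_gradient(loss, w, h)
    return w


def encode_pose(translation, angles):
    """Map (x, y, z) and 3 angles to 3 + 2 * 3 = 9 parameters."""
    angles = np.asarray(angles, dtype=float)
    return np.concatenate([np.asarray(translation, dtype=float),
                           np.cos(angles), np.sin(angles)])


def decode_angles(w):
    """Recover the angles from the (cos, sin) pairs via atan2."""
    return np.arctan2(w[6:9], w[3:6])
```

Each gradient evaluation costs two loss evaluations (i.e., two renders) per parameter, so 18 per step in the 9-parameter space, which is only practical because the renderer is fast.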
4 Experiments and results
4.1 Neural networks are easily confused by object rotations and translations
Experiment. To test DNN robustness to object rotations and translations, we used RS to generate samples for every 3D object in our dataset. In addition, to explore the impact of lighting on DNN performance, we considered three different lighting settings: $\mathsf{bright}$, $\mathsf{medium}$, and $\mathsf{dark}$ (example renders in Fig. S10). In all three settings, both the directional light and the ambient light were white in color, i.e., had RGB values of $(1.0,1.0,1.0)$, and the directional light was oriented at $(0,-1,0)$ (i.e., pointing straight down). The directional light intensities and ambient light intensities were $(1.2,1.6)$, $(0.4,1.0)$, and $(0.2,0.5)$ for the $\mathsf{bright}$, $\mathsf{medium}$, and $\mathsf{dark}$ settings, respectively. All other experiments used the $\mathsf{medium}$ lighting setting.
Misclassifications uniformly cover the pose space. For each object, we calculated the DNN accuracy (i.e., percent of correctly classified samples) across all three lighting settings (Table S5). The DNN was wrong for the vast majority of samples, i.e., the median percent of correct classifications for all 30 objects was only 3.09%. Moreover, high-confidence misclassifications ($p\ge 0.7$) are largely uniformly distributed across every pose parameter (Fig. 2(a)), i.e., AXs can be found throughout the parameter landscape (see Fig. S15 for examples). In contrast, correctly classified examples are highly multimodal w.r.t. the rotation axis angles and heavily biased towards ${z}_{\delta}$ values that are closer to the camera (Fig. 2(b)).
An object can be misclassified as many different labels. Previous research has shown that it is relatively easy to produce AXs corresponding to many different classes when optimizing input images [40] or 3D object textures [6], which are very high-dimensional. When finding adversarial poses, one might expect the success rate to depend largely on the similarities between a given 3D object and examples of the target in ImageNet, because all renderer parameters, including the original object geometry and textures, are held constant. Interestingly, across our 30 objects, RS discovered $990/1000$ different ImageNet classes (132 of which were shared between all objects). When only considering high-confidence ($p\ge 0.7$) misclassifications, our 30 objects were still misclassified into $797$ different classes with a median number of 240 incorrect labels found per object (see Fig. S16 and Fig. S6 for examples). Across all adversarial poses and objects, DNNs tend to be more confident when correct than when wrong (the median of median probabilities was 0.41 vs. 0.21, respectively).
4.2 Common object classifications are shared across different lighting settings
Here, we analyze how our results generalize across different lighting conditions. From the data produced in Sec. 4.1, for each object, we calculated the DNN accuracy under each lighting setting. Then, for each object, we took the absolute difference of the accuracies for all three lighting combinations (i.e., $\mathsf{bright}$ vs. $\mathsf{medium}$, $\mathsf{bright}$ vs. $\mathsf{dark}$, and $\mathsf{medium}$ vs. $\mathsf{dark}$) and recorded the maximum of those values. The median “maximum absolute difference” of accuracies for all objects was 2.29% (compared to the median accuracy of $3.09\%$ across all lighting settings). That is, DNN accuracy is consistently low across all lighting conditions. Lighting changes would not alter the fact that DNNs are vulnerable to adversarial poses.
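The per-object "maximum absolute difference" statistic can be sketched as below. The accuracy values in the example are made-up placeholders for illustration only, not results from the paper, and the function name is hypothetical.

```python
from itertools import combinations


def max_abs_accuracy_gap(accuracy_by_lighting):
    """Largest absolute accuracy difference over all pairs of lighting
    settings for a single object."""
    values = accuracy_by_lighting.values()
    return max(abs(a - b) for a, b in combinations(values, 2))


# Illustrative accuracies (in percent) for one hypothetical object.
example = {"bright": 3.0, "medium": 4.5, "dark": 2.0}
gap = max_abs_accuracy_gap(example)
```

The reported figure would then be the median of this quantity over the 30 objects.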
We also recorded the 50 most frequent classes for each object under the different lighting settings (${S}_{b}$, ${S}_{m}$, and ${S}_{d}$). Then, for each object, we computed the intersection over union score ${o}_{S}$ for these sets:
$${o}_{S}=100\cdot \frac{|{S}_{b}\cap {S}_{m}\cap {S}_{d}|}{|{S}_{b}\cup {S}_{m}\cup {S}_{d}|}$$  (7) 
The median ${o}_{S}$ for all objects was 47.10%. That is, for 15 out of 30 objects, 47.10% of the 50 most frequent classes were shared across lighting settings. While lighting does have an impact on DNN misclassifications (as expected), the large number of shared labels across lighting settings suggests ImageNet classes are strongly associated with certain adversarial poses regardless of lighting.
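The intersection-over-union score of Eq. 7 maps directly onto Python set operations. The sketch below uses small integer sets as hypothetical class labels; the function name is an assumption.

```python
def overlap_score(s_b, s_m, s_d):
    """Eq. 7: percentage of labels shared by all three lighting settings,
    relative to the labels seen under any of them."""
    union = s_b | s_m | s_d
    if not union:
        return 0.0
    return 100.0 * len(s_b & s_m & s_d) / len(union)


# One shared label out of five distinct labels.
score = overlap_score({1, 2, 3}, {2, 3, 4}, {3, 4, 5})
```

In the setting described above, each set would hold the 50 most frequent predicted classes for one object under one lighting setting.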
4.3 Correct classifications are highly localized in the rotation and translation landscape
To gain some intuition for how Inception-v3 responds to rotations and translations of an object, we plotted the probability and classification landscapes for paired parameters (e.g., Fig. 4; pitch vs. roll) while holding the other parameters constant. We qualitatively observed that the DNN’s ability to recognize an object (e.g., a fire truck) in an image varies radically as the object is rotated in the world (Fig. 4).
Experiment. To quantitatively evaluate the DNN’s sensitivity to rotations and translations, we tested how it responded to single parameter disturbances. For each object, we randomly selected 100 distinct starting poses that the DNN had correctly classified in our random sampling runs. Then, for each parameter (e.g., yaw rotation angle), we randomly sampled 100 new values (using the random sampling procedure described in Sec. 3.2) while holding the others constant. For each sample, we recorded whether or not the object remained correctly classified, and then computed the failure (i.e., misclassification) rate for a given (object, parameter) pair. Plots of the failure rates for all (object, parameter) combinations can be found in Fig. S18.
Additionally, for each parameter, we calculated the median of the median failure rates. That is, for each parameter, we first calculated the median failure rate for all objects, and then calculated the median of those medians for each parameter. Further, for each (object, starting pose, parameter) triple, we recorded the magnitude of the smallest parameter change that resulted in a misclassification. Then, for each (object, parameter) pair, we recorded the median of these minimum values. Finally, we again calculated the median of these medians across objects (Table 1).
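The aggregation described above can be sketched as two small helpers. The function names are hypothetical, and the change magnitude is taken as a plain absolute difference, which is an assumption (for rotation angles, a circular distance may be more appropriate).

```python
import numpy as np


def single_parameter_stats(start_value, new_values, still_correct):
    """Failure rate (%) and smallest misclassifying change for one
    (object, starting pose, parameter) triple."""
    new_values = np.asarray(new_values, dtype=float)
    failed = ~np.asarray(still_correct, dtype=bool)
    fail_rate = 100.0 * failed.mean()
    if failed.any():
        min_delta = float(np.abs(new_values[failed] - start_value).min())
    else:
        min_delta = float("nan")   # no misclassification was observed
    return fail_rate, min_delta


def median_of_medians(per_object_values):
    """Median across objects of each object's median statistic."""
    return float(np.median([np.median(v) for v in per_object_values]))


rate, delta = single_parameter_stats(
    1.0, [1.1, 2.0, 0.5, 1.05], [True, False, False, True])
```

Using medians at both levels keeps a few unusually robust or fragile objects from dominating the summary.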
Results. As can be seen in Table 1, the DNN is highly sensitive to all single parameter disturbances, but it is especially sensitive to disturbances along the depth (${z}_{\delta}$), pitch (${\theta}_{p}$), and roll (${\theta}_{r}$). Note that a change in rotation as small as $8.02^{\circ}$ can cause an object to be misclassified (see Table 1). We also observed that correctly classified poses are highly similar while misclassified poses are diverse by comparing two t-SNE plots of these two sets of poses (Fig. S4 vs. Fig. S6).

| Parameter | Fail Rate (%) | Min. $\mathrm{\Delta}$ |
|---|---|---|
| ${x}_{\delta}$ | 42 | 0.11 |
| ${y}_{\delta}$ | 49 | 0.09 |
| ${z}_{\delta}$ | 81 | 0.69 |
| ${\theta}_{y}$ | 69 | 0.18 ($10.31^{\circ}$) |
| ${\theta}_{p}$ | 83 | 0.14 ($8.02^{\circ}$) |
| ${\theta}_{r}$ | 81 | 0.16 ($9.17^{\circ}$) |

4.4 Optimization methods can effectively generate targeted adversarial poses
Given a challenging, highly non-convex objective landscape (Fig. 4), we wish to evaluate the effectiveness of two different types of approximate gradients at targeted attacks, i.e., finding adversarial examples misclassified as a target class [40]. Here, we compare (1) random search; (2) gradient descent with finite-difference gradients (FDG); and (3) gradient descent with analytical, approximate gradients provided by a differentiable renderer (DRG) [17].
Experiment. Because our adversarial pose attacks are inherently constrained by the fixed geometry and appearances of a given 3D object (see Sec. 4.1), we defined the targets to be the 50 most frequent incorrect classes found by our RS procedure for each object. For each (object, target) pair, we ran 50 optimization trials using ZRS, FDG, and DRG. All treatments were initialized with a pose found by the ZRS procedure and then allowed to optimize for 100 iterations.
Results. For each of the 50 optimization trials, we recorded both whether or not the target was hit and the maximum target probability obtained during the run. For each (object, target) pair, we calculated the percent of target hits and the median maximum confidence score of the target labels (see Table 2). As shown in Table 2, FDG is substantially more effective than ZRS at generating targeted adversarial poses, having both higher median hit rates and confidence scores. In addition, we found the approximate gradients from DR to be surprisingly noisy, and DRG largely underperformed even nongradient methods (ZRS) (see Sec. S4).

| Method | Type | Hit Rate (%) | Target Prob. |
|---|---|---|---|
| ZRS | random search | 78 | 0.29 |
| FDG | gradient-based | 92 | 0.41 |
| DRG${}^{\dagger}$ | gradient-based | 32 | 0.22 |

4.5 Adversarial poses transfer to different image classifiers and object detectors
The most important property of previously documented AXs is that they transfer across ML models, enabling black-box attacks [45]. Here, we investigate the transferability of our adversarial poses to (a) two different image classifiers, AlexNet [18] and ResNet-50 [14], trained on the same ImageNet dataset; and (b) an object detector YOLOv3 [30] trained on the MS COCO dataset [20].
For each object, we randomly selected 1,350 AXs that were misclassified by Inception-v3 with high confidence ($p\ge 0.9$) from our untargeted RS experiments in Sec. 4.1. We exposed the AXs to AlexNet and ResNet-50 and calculated their misclassification rates. We found that almost all AXs transfer with median misclassification rates of 99.9% and 99.4% for AlexNet and ResNet-50, respectively. In addition, 10.1% of AlexNet misclassifications and 27.7% of ResNet-50 misclassifications were identical to the Inception-v3 predicted labels.
There are two orthogonal hypotheses for this result. First, the ImageNet training-set images themselves may contain a strong bias towards common poses, omitting uncommon poses (Sec. S6 shows supporting evidence from a nearest-neighbor test). Second, the models themselves may not be robust to even slight disturbances of the known, in-distribution poses.
Object detectors. Previous research has shown that object detectors can be more robust to adversarial attacks than image classifiers [23]. Here, we investigate how well our AXs transfer to a state-of-the-art object detector, YOLOv3. YOLOv3 was trained on MS COCO, a dataset of bounding boxes corresponding to 80 different object classes. We only considered the 13 objects that belong to classes present in both the ImageNet and MS COCO datasets. We found that 75.5% of adversarial poses generated for Inception-v3 are also misclassified by YOLOv3 (see Sec. S2 for more details). These results suggest the adversarial pose problem transfers across datasets, models, and tasks.
4.6 Adversarial training
One of the most effective methods for defending against OoD examples has been adversarial training [13], i.e., augmenting the training set with AXs, which is also a common approach in anomaly detection [8]. Here, we test whether adversarial training can improve DNN robustness to new poses generated for (1) our 30 training-set 3D objects; and (2) seven held-out 3D objects.
Training. We augmented the original 1,000-class ImageNet dataset with an additional 30 AX classes. Each AX class included 1,350 randomly selected high-confidence ($p\ge 0.9$) misclassified images split 1,300/50 into training/validation sets. Our AlexNet trained on the augmented dataset (AT) achieved a top-1 accuracy of 0.565 for the original ImageNet validation set and a top-1 accuracy of 0.967 for the AX validation set. (In the latter case, a classification was “correct” if it matched either the original ImageNet positive label or the negative, object label.)

| | PT | AT |
|---|---|---|
| Error (T) | 99.67 | 6.7 |
| Error (H) | 99.81 | 89.2 |
| High-confidence Error (T) | 87.8 | 1.9 |
| High-confidence Error (H) | 48.2 | 33.3 |

Evaluation. To evaluate our AT model vs. a pre-trained AlexNet (PT), we used RS to generate ${10}^{6}$ samples for each of our 3D training objects. In addition, we collected seven held-out 3D objects not included in the training set that belong to the same classes as seven training-set objects (example renders in Fig. S14). We followed the same sampling procedure for the held-out objects to evaluate whether our AT generalizes to unseen objects.
For each of these $30+7=37$ objects and for both the PT and our AT, we recorded two statistics: (1) the percent of misclassifications, i.e., errors; and (2) the percent of high-confidence (i.e., $p\ge 0.7$) misclassifications (Table 3). Following adversarial training, the accuracy of the DNN substantially increased for known objects (Table 3; the error rate dropped from $99.67\%$ to $6.7\%$). However, our AT still misclassified the adversarial poses of held-out objects at an 89.2% error rate.
We hypothesize that augmenting the dataset with many more 3D objects may improve DNN generalization on held-out objects. Here, AT might have used (1) the grey background to separate the 1,000 original ImageNet classes from the 30 AX classes; and (2) some non-geometric features sufficient to discriminate among only 30 objects. However, as suggested by our work (Sec. 2.4), acquiring a large-scale, high-quality 3D object dataset is costly and labor-intensive. Currently, no such public dataset exists, and thus we could not test this hypothesis.
5 Related work
Out-of-distribution detection. OoD classes, i.e., classes not found in the training set, present a significant challenge for computer vision technologies in real-world settings [33]. Here, we study an orthogonal problem: correctly classifying OoD poses of objects from known classes. While refusing to classify is a common approach for handling OoD examples [15, 33], the OoD poses in our work come from known classes and thus should be assigned correct labels.
2D adversarial examples. Numerous techniques for crafting AXs that fool image classifiers have been discovered [45]. However, previous work has typically optimized in the 2D input space [45], e.g., by synthesizing an entire image [27], a small patch [16, 11], a few pixels [7], or only a single pixel [35]. But pixel-wise changes are uncorrelated [26], so pixel-based attacks may not transfer well to the real world [22, 24] because there is an infinitesimal chance that such specifically crafted, uncorrelated pixels will be encountered in the vast physical space of camera, lighting, traffic, and weather configurations.
3D adversarial examples. Athalye et al. [6] used a 3D renderer to synthesize textures for a 3D object such that, under a wide range of camera views, the object was still rendered into an effective AX. We also used 3D renderers, but instead of optimizing textures, we optimized the poses of known objects to cause DNNs to misclassify (i.e., we kept the textures, lighting, camera settings, and background image constant).
Concurrent work. We describe below two concurrent attempts that are closely related but orthogonal to our work. First, Liu et al. [21] proposed a differentiable 3D renderer and used it to perturb both an object’s geometry and the scene’s lighting to cause a DNN to misbehave. However, their geometry perturbations were constrained to be infinitesimal so that the visibility of the vertices would not change. Therefore, their result of minutely perturbing the geometry is effectively similar to that of perturbing textures [6]. In contrast, we performed 3D rotations and 3D translations to move an object inside a 3D space (i.e., the viewing frustum of the camera).
Second, an anonymous ICLR 2019 submission [5] showed how simple rotations and translations of an image can cause DNNs to misclassify. However, these manipulations were still applied to the entire 2D image and thus do not reveal the type of adversarial poses discovered by rotating 3D objects (e.g., a flipped-over school bus; Fig. 1d).
To the best of our knowledge, our work is the first attempt to harness 3D objects to study the OoD poses of well-known training-set objects that cause state-of-the-art ImageNet classifiers and MS COCO detectors to misclassify.
6 Discussion and conclusion
In this paper, we revealed how DNNs’ understanding of objects like “school bus” and “fire truck” is quite naive: they can correctly label only a small subset of the entire pose space for 3D objects. Note that we can also find real-world OoD poses by simply taking photos of real objects (Fig. S17). We believe classifying an arbitrary pose into one of the object classes is an ill-posed task, and that the adversarial pose problem might be alleviated via multiple orthogonal approaches. The first is addressing biased data [42]. Because ImageNet and MS COCO datasets are constructed from photographs taken by people, the datasets reflect the aesthetic tendencies of their photographers. Such biases can be somewhat alleviated through data augmentation, specifically, by harnessing images generated from 3D renderers [34, 4]. From the modeling view, we believe DNNs would also benefit from strong 3D geometric priors [4].
Finally, our work introduced a promising new method (Fig. 2) for testing computer vision DNNs by harnessing 3D renderers and 3D models. While we only optimize a single object here, the framework could be extended to jointly optimize lighting, background image, and multiple objects, all in one “adversarial world”. Not only does our framework enable us to enumerate test cases for DNNs, but it also serves as an interpretability tool for extracting useful insights about these black-box models’ inner functions.
Acknowledgement
AN is supported by multiple funds from Auburn University, a donation from Adobe Inc., and computing credits from Amazon AWS.
References
 [1] Moderngl — moderngl 5.4.1 documentation. https://moderngl.readthedocs.io/en/stable/index.html. (Accessed on 11/14/2018).
 [2] The selfdriving uber that killed a pedestrian didn’t brake. here’s why. https://slate.com/technology/2018/05/ubercarinfatalarizonacrashperceivedpedestrian13secondsbeforeimpact.html. (Accessed on 07/13/2018).
 [3] Tesla car on autopilot crashes, killing driver, united states news & top stories  the straits times. https://www.straitstimes.com/world/unitedstates/teslacaronautopilotcrasheskillingdriver. (Accessed on 06/14/2018).
 [4] H. A. Alhaija, S. K. Mustikovela, A. Geiger, and C. Rother. Geometric image synthesis. arXiv preprint arXiv:1809.04696, 2018.
 [5] Anonymous. A rotation and a translation suffice: Fooling cnns with simple transformations. In Submitted to International Conference on Learning Representations, 2019. under review.
 [6] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. In 2018 Proceedings of the 35th International Conference on Machine Learning (ICML), pages 284–293, 2018.
 [7] N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. In 2017 IEEE Symposium on Security and Privacy (SP), 2017.
 [8] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.
 [9] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.
 [10] B. B. Choi and C. Lawrence. Inverse Kinematics Problem in Robotics Using Neural Networks. NASA Technical Memorandum, 105869:1–23, 1992.
 [11] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust physical-world attacks on machine learning models. arXiv preprint arXiv:1707.08945, 2017.
 [12] D. Gandhi, L. Pinto, and A. Gupta. Learning to fly by crashing. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 3948–3955. IEEE, 2017.
 [13] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [15] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations, 2017.
 [16] D. Karmon, D. Zoran, and Y. Goldberg. Lavan: Localized and visible adversarial noise. arXiv preprint arXiv:1801.02608, 2018.
 [17] H. Kato, Y. Ushiku, and T. Harada. Neural 3D Mesh Renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS 2012), pages 1097–1105, 2012.
 [19] F. Lambert. Understanding the fatal tesla accident on autopilot and the nhtsa probe. Electrek, July, 2016.
 [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
 [21] H.-T. D. Liu, M. Tao, C.-L. Li, D. Nowrouzezahrai, and A. Jacobson. Adversarial Geometry and Lighting using a Differentiable Renderer. arXiv preprint, 8 2018.
 [22] J. Lu, H. Sibai, E. Fabry, and D. Forsyth. NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles. arXiv preprint, 7 2017.
 [23] J. Lu, H. Sibai, E. Fabry, and D. A. Forsyth. Standard detectors aren’t (currently) fooled by physical adversarial stop signs. CoRR, abs/1710.03337, 2017.
 [24] Y. Luo, X. Boix, G. Roig, T. Poggio, and Q. Zhao. Foveation-based Mechanisms Alleviate Adversarial Examples. arXiv preprint, 11 2015.
 [25] S. Marschner and P. Shirley. Fundamentals of computer graphics. CRC Press, 2015.
 [26] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, volume 2, page 7, 2017.
 [27] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 427–436, 2015.
 [28] M. Osadchy, M. L. Miller, and Y. LeCun. Synergistic Face Detection and Pose Estimation with Energy-Based Models. In Advances in Neural Information Processing Systems, pages 1017–1024, 2005.
 [29] PyTorch. torchvision.models — pytorch master documentation. https://pytorch.org/docs/stable/torchvision/models.html. (Accessed on 11/14/2018).
 [30] J. Redmon and A. Farhadi. YOLOv3: An Incremental Improvement. 2018.
 [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [32] C. Sampedro, A. Rodriguez-Ramos, H. Bavle, A. Carrio, P. de la Puente, and P. Campoy. A fully-autonomous aerial robot for search and rescue applications in indoor environments using learning-based techniques. Journal of Intelligent & Robotic Systems, pages 1–27, 2018.
 [33] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2013.
 [34] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, volume 2, page 5, 2017.
 [35] J. Su, D. V. Vargas, and S. Kouichi. One Pixel Attack for Fooling Deep Neural Networks. arXiv preprint, 2017.
 [36] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
 [37] M. Sugiyama, N. D. Lawrence, A. Schwaighofer, et al. Dataset shift in machine learning. The MIT Press, 2017.
 [38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
 [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 12 2016.
 [40] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
 [41] Y. Tian, K. Pei, S. Jana, and B. Ray. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314. ACM, 2018.
 [42] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE, 2011.
 [43] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL programming guide: the official guide to learning OpenGL, version 1.2. Addison-Wesley Longman Publishing Co., Inc., 1999.
 [44] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference on Computer Vision, pages 160–176. Springer, 2016.
 [45] X. Yuan, P. He, Q. Zhu, and X. Li. Adversarial Examples: Attacks and Defenses for Deep Learning. arXiv preprint, 2017.
 [46] Z. Zhu, L. Xie, and A. L. Yuille. Object recognition with and without objects. arXiv preprint arXiv:1611.06596, 2016.
Supplementary materials for:
Strike (with) a Pose: Neural Networks Are Easily Fooled
by Strange Poses of Familiar Objects
S1 Extended description of the 3D object dataset and its evaluation
S1.1 Dataset construction
Classes. Our main dataset consists of 30 unique 3D object models corresponding to 30 ImageNet classes relevant to a traffic environment. The 30 classes include 20 vehicles (e.g., school bus and cab) and 10 street-related items (e.g., traffic light). See Fig. S2 for example renders of each object.
Acquisition. We collected 3D objects and constructed our own datasets for the study. 3D models with high-quality image textures were purchased from turbosquid.com, free3d.com, and cgtrader.com.
To make sure the renders were as close to real ImageNet photos as possible, we used only 3D models that had high-quality 2D image textures. We did not choose 3D models from public datasets, e.g., ObjectNet3D [44], because most of them do not have high-quality image textures. While the renders of such models may be correctly classified by DNNs, we excluded them from our study because of their poor realism. We also examined the ImageNet images to ensure they contained real-world examples qualitatively similar to each 3D object in our 3D dataset.
3D objects. Each 3D object is represented as a mesh, i.e., a list of triangular faces, each defined by three vertices [25]. The 30 meshes have on average $9,908$ triangles (see Table S1 for specific numbers).
3D object  Tessellated ${N}_{T}$  Original ${N}_{O}$ 

ambulance  70,228  5,348 
backpack  48,251  1,689 
bald eagle  63,212  2,950 
beach wagon  220,956  2,024 
cab  53,776  4,743 
cellphone  59,910  502 
fire engine  93,105  8,996 
forklift  130,455  5,223 
garbage truck  97,482  5,778 
German shepherd  88,496  88,496 
golf cart  98,007  5,153 
jean  17,920  17,920 
jeep  191,144  2,282 
minibus  193,772  1,910 
minivan  271,178  1,548 
motor scooter  96,638  2,356 
moving van  83,712  5,055 
park bench  134,162  1,972 
parking meter  37,246  1,086 
pickup  191,580  2,058 
police van  243,132  1,984 
recreational vehicle  191,532  1,870 
school bus  229,584  6,244 
sports car  194,406  2,406 
street sign  17,458  17,458 
tiger cat  107,431  3,954 
tow truck  221,272  5,764 
traffic light  392,001  13,840 
trailer truck  526,002  5,224 
umbrella  71,410  71,410 
S1.2 Manual object tessellation for experiments using the Differentiable Renderer
In contrast to ModernGL [1], the non-differentiable renderer (NR) in our paper, the differentiable renderer (DR) by Kato et al. [17] does not perform tessellation, a standard process to increase the resolution of renders.
Therefore, the render quality of the DR is lower than that of the NR.
To minimize this gap and make results from the NR more comparable with those from the DR, we manually tessellated each 3D object as a preprocessing step for rendering with the DR.
Using the manually tessellated objects, we then (1) evaluated the render quality of the DR (Sec. S1.3); and (2) performed research experiments with the DR (i.e., the DRG method in Sec. 4.4).
Tessellation. We used the Quadify Mesh Modifier feature (quad size of 2%) in 3ds Max 2018 to tessellate objects, increasing the average number of faces $\sim $15x from $9,908$ to $147,849$ (see Table S1). The render quality after tessellation is sharper and of a higher resolution (see Fig. S1a vs. b). Note that the NR pipeline already performs tessellation for every input 3D object. Therefore, we did not perform manual tessellation for 3D objects rendered by the NR.
S1.3 Evaluation
We recognize that a reality gap will often exist between a render and a real photo. Therefore, we rigorously evaluated our renders to make sure the reality gap was acceptable for our study. From $\sim $100 initially purchased 3D object models, we selected the 30 highest-quality objects that both (1) passed a visual human Turing test; and (2) were correctly recognized with high confidence by the Inception-v3 classifier [38].
S1.3.1 Qualitative evaluation
We did not use the 30 objects chosen for the main dataset (Sec. S1.1) to evaluate the general quality of the DR renderings of high-quality objects on realistic background images. Instead, we randomly chose a separate set of 17 high-quality image-textured objects for evaluation. Using the 17 objects, we generated 56 renders that matched 56 reference (real) photos. Then, we qualitatively evaluated the renders both separately and in a side-by-side comparison with real photos. Specifically, we produced 56 (real photo, render) pairs (see Fig. S3) via the following steps:

1. We retrieved $\sim $3 real photos for each 3D object (e.g., a car) from the Internet (using descriptive information, e.g., a car’s make, model, and year).
2. For each real photo, we replaced the object with matching background content via Adobe Photoshop’s Context-Aware Fill-In feature to obtain a background-only photo $B$ (i.e., no foreground objects).
3. We rendered the 3D object with the differentiable renderer on the background $B$ obtained in Step 2. We then manually aligned the pose of the 3D object such that it closely matched that in the reference photo.
4. We evaluated pairs of (photo, render) in a side-by-side comparison.
While discrepancies can be visually spotted in our side-by-side comparisons, we found that most of the renders passed our human visual Turing test if presented alone. That is, it is not easy for humans to tell whether a render is a real photo or not (if they are not primed with the reference photos). We only show pairs rendered by the DR because the NR qualitatively has a slightly higher rendering quality (Fig. S1b vs. c).
S1.3.2 Quantitative evaluation
In addition to the qualitative evaluation, we also quantitatively evaluated the top-1 accuracy of Google's Inception-v3 [38] on renders that use either (a) an empty background or (b) real background images.
a. Evaluation of the renders of 30 objects on an empty background
Because the experiments in the main text used our self-assembled 30-object dataset (Sec. S1.1), we describe the process and the results of our quantitative evaluation for only those objects.
We rendered the objects on a white background with RGB values of (1.0, 1.0, 1.0), an ambient light intensity of 0.9, and a directional light intensity of 0.5. For each object, we sampled 36 unique views (common in ImageNet) evenly divided into three sets. For each set, we set the object at the origin, the up direction to $(0,1,0)$, and the camera position to $(0,0,z)$ where $z\in \{4,6,8\}$. We sampled 12 views per set by starting the object at a ${10}^{\circ}$ yaw and generating a render at every ${30}^{\circ}$ yaw-rotation. Across all objects and all renders, the Inception-v3 top-1 accuracy is $83.23\%$ (comparable to $77.45\%$ on ImageNet images [38]) with a mean top-1 confidence score of $0.78$. The top-1 and top-5 average accuracy and confidence scores are shown in Table S2.
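The view-sampling scheme described above (three camera distances, 12 yaw angles each) can be sketched as a simple enumeration. This is only an illustration of the sampling grid; the function name `sample_views` and the dictionary keys are hypothetical, and no rendering is performed.

```python
# Sketch of the 36-view sampling grid; names are illustrative assumptions.
DISTANCES = (4, 6, 8)    # camera z positions, one per set
YAW_START_DEG = 10       # initial yaw of the object
YAW_STEP_DEG = 30        # yaw increment between consecutive renders
VIEWS_PER_SET = 12

def sample_views():
    """Enumerate (camera position, up vector, yaw) for every evaluation view."""
    views = []
    for z in DISTANCES:
        for k in range(VIEWS_PER_SET):
            yaw = (YAW_START_DEG + k * YAW_STEP_DEG) % 360
            views.append({
                "camera_position": (0, 0, z),
                "up": (0, 1, 0),
                "yaw_deg": yaw,
            })
    return views

views = sample_views()
```

Twelve steps of 30 degrees cover a full turn, so each set contains 12 distinct yaw angles (10, 40, ..., 340) and the three sets together give 36 views per object.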
Distance  4  6  8  Average 

top-1 mean accuracy  84.2%  84.4%  81.1%  83.2% 
top-5 mean accuracy  95.3%  98.6%  96.7%  96.9% 
top-1 mean confidence score  0.77  0.80  0.76  0.78 
b. Evaluation of the renders of test objects on real backgrounds
In addition to our qualitative side-by-side (real photo, render) comparisons (Fig. S3), we quantitatively compared Inception-v3’s predictions for our renders to those for real photos. We found a high similarity between real photos and renders for DNN predictions. That is, across all 56 pairs (Sec. S1.3.1), the top-1 predictions match 71.43% of the time. Across all pairs, 76.06% of the top-5 labels for real photos match those for renders.
S2 Transferability from the Inception-v3 classifier to the YOLOv3 detector
Previous research has shown that object detectors can be more robust to adversarial attacks than image classifiers [23]. Here, we investigate how well our AXs generated for an Inception-v3 classifier trained to perform 1,000-way image classification on ImageNet [31] transfer to YOLOv3, a state-of-the-art object detector trained on MS COCO [20].
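One plausible way to compute the two agreement measures above is sketched below. The text does not spell out the exact formulas, so both definitions are assumptions: top-1 agreement is taken as the fraction of pairs whose highest-ranked labels coincide, and top-5 agreement as the fraction of a photo's top-5 labels that also appear in the top-5 list of its render. The function names and toy labels are hypothetical.

```python
def top1_match_rate(photo_top5, render_top5):
    """Percent of (photo, render) pairs whose top-1 labels are identical."""
    matches = sum(p[0] == r[0] for p, r in zip(photo_top5, render_top5))
    return 100.0 * matches / len(photo_top5)

def top5_overlap_rate(photo_top5, render_top5):
    """Percent of the photos' top-5 labels that also occur in the render's top-5."""
    shared = sum(len(set(p) & set(r)) for p, r in zip(photo_top5, render_top5))
    total = sum(len(p) for p in photo_top5)
    return 100.0 * shared / total

# Toy data (hypothetical labels): two (photo, render) pairs.
photos = [["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"]]
renders = [["a", "c", "x", "y", "z"], ["g", "f", "h", "i", "k"]]

top1 = top1_match_rate(photos, renders)
top5 = top5_overlap_rate(photos, renders)
```

Under these definitions, a top-1 agreement of 71.43% over 56 pairs corresponds to 40 matching pairs.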
Note that while ImageNet has 1,000 classes, MS COCO has bounding boxes classified into only 80 classes. Therefore, among 30 objects, we only selected the 13 objects that (1) belong to classes found in both the ImageNet and MS COCO datasets; and (2) are also well recognized by the YOLOv3 detector in common poses.
S2.1 Class mappings from ImageNet to MS COCO
See Table S3a for 13 mappings from ImageNet labels to MS COCO labels.
S2.2 Selecting 13 objects for the transferability test
For the transferability test (Sec. S2.3), we identified the 13 objects (out of 30) that are well detected by the YOLOv3 detector via the two tests described below.
S2.2.1 YOLOv3 correctly classifies 93.80% of poses generated via yaw-rotation
We rendered 36 unique views for each object by generating a render at every ${30}^{\circ}$ yaw-rotation (see Sec. S1.3.2). Note that, across all objects, these yaw-rotation views have an average accuracy of $83.2\%$ by the Inception-v3 classifier. We tested them against YOLOv3 to see whether the detector was able to correctly find one single object per image and label it correctly. Among 30 objects, we removed those on which YOLOv3 had an accuracy $\le 70\%$, leaving 13 for the transferability test. Across the remaining 13 objects, YOLOv3 has an accuracy of 93.80% on average (with an NMS threshold of $0.4$ and a confidence threshold of $0.5$). Note that the accuracy was computed as the total number of correct labels over the total number of bounding boxes detected (i.e., we did not measure bounding-box IoU errors). See class-specific statistics in Table S3. This result shows that YOLOv3 is substantially more accurate than Inception-v3 on the standard object poses generated by yaw-rotation (93.80% vs. 83.2%).
S2.2.2 YOLOv3 correctly classifies 81.03% of poses correctly classified by Inception-v3
Additionally, as a sanity check, we tested whether poses correctly classified by Inception-v3 transfer well to YOLOv3. For each object, we randomly selected 30 poses that were correctly classified by Inception-v3 with high confidence ($p\ge 0.9$). The images were generated via the random search procedure in the main text experiment (Sec. 3.2). Across the final 13 objects, YOLOv3 was able to correctly detect one single object per image and label it correctly with 81.03% accuracy (see Table S3c).
S2.3 Transferability test: YOLOv3 fails on 75.5% of adversarial poses misclassified by Inception-v3
For each object, we collected 1,350 random adversarial poses (i.e., incorrectly classified by Inception-v3) generated via the random search procedure (Sec. 3.2). Across all 13 objects and all adversarial poses, YOLOv3 obtained an accuracy of only $24.50\%$ (compared to $81.03\%$ when tested on images correctly classified by Inception-v3). In other words, 75.5% of adversarial poses generated for Inception-v3 also escaped the detection of YOLOv3 (see Table S3d for class-specific statistics).^{3} Our result shows adversarial poses transfer well across tasks (image classification $\to $ object detection), models (Inception-v3 $\to $ YOLOv3), and datasets (ImageNet $\to $ MS COCO).
^{3} We were not able to check how many misclassification labels by YOLOv3 were the same as those by Inception-v3 because only a small subset of the 80 MS COCO classes overlaps with the 1,000 ImageNet classes.
(a) Label mapping  (b) Accuracy on yaw-rotation poses  (c) Accuracy on random poses  (d) Accuracy on adversarial poses 
#  ImageNet  MS COCO  #/36  acc (%)  #/30  acc (%)  #/1350  acc (%)  $\mathrm{\Delta}$acc (%) 

1  park bench  bench  31  86.11  22  73.33  211  15.63  57.70 
2  bald eagle  bird  34  94.44  24  80.00  597  44.22  35.78 
3  school bus  bus  36  100.00  18  60.00  4  0.30  59.70 
4  beach wagon  car  34  94.44  30  100.00  232  17.19  82.81 
5  tiger cat  cat  26  72.22  25  83.33  181  13.41  69.93 
6  German shepherd  dog  32  88.89  28  93.33  406  30.07  63.26 
7  motor scooter  motorcycle  36  100.00  18  60.00  384  28.44  31.56 
8  jean  person  36  100.00  29  96.67  943  69.85  26.81 
9  street sign  stop sign  31  86.11  26  86.67  338  25.04  61.63 
10  moving van  truck  36  100.00  24  80.00  15  1.11  78.89 
11  umbrella  umbrella  35  97.22  25  83.33  907  67.19  16.15 
12  police van  car  36  100.00  25  83.33  55  4.07  79.26 
13  trailer truck  truck  36  100.00  22  73.33  26  1.93  71.41 
Average  93.80  81.03  24.50  56.53 
(a) The mappings of 13 ImageNet classes onto 11 MS COCO classes (car and truck each appear twice).
(b) The accuracy (“acc (%)”) of the YOLOv3 detector on 36 yaw-rotation poses per object.
(c) The accuracy of YOLOv3 on 30 random poses per object that were correctly classified by Inception-v3.
(d) The accuracy of YOLOv3 on 1,350 adversarial poses (“acc (%)”) and the differences between c and d (“$\mathrm{\Delta}$acc (%)”).
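The averages reported for the three test sets can be re-derived from the per-object counts of correct detections in the table. The sketch below does this under the assumption, suggested by the “#/36”, “#/30”, and “#/1350” column headers, that every object contributes the same number of test images; the variable and function names are illustrative only.

```python
# Per-object counts of correct YOLOv3 labels, in table order (rows 1-13).
YAW_CORRECT = [31, 34, 36, 34, 26, 32, 36, 36, 31, 36, 35, 36, 36]      # out of 36
RANDOM_CORRECT = [22, 24, 18, 30, 25, 28, 18, 29, 26, 24, 25, 25, 22]   # out of 30
ADV_CORRECT = [211, 597, 4, 232, 181, 406, 384, 943, 338, 15, 907, 55, 26]  # out of 1350

def mean_accuracy(correct_counts, n_per_object):
    """Average accuracy (%) when every object has n_per_object test images."""
    return 100.0 * sum(correct_counts) / (n_per_object * len(correct_counts))

yaw_acc = mean_accuracy(YAW_CORRECT, 36)
random_acc = mean_accuracy(RANDOM_CORRECT, 30)
adv_acc = mean_accuracy(ADV_CORRECT, 1350)

# Share of Inception-v3 adversarial poses on which YOLOv3 also fails.
transfer_rate = 100.0 - adv_acc
```

Rounded to two decimals, these reproduce the average accuracies of 93.80%, 81.03%, and 24.50%, and hence the 75.5% transfer rate quoted in Sec. S2.3.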
S3 Experimental setup for the differentiable renderer
For the gradient descent method (DRG) that uses the approximate gradients provided by the differentiable renderer [17] (DR), we set up the rendering parameters in the DR to closely match those in the NR.
However, there were still subtle discrepancies between the DR and the NR that made the results (DRG vs. FDG in Sec. 4.4) not directly comparable.
Despite these discrepancies (described below), we still believe the FD gradients are more stable and informative than the DR gradients (i.e., FDG outperformed DRG).^{4}
^{4} In preliminary experiments with only the DR (not the NR), we also empirically found FDG to be more stable and effective than DRG (data not shown).
DR setup. For all experiments with the DR, the camera was centered at $(0,0,16)$ with an up direction $(0,1,0)$. The object’s spatial location was constrained such that the object center was always within the frame. The depth values were constrained to be within $[-14,14]$. Similar to experiments with the NR, we used the $\mathsf{medium}$ lighting setting. The ambient light color was set to white with an intensity 1.0, while the directional light was set to white with an intensity 0.4. Fig. S8 shows an example school bus rendered under this $\mathsf{medium}$ lighting at different distances.
The known discrepancies between the experimental setups of FDG (with the NR) vs. DRG (with the DR) are:

1. The exact $\mathsf{medium}$ lighting parameters for the NR described in the main text (Sec. 4.1) did not produce similar lighting effects in the DR. Therefore, the DR lighting parameters described above were the result of manually tuning to qualitatively match the effect produced by the NR $\mathsf{medium}$ lighting parameters.
2. While the NR uses a built-in tessellation procedure that automatically tessellates input objects before rendering, we had to perform an extra preprocessing step of manually tessellating each object for the DR. While small, a discrepancy still exists between the two rendering results (Fig. S1b vs. c).
S4 Gradient descent with the DR gradients
In preliminary experiments (data not shown), we found the DR gradients to be relatively noisy when using gradient descent to find targeted adversarial poses (i.e., DRG experiments). To mitigate this problem, we experimented with (1) parameter augmentation (Sec. S4.1); and (2) multi-view optimization (Sec. S4.2). In short, we found parameter augmentation helped and used it in DRG. However, when using the DR, we did not find multiple cameras improved optimization performance and thus only performed regular single-view optimization for DRG.
S4.1 Parameter augmentation
We performed gradient descent using the DR gradients (DRG) in an augmented parameter space corresponding to 50 rotations and one translation to be applied to the original object vertices. That is, we backpropagated the DR gradients into the parameters of these predefined transformation matrices. Note that DRG is given the same budget of $100$ steps per optimization run as FDG and ZRS for comparison in Sec. 4.4.
The final transformation matrix is constructed by a series of rotations followed by one translation, i.e.,
$M=T\cdot {R}_{n-1}{R}_{n-2}\cdots {R}_{0}$ 
where $M$ is the final transformation matrix, ${R}_{i}$ the rotation matrices, and $T$ the translation matrix.
We empirically found that increasing the number of rotations per step helped (a) improve the success rate of hitting the target labels; (b) increase the maximum confidence score of the found AXs; and (c) reduce the number of steps, i.e., led to faster convergence (see Fig. S9). Therefore, we empirically chose $n=50$ for all DRG experiments reported in the main text.
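The construction of the final transformation from a chain of rotations followed by one translation can be illustrated with NumPy. This is a minimal sketch, not the DRG implementation: it uses only yaw rotations for brevity, the helper names `rot_y`, `translation`, and `compose` are hypothetical, and the rotation angles are fixed constants rather than parameters optimized by gradient descent.

```python
import numpy as np

def rot_y(theta):
    """4x4 homogeneous rotation by theta (radians) about the y-axis (yaw)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [-s, 0.0, c, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def translation(tx, ty, tz):
    """4x4 homogeneous translation by (tx, ty, tz)."""
    T = np.eye(4)
    T[:3, 3] = (tx, ty, tz)
    return T

def compose(rotations, T):
    """Return M = T @ R_{n-1} @ ... @ R_0 for rotations = [R_0, ..., R_{n-1}]."""
    M = np.eye(4)
    for R in rotations:      # R_0 is applied to the vertices first
        M = R @ M
    return T @ M

# n = 50 equal yaw steps that add up to one full turn, then a translation.
n = 50
rotations = [rot_y(2.0 * np.pi / n) for _ in range(n)]
M = compose(rotations, translation(1.0, 2.0, 3.0))

vertex = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous vertex (x, y, z, 1)
moved = M @ vertex
```

Because the 50 yaw steps sum to a full turn, the rotational part of `M` is numerically the identity and the vertex is simply translated, which makes the composition order easy to check by hand.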
S4.2 Multi-view optimization
Additionally, we attempted to harness multiple views (from multiple cameras) to increase the chance of finding a target adversarial pose. Multi-view optimization did not outperform single-view optimization using the DR in our experiments. Therefore, we only performed regular single-view optimization for DRG. We briefly document our negative results below.
Instead of backpropagating the DR gradient to a single camera looking at the object in the 3D scene, one may set up multiple cameras, each looking at the object from a different angle. This strategy intuitively allows gradients to still be backpropagated into the vertices that may be occluded in one view but visible in some other view. We experimented with six cameras and backpropagating to all cameras in each step. However, we only updated the object following the gradient from the view that yielded the lowest loss among all views. One hypothesis is that having multiple cameras might improve the chance of hitting the target.
In our experiments with the DR using 100 steps per optimization run, multi-view optimization performed worse than single-view in terms of both the success rate and the number of steps to converge. We did not compare all 30 objects due to the expensive computational cost, and only report the results from optimizing two objects, bald eagle and tiger cat, in Table S4. Intuitively, multi-view optimization might outperform single-view optimization given a large enough number of steps.
Method  bald eagle: Steps  bald eagle: Success rate  tiger cat: Steps  tiger cat: Success rate 

Single-view  71.80  0.44  90.70  0.15 
Multi-view  81.28  0.23  96.84  0.04 
S5 3D Transformation Matrix
A rotation of $\theta $ around an arbitrary axis $(x,y,z)$ is given by the following homogeneous transformation matrix.
$$R=\left[\begin{array}{cccc}xx(1-c)+c & xy(1-c)-zs & xz(1-c)+ys & 0 \\ xy(1-c)+zs & yy(1-c)+c & yz(1-c)-xs & 0 \\ xz(1-c)-ys & yz(1-c)+xs & zz(1-c)+c & 0 \\ 0 & 0 & 0 & 1\end{array}\right]$$  (8) 
where $s=\mathrm{sin}\theta $, $c=\mathrm{cos}\theta $, and the axis is normalized, i.e., ${x}^{2}+{y}^{2}+{z}^{2}=1$. Translation by a vector $(x,y,z)$ is given by the following homogeneous transformation matrix.
$$T=\left[\begin{array}{cccc}1 & 0 & 0 & x \\ 0 & 1 & 0 & y \\ 0 & 0 & 1 & z \\ 0 & 0 & 0 & 1\end{array}\right]$$  (9) 
Note that in the optimization experiments with random search (RS) and finite-difference gradients (FDG), we dropped the homogeneous component for simplicity, i.e., the rotation matrices of yaw, pitch, and roll are all $3\times 3$. The homogeneous component is only necessary for translation, which can be achieved via simple vector addition. However, in DRG, we used the homogeneous component because we had some experiments interweaving translation and rotation. The matrix representation was more convenient for the DRG experiments. As they are mathematically equivalent, this arbitrary implementation choice should not alter our results.
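As a check on the matrices in this section, the following sketch builds the axis-angle rotation and the translation in their standard homogeneous form and verifies numerically that applying them as $4\times 4$ matrices gives the same point as a $3\times 3$ rotation followed by vector addition. The function names are hypothetical, and the code is an independent illustration rather than the implementation used in the experiments.

```python
import numpy as np

def axis_angle_rotation(axis, theta):
    """4x4 homogeneous rotation by theta (radians) about the given axis."""
    x, y, z = np.asarray(axis, dtype=float) / np.linalg.norm(axis)  # normalize
    c, s = np.cos(theta), np.sin(theta)
    t = 1.0 - c
    return np.array([
        [x * x * t + c,     x * y * t - z * s, x * z * t + y * s, 0.0],
        [x * y * t + z * s, y * y * t + c,     y * z * t - x * s, 0.0],
        [x * z * t - y * s, y * z * t + x * s, z * z * t + c,     0.0],
        [0.0,               0.0,               0.0,               1.0],
    ])

def translation_matrix(v):
    """4x4 homogeneous translation by the 3-vector v."""
    T = np.eye(4)
    T[:3, 3] = v
    return T

# Arbitrary example values (assumptions for illustration).
axis, theta = (1.0, 2.0, 3.0), 0.7
shift = np.array([0.5, -1.0, 2.0])
p = np.array([1.0, 0.0, 0.0])

R = axis_angle_rotation(axis, theta)
T = translation_matrix(shift)

homogeneous = (T @ R @ np.append(p, 1.0))[:3]   # 4x4 pipeline (as in DRG)
plain = R[:3, :3] @ p + shift                   # 3x3 rotation + vector addition
```

Both routes yield the same transformed point, which is the sense in which the two implementation choices are mathematically equivalent.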
Object  Accuracy (%) 

ambulance  3.64 
backpack  8.63 
bald eagle  13.26 
beach wagon  0.60 
cab  2.64 
cell phone  14.97 
fire engine  4.31 
forklift  5.20 
garbage truck  4.88 
German shepherd  9.61 
golf cart  2.14 
jean  2.71 
jeep  0.29 
minibus  0.83 
minivan  0.66 
motor scooter  20.49 
moving van  0.45 
park bench  5.72 
parking meter  1.27 
pickup  0.86 
police van  0.95 
recreational vehicle  2.05 
school bus  3.48 
sports car  2.50 
street sign  26.32 
tiger cat  7.36 
tow truck  0.87 
traffic light  14.95 
trailer truck  1.27 
umbrella  49.88 
S6 Adversarial poses were not found in ImageNet classes via a nearest-neighbor search
We performed a nearest-neighbor search to check whether adversarial poses generated (in Sec. 4.1) can be found in the ImageNet dataset.
Retrieving nearest neighbors from a single class corresponding to the 3D object. We retrieved the five nearest training-set images for each adversarial pose (taken from a random selection of adversarial poses) using the $\mathsf{fc7}$ feature space from a pretrained AlexNet [18]. The Euclidean distance was used to measure the distance between two $\mathsf{fc7}$ feature vectors. We did not find qualitatively similar images despite comparing all $\sim $1,300 class images corresponding to the 3D object used to generate the adversarial poses (e.g., cellphone, school bus, and garbage truck in Figs. S11, S12, and S13). This result supports the hypothesis that the generated adversarial poses are out-of-distribution.
Searching from the validation set. We also searched the entire 50,000-image validation set of ImageNet. Interestingly, we found the top-5 nearest images were sometimes from the same class as the targeted misclassification label (see Fig. S19).
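The retrieval step can be sketched as a brute-force Euclidean nearest-neighbor search over precomputed feature vectors. In the sketch below, random vectors stand in for real features, so the data are purely hypothetical; the 4096-dimensional size matches AlexNet's fc7 layer, and the function name `nearest_neighbors` is an assumption for illustration.

```python
import numpy as np

def nearest_neighbors(query, gallery, k=5):
    """Indices and distances of the k gallery rows closest to query (Euclidean)."""
    dists = np.linalg.norm(gallery - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

# Stand-in data: ~1,300 class images, each a 4096-d feature vector.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1300, 4096))

# A query lying very close to gallery item 42, to make the result predictable.
query = gallery[42] + 0.01 * rng.normal(size=4096)

idx, dists = nearest_neighbors(query, gallery, k=5)
```

With real features, `gallery` would hold the feature vectors of the training images of one class and `query` the feature vector of a rendered adversarial pose; the five returned images would then be inspected qualitatively.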