Learning From Brains How to Regularize Machines
Abstract
Despite impressive performance on numerous visual tasks, Convolutional Neural Networks (CNNs), unlike brains, are often highly sensitive to small perturbations of their input, e.g. adversarial noise leading to erroneous decisions. We propose to regularize CNNs using large-scale neuroscience data to learn more robust neural features in terms of representational similarity. We presented natural images to mice and measured the responses of thousands of neurons from cortical visual areas. Next, we denoised the notoriously variable neural activity using strong predictive models trained on this large corpus of responses from the mouse visual system, and calculated the representational similarity for millions of pairs of images from the model's predictions. We then used the neural representational similarity to regularize CNNs trained on image classification by penalizing intermediate representations that deviated from neural ones. This preserved the performance of baseline models when classifying images under standard benchmarks, while maintaining substantially higher performance than baseline or control models when classifying noisy images. Moreover, the models regularized with cortical representations were also more robust to adversarial attacks. This demonstrates that regularizing with neural data can be an effective tool to create an inductive bias towards more robust inference.
Zhe Li^{1,2,*}, Wieland Brendel^{4,5}, Edgar Y. Walker^{1,2,7}, Erick Cobos^{1,2}, Taliah Muhammad^{1,2}, Jacob Reimer^{1,2}, Matthias Bethge^{4,6}, Fabian H. Sinz^{2,5,7}, Xaq Pitkow^{1,3}, Andreas S. Tolias^{1,3}
^{1} Department of Neuroscience, Baylor College of Medicine
^{2} Center for Neuroscience and Artificial Intelligence, Baylor College of Medicine
^{3} Department of Electrical and Computer Engineering, Rice University
^{4} Centre for Integrative Neuroscience, University of Tübingen
^{5} Bernstein Center for Computational Neuroscience, University of Tübingen
^{6} Institute for Theoretical Physics, University of Tübingen
^{7} Institute for Bioinformatics and Medical Informatics, University of Tübingen
^{*}[email protected]
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
Convolutional neural network (CNN) models are widely used in computer vision and can achieve superhuman performance on many classification tasks [1, 2]. However, there is still a huge gap between these models and the human visual system in terms of robustness and generalization [3, 4, 5]. In fact, invariant neural representations and the ability to generalize across complex transformations have been seen as hallmarks of visual intelligence [6, 7, 8, 9]. Understanding why the visual system performs so much better on so many perceptual problems is one of the central questions of neuroscience and machine learning. In particular, CNNs are vulnerable to adversarial attacks and noise distortions [10, 3, 4], while human perception is barely affected by these small image perturbations. This highlights that state-of-the-art CNNs lack human-level scene understanding and do not rely on the same causal features as humans for visual perception [4, 5, 11].
Regularization and implicit inductive biases in deep networks can improve robustness and generalization by constraining the parameter space and biasing the trained model to use better features. However, these biases are often rather non-specific, and networks frequently latch onto patterns that do not generalize well outside the distribution of the training data. Biological visual systems, in contrast, cope with strongly varying conditions all the time. Based on the recently reported overlap between the sensory representations of task-trained CNNs and representations measured in primate brains [12, 13, 14, 15, 16], we hypothesized that biasing the representations of artificial networks towards biological stimulus representations might improve their robustness.
Here, we show that directly measuring the neural representation in animal visual cortex and biasing CNN models toward a more biological feature space can indeed lead to more robust models. To this end, we recorded the simultaneous responses of thousands of neurons to complex natural scenes in the visual cortex of awake mice. To bias a CNN towards biological feature representations, we modified its objective function so that the similarity structure of its convolutional features is encouraged to match that of the neural activity. We found that by regularizing a ResNet [1] towards a biological neural representation, the trained models had higher classification accuracy than baseline when input images were corrupted by random noise or adversarial perturbations. Regularization towards random representations or towards features from a pre-trained VGG model was substantially less helpful.
2 Neural representation similarity
We performed several 2-photon scans in primary visual cortex on multiple mice, with repeated scans per animal across different days. During the experiment, the head-fixed mice were able to run on a treadmill while passively viewing natural images, each presented for 500 ms. In each experiment, we measured responses to 5100 different grayscale images sampled from the ImageNet dataset, 100 of which were repeated 10 times, giving 6000 trials in total (5000 single presentations plus $100\times 10$ repeats). Each image was downsampled by a factor of four to $64\times 36$ pixels. We call the repeated images 'oracle images', because the mean neural responses over these repeated trials serve as a high-quality predictor (oracle) for validation trials. The main reason for choosing mice is that the genetic tools available in this species allow large-scale recordings ($\sim$8000 units simultaneously). While mice do not have visual systems as sophisticated as those of primates, vision is still one of their major sensory inputs. Grayscale images were used because mice are not sensitive to the colors relevant to human vision.
We begin by defining the similarity metric for neural responses, which we will then use to regularize a CNN for image classification. First, the raw response ${\rho}_{ai}$ of each neuron $a$ to stimulus $i$ is scaled by the neuron's signal-to-noise ratio
$$w_a = \frac{\sigma_a}{\eta_a}, \tag{1}$$
which is estimated from responses to repeated stimuli, namely the oracle images. For a neuron $a$, the signal strength $\sigma_a^2 = \mathrm{Var}_i(\mathbb{E}_t[\rho_{ait}])$ is the variance over stimuli $i$ of the mean response over repeated trials $t$. The noise strength is the mean over stimuli of the variance over trials, $\eta_a^2 = \mathbb{E}_i[\mathrm{Var}_t(\rho_{ait})]$. We denote the scaled responses by $r_{ai} = w_a \rho_{ai}$, and the scaled population response to stimulus $i$ by the vector $\bm{r}_i$. Scaling responses by the signal-to-noise ratio accounts for the reliability of each neuron by reducing the influence of noisy neurons. For example, if the responses of a neuron to the same image are highly variable, the neuron receives a small weight and thus contributes little to the similarity metric, no matter how differently it responds to different images or how high its responses are in general.
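A minimal sketch of the signal-to-noise weights in Eq. 1, assuming the oracle responses are stored as a nested list `oracle[a][i][t]` (neuron, image, repeat). The function name `snr_weights` and the use of population variance (`statistics.pvariance`) are our assumptions; the text does not specify the variance estimator.

```python
from statistics import mean, pvariance


def snr_weights(oracle):
    """Return w_a = sigma_a / eta_a for each neuron (Eq. 1).

    oracle[a][i][t] is the raw response of neuron a to oracle image i on repeat t.
    Population variance is used throughout (an assumption).
    """
    weights = []
    for neuron in oracle:
        # signal strength: variance over images of the trial-averaged response
        signal_var = pvariance([mean(trials) for trials in neuron])
        # noise strength: mean over images of the variance across repeats
        noise_var = mean([pvariance(trials) for trials in neuron])
        weights.append((signal_var / noise_var) ** 0.5)
    return weights


# A reliable neuron (small trial-to-trial spread) and a noisy one
w = snr_weights([[[1, 3], [5, 7]], [[0, 4], [2, 6]]])
# w == [2.0, 0.5]
```

The noisy second neuron ends up with a quarter of the weight of the reliable one, which is exactly the down-weighting described above.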
We then shift and normalize these population responses, creating centered unit vectors $\bm{e}_i = \frac{\bm{r}_i - \overline{\bm{r}}}{\|\bm{r}_i - \overline{\bm{r}}\|}$, where $\overline{\bm{r}} = \mathbb{E}_i[\bm{r}_i]$ is the population response averaged over all stimuli. These unit vectors are then used to construct the similarity matrix according to
$$S_{ij}^{\mathrm{data}} = \bm{e}_i \cdot \bm{e}_j \tag{2}$$
for stimuli $i$ and $j$.
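The scaling, centering and normalization steps that lead to Eq. 2 can be sketched in plain Python as follows. This is an illustration under our own assumptions, not the authors' code: responses are stored as `responses[i][a]` (stimulus, neuron), the weights come from Eq. 1, and the function name is hypothetical.

```python
import math


def similarity_matrix(responses, weights):
    """S^data_ij = e_i . e_j (Eq. 2).

    responses[i][a]: raw response rho_ai of neuron a to stimulus i.
    weights[a]: signal-to-noise weight w_a from Eq. 1.
    """
    # scale each neuron by its reliability: r_ai = w_a * rho_ai
    scaled = [[w * x for w, x in zip(weights, resp)] for resp in responses]
    n_stim = len(scaled)
    # population response averaged over all stimuli
    r_bar = [sum(col) / n_stim for col in zip(*scaled)]
    units = []
    for r in scaled:
        centered = [x - m for x, m in zip(r, r_bar)]
        norm = math.sqrt(sum(c * c for c in centered))
        units.append([c / norm for c in centered])  # centered unit vector e_i
    # pairwise dot products of unit vectors are cosine similarities
    return [[sum(p * q for p, q in zip(ei, ej)) for ej in units] for ei in units]


S = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights=[1.0, 1.0])
```

Because the vectors are centered before normalization, two stimuli that drive different neurons (rows 0 and 1 here) get a negative similarity of about -0.8 rather than 0.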
2.1 Stability across animals and days
Averaging the responses to repeated presentations of the oracle images reduces the influence of neural noise on the representational similarity metric defined in Eq. 2, and allows us to examine its stability across scans (i.e. different selections of neurons). For a given oracle image $i$ with $T$ repeats, we treat the individual trials as if they were different images $i_1, \dots, i_T$, and calculate their similarity to the repeated trials $j_1, \dots, j_T$ of another oracle image $j$ in every combination. The oracle similarity is defined as the mean over all of these trial similarity values,
$$S_{ij}^{\mathrm{oracle}} = \mathbb{E}_{t_i, t_j}\left[S_{i_{t_i} j_{t_j}}^{\mathrm{data}}\right], \tag{3}$$
where the trivial self-similarities $S_{i_t i_t}^{\mathrm{data}} = 1$ are excluded when $i = j$.
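Eq. 3 can be sketched as follows, assuming each trial has already been turned into a centered unit vector as in Eq. 2. The nested-list layout `unit_trials[i][t]` and the function names are hypothetical.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def oracle_similarity(unit_trials, i, j):
    """S^oracle_ij (Eq. 3): mean trial-by-trial similarity of oracle images i and j.

    unit_trials[i][t] is the centered unit vector e for repeat t of oracle image i.
    """
    values = [
        dot(unit_trials[i][ti], unit_trials[j][tj])
        for ti, tj in product(range(len(unit_trials[i])), range(len(unit_trials[j])))
        if not (i == j and ti == tj)  # skip the trivial self-similarity of 1
    ]
    return sum(values) / len(values)


trials = [[[1.0, 0.0], [0.6, 0.8]], [[0.0, 1.0], [0.8, 0.6]]]
same = oracle_similarity(trials, 0, 0)   # only cross-repeat pairs: 0.6
cross = oracle_similarity(trials, 0, 1)  # all 2 x 2 trial combinations: 0.64
```

Excluding the self-pairs matters for the diagonal: otherwise $S_{ii}^{\mathrm{oracle}}$ would be inflated towards 1 regardless of how reliable the responses are.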
We found that the neural representational similarity between images is stable across scans and across mice in primary visual cortex (Fig. 1A). When the images (rows and columns) are ordered for better visualization, a structure consistent across scans becomes visible, revealing how the images cluster. We further index the matrix for scan $h$ as $S_{ij}^{\mathrm{oracle},h}$ and compare the fluctuation across scans
$$\Delta S_{h,i,j}^{\mathrm{scan}} = S_{ij}^{\mathrm{oracle},h} - \mathbb{E}_h\left[S_{ij}^{\mathrm{oracle},h}\right], \tag{4}$$
and the fluctuation across repeats
$$\Delta S_{h,i,t_1,t_2}^{\mathrm{repeat}} = S_{i_{t_1} i_{t_2}}^{\mathrm{data},h} - S_{ii}^{\mathrm{oracle},h}. \tag{5}$$
We observe a much narrower distribution for $\Delta S^{\mathrm{scan}}$ than for $\Delta S^{\mathrm{repeat}}$ (Fig. 1C), suggesting that the variability due to the selection of neurons (scans) is much lower than the single-trial variability of responses to the same image.
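The comparison behind Fig. 1C amounts to computing Eqs. 4 and 5 and comparing the spread of the two distributions. The sketch below uses toy numbers, not the recorded data, and summarizes each distribution by its standard deviation; that summary statistic and the function names are our choices.

```python
from statistics import mean, pstdev


def scan_fluctuations(oracle_by_scan):
    """Delta S^scan (Eq. 4) for every scan h and image pair (i, j).

    oracle_by_scan[h][i][j] holds S^{oracle,h}_ij.
    """
    n = len(oracle_by_scan[0])
    deltas = []
    for i in range(n):
        for j in range(n):
            across_scans = [S[i][j] for S in oracle_by_scan]
            avg = mean(across_scans)  # E_h[S^{oracle,h}_ij]
            deltas.extend(s - avg for s in across_scans)
    return deltas


def repeat_fluctuations(trial_sims, oracle_ii):
    """Delta S^repeat (Eq. 5): single-trial similarities of one image minus S^oracle_ii."""
    return [s - oracle_ii for s in trial_sims]


d_scan = scan_fluctuations([[[1.0, 0.5], [0.5, 1.0]], [[1.0, 0.7], [0.7, 1.0]]])
d_repeat = repeat_fluctuations([0.1, 0.7, 0.4], oracle_ii=0.4)
narrower = pstdev(d_scan) < pstdev(d_repeat)  # True for these toy values
```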
2.2 Denoising neural responses with a predictive model
Most images in our experiments were presented only once to maximize the diversity of stimuli, so $S^{\mathrm{oracle}}$ is not available for them, while $S^{\mathrm{data}}$ is too noisy for our purpose. To exploit the neural responses to non-oracle images, we first train a predictive model to denoise the data. The predictive model consists of a simple 3-layer CNN with skip connections [17, 18]. It takes images as inputs and predicts neural responses with a linear readout at the last layer. In addition, behavioral data such as pupil position and size, as well as running speed on the treadmill, are fed to the model to account for the effect of non-visual variables.
The predicted response of neuron $a$ to stimulus $i$ is denoted $\widehat{\rho}_{ai}$; the model is trained to predict $\rho_{ai}$ well [18]. The correlation between $\widehat{\rho}_a$ and $\rho_a$ is denoted $v_a$ and indicates how well neuron $a$ is predicted. The scaled model response is defined as $\widehat{r}_{ai} = w_a v_a \widehat{\rho}_{ai}$, with the signal-to-noise weight $w_a$ from Eq. 1, and the corresponding population response is denoted $\widehat{\bm{r}}_i$. The similarity matrix for scaled model responses is calculated in the same way as Eq. 2,
$$S_{ij}^{\mathrm{model}} = \widehat{\bm{e}}_i \cdot \widehat{\bm{e}}_j. \tag{6}$$
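The model-based scaling $\widehat{r}_{ai} = w_a v_a \widehat{\rho}_{ai}$ can be sketched as below. We assume $v_a$ is the Pearson correlation across stimuli and use a hypothetical neuron-major layout `pred[a][i]`; the returned per-stimulus vectors would then be centered and normalized exactly as in Eq. 2 to obtain Eq. 6.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def scaled_model_responses(pred, meas, weights):
    """Population vectors r_hat_i with r_hat_ai = w_a * v_a * rho_hat_ai.

    pred[a][i], meas[a][i]: predicted and measured responses of neuron a to stimulus i.
    Returns one population vector per stimulus.
    """
    per_neuron = []
    for rho_hat, rho, w in zip(pred, meas, weights):
        v = pearson(rho_hat, rho)  # prediction quality v_a of neuron a
        per_neuron.append([w * v * x for x in rho_hat])
    # transpose from neuron-major to stimulus-major layout
    return [list(col) for col in zip(*per_neuron)]


r_hat = scaled_model_responses(
    pred=[[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    meas=[[2.0, 4.0, 6.0], [1.0, 3.0, 2.0]],
    weights=[2.0, 1.0],
)
```

The second neuron is predicted only moderately well ($v_a = 0.5$), so its contribution is halved on top of its signal-to-noise weight.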
Similarity matrices for the same set of oracle images are shown in Fig. 1B, each from a model trained on the corresponding scan. The structure in the similarity of measured neural responses, $S^{\mathrm{oracle}}$, is also present in the model response similarities, but it is more prominent for the model responses. A scatter plot of data and model similarities, $S_{ij}^{\mathrm{oracle}}$ versus $S_{ij}^{\mathrm{model}}$ (Fig. 1D), shows a high correlation ($r = 0.73$), although the model similarities span a wider range. In the same plot we also show the correlation between $S^{\mathrm{oracle}}$ and the single-trial similarity values $S^{\mathrm{data}}$ from which it is estimated, and find $S^{\mathrm{model}}$ to be much less noisy than $S^{\mathrm{data}}$.
Using model neuron responses as a proxy for the real neurons has three major benefits. First, the outputs are deterministic, eliminating the random noise component. Second, the predictive model was heavily regularized during training, so these deterministic responses are more likely to reflect reliable visual features. Third, the model's shifter and modulator circuits [17] account for irrelevant non-visual factors such as eye and body movements, and can thereby extract more of the purely visually driven responses.
With the help of the predictive model, we can obtain cleaner responses for the 5000 non-oracle images even though each was measured only once. We used the similarity matrices averaged over 8 scans as the regularization target. Two examples of the model neural similarity for the 100 oracle images are shown in Fig. 2. Note, however, that our main results do not use this $100\times 100$ matrix, but only the $5000\times 5000$ matrix from non-oracle trials. Oracle trials are used only for evaluating the predictive models, assigning neuron-specific weights, and the demonstrations in Figs. 1 and 2.
3 Neural regularization by joint training
To regularize a standard machine learning model with the representational similarity matrix obtained from neural data [19], we jointly train the model with a similarity loss in addition to its original task-defined loss (Fig. 3; see also [20] and [21] for related approaches based on fMRI and on other deep neural networks, respectively). The full loss function contains two terms:
$$L = L_{\mathrm{task}} + \alpha L_{\mathrm{similarity}}. \tag{7}$$
The first term is a conventional loss that defines performance on the task, such as classification or 1-shot learning. In this section, we implement grayscale CIFAR-10 classification, hence we use a cross-entropy loss. The second term is the penalty that favors brain-like representations, with a coefficient $\alpha$ determining the regularization strength.
For any pair of images that were shown to the mice, we already have their representational similarity from models predicting neural data (Eq. 6). Since we are now comparing similarity for two models, a neural predictive model and a task model based on a convolutional neural network, we denote the former by ${S}_{ij}^{\mathrm{neural}}$ and the latter by ${S}_{ij}^{\mathrm{CNN}}$. We want ${S}^{\mathrm{CNN}}$ to approximate ${S}^{\mathrm{neural}}$ well.
We define the similarity loss for image $i$ and image $j$ as
$$L_{\mathrm{similarity}} = \left[\mathrm{arctanh}\left(S_{ij}^{\mathrm{CNN}}\right) - \mathrm{arctanh}\left(S_{ij}^{\mathrm{neural}}\right)\right]^2. \tag{8}$$
The $\mathrm{arctanh}$ remaps the similarities from the interval $[-1, 1]$ to $(-\infty, \infty)$. This is analogous to the Fisher transform, which uses the same $\mathrm{arctanh}$ function to compute confidence intervals for correlation coefficients by reparameterizing them to follow nearly normal distributions. When similarity values are not too close to $-1$ or $1$, the loss is closely related to the sample-based centered kernel alignment (CKA) index [22, 23, 24].
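Eqs. 7 and 8 translate directly into a few lines. The sketch below is ours: the clipping constant `eps` and averaging the pair losses over a batch are assumptions, since the text does not state how values of exactly $\pm 1$ or multiple pairs are handled. In a PyTorch training loop the same expression could be written with `torch.atanh` so that gradients reach the CNN.

```python
import math


def similarity_loss(s_cnn, s_neural, eps=1e-6):
    """Eq. 8 for one image pair: squared difference of arctanh-transformed similarities."""
    def clip(s):
        # keep s strictly inside (-1, 1) so that atanh stays finite (our safeguard)
        return max(-1.0 + eps, min(1.0 - eps, s))

    return (math.atanh(clip(s_cnn)) - math.atanh(clip(s_neural))) ** 2


def full_loss(task_loss, cnn_sims, neural_sims, alpha):
    """Eq. 7, with L_similarity averaged over a batch of image pairs (assumed)."""
    sim = sum(similarity_loss(a, b) for a, b in zip(cnn_sims, neural_sims)) / len(cnn_sims)
    return task_loss + alpha * sim


# The same gap of 0.05 costs far more near the edge of [-1, 1] than near 0
near_zero = similarity_loss(0.05, 0.0)
near_one = similarity_loss(0.95, 0.9)
total = full_loss(task_loss=1.2, cnn_sims=[0.5, 0.2], neural_sims=[0.5, 0.2], alpha=20)
```

The stretching near $\pm 1$ is the point of the transform: a mismatch between two highly similar images is penalized about 50 times more strongly than an equal mismatch near zero similarity.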
Intuitively, $S^{\mathrm{CNN}}$ is the cosine similarity of the convolutional features that images $i$ and $j$ activate. Though V1 responses are thought to encode low-level features, there is no principled way to determine a priori which single model layer corresponds to V1. We therefore flexibly combine feature similarities from a selection of layers instead of assigning V1 to a specific one. Specifically, we calculate similarity for $K$ uniformly spaced convolutional layers and average the results with trainable weights. The weights are the outputs of a softmax function and are therefore guaranteed to be positive and to sum to one. For each of the $K$ layers we compute the cosine similarity as
$$S_{ij}^{\mathrm{CNN},k} = \frac{\left(\bm{f}_i^{(k)} - \overline{\bm{f}}^{(k)}\right) \cdot \left(\bm{f}_j^{(k)} - \overline{\bm{f}}^{(k)}\right)}{\left\|\bm{f}_i^{(k)} - \overline{\bm{f}}^{(k)}\right\| \left\|\bm{f}_j^{(k)} - \overline{\bm{f}}^{(k)}\right\|}, \tag{9}$$
where $\bm{f}_i^{(k)}$ is the concatenated convolutional feature vector for image $i$ at layer $k$, and $\overline{\bm{f}}^{(k)} = \mathbb{E}_i\left[\bm{f}_i^{(k)}\right]$ is its mean over images. The final model similarity is a combination of all selected layers,
$$S_{ij}^{\mathrm{CNN}} = \sum_k \gamma_k S_{ij}^{\mathrm{CNN},k}, \tag{10}$$
where $\gamma_k$ is a trainable probability with $\sum_k \gamma_k = 1$ and $\gamma_k \ge 0$. This means that the objective function can choose the layers at which to match the similarity, but because the softmax weights $\gamma_k$ sum to one, it cannot avoid matching the neural similarity altogether. In principle all convolutional layers can be included, but we used only 5 in our simulations (layers 1, 5, 9, 13 and 17 of ResNet18). Preliminary analysis shows that after training a single layer dominates Eq. 10, typically layer 5, the last layer of the first ResBlock group. More details are included in the supplementary materials.
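A plain-Python sketch of Eqs. 9 and 10, including the softmax parameterization of $\gamma_k$. Here `f_mean` is passed in by the caller: how $\overline{\bm{f}}^{(k)}$ is estimated during training (for instance over a batch) is not specified in the text, so that part is an assumption, as are all function names.

```python
import math


def centered_cosine(f_i, f_j, f_mean):
    """Eq. 9: cosine similarity of mean-centered feature vectors at one layer."""
    c_i = [a - m for a, m in zip(f_i, f_mean)]
    c_j = [b - m for b, m in zip(f_j, f_mean)]
    num = sum(a * b for a, b in zip(c_i, c_j))
    return num / (math.sqrt(sum(a * a for a in c_i)) * math.sqrt(sum(b * b for b in c_j)))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_similarity(per_layer_sims, logits):
    """Eq. 10: S^CNN_ij = sum_k gamma_k S^{CNN,k}_ij with gamma = softmax(logits)."""
    gammas = softmax(logits)
    return sum(g * s for g, s in zip(gammas, per_layer_sims))


# Equal logits give gamma_k = 0.2 for K = 5 layers, i.e. a plain average
gammas = softmax([0.0] * 5)
s_layer = centered_cosine([2.0, 1.0], [3.0, 1.0], f_mean=[1.0, 1.0])
s_cnn = combined_similarity([0.1, 0.3, 0.5, 0.7, 0.9], logits=[0.0] * 5)
```

Training the logits instead of $\gamma_k$ directly keeps the weights on the probability simplex without any explicit constraint; a large logit for one layer drives its weight towards 1, matching the collapse reported in the supplementary materials.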
In each training step, we first process a batch of CIFAR images to calculate the classification loss $L_{\mathrm{task}}$, and subsequently process a batch of image pairs sampled from the stimuli used in the experiments to calculate the similarity loss $L_{\mathrm{similarity}}$ with respect to the precomputed $S^{\mathrm{neural}}$ matrix. The gradient of the full loss affects the CNN kernel weights through both loss terms.
4 Results
4.1 Robustness against random noise
The similarity loss plays the role of a regularizer: it biases the original CNN towards a more brain-like representation. We observed that the CNN model becomes more robust to random noise when neural regularization is used. Compared to a ResNet18 [1] trained without any regularization ('None' in Fig. 4A), the same architecture equipped with the neural regularizer ('Neural (model)' in Fig. 4) performed substantially better on noisy input images ($\sim$50% vs. $\sim$20% at the highest noise level). In other words, models whose features are more neural are less vulnerable to random noise in their inputs. To strengthen this conclusion, we also regularized the model with a shuffled $S^{\mathrm{neural}}$ matrix ('Shuffle' in Fig. 4) or with the feature similarity matrix of the conv3_1 layer of a VGG19 model pre-trained on ImageNet ('VGG' in Fig. 4). This VGG layer has been reported to be most similar to animal V1 [16]. Both regularizers improve model robustness to some degree, but neither as much as the neural regularizer.
Finally, we also regularized the model directly with a similarity matrix from the actual data ('Neural (data)' in Fig. 4), using $S^{\mathrm{data}}$ (Eq. 2) instead of $S^{\mathrm{model}}$ (Eq. 6). We did not observe the same boost in robustness. We attribute this to the high variability of the neural responses, which highlights the need for a well-trained predictive model. Moreover, if we view the matrix in the 'Shuffle' control as the feature similarity of a poorly trained predictive model, the difference between 'Neural (model)' and 'Shuffle' again shows the importance of a well-trained one. Only with a strong predictive system identification model as a denoiser were we able to reveal the underlying representational structure hidden in the noisy neural data.
We observed the same results when training ResNet34 models on grayscale CIFAR-100 (Fig. 4B). We also tested how the regularization strength affects model performance, and observed a continuous increase in robustness as we increased the regularization strength. More details are included in the supplementary materials.
All models are trained by stochastic gradient descent for 40 epochs with batch size 64. The learning rate starts at 0.1 and decays by a factor of 0.3 every 4 epochs, but resets to 0.1 after the 20th epoch. Fig. 4 reports the mean classification accuracy on the CIFAR-10/100 test sets over 5 random seeds. In our current setting, the same number of images is passed through the classification pathway and the neural pathway, so training takes roughly twice as long as normal training. Training one model takes about 4.5 hours on a single TITAN RTX GPU. We used PyTorch [25] for model training.
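The learning-rate schedule above can be written as a small function. We read "decays by 0.3" as multiplication by 0.3 and count epochs from 0; both are interpretations rather than details stated in the text. In PyTorch, such a function divided by the base rate could serve as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR`.

```python
def learning_rate(epoch, base_lr=0.1, factor=0.3, step=4, restart=20):
    """Step decay with a single restart (0-indexed epochs; interpretation assumed).

    Epochs 0-19: 0.1 multiplied by 0.3 every 4 epochs; epoch 20 resets to 0.1
    and the same decay repeats for epochs 20-39.
    """
    e = epoch % restart  # position within the current 20-epoch cycle
    return base_lr * factor ** (e // step)


schedule = [learning_rate(epoch) for epoch in range(40)]
```

Under this reading the rate falls to $0.1 \times 0.3^4 = 8.1\times 10^{-4}$ by epoch 19 before the restart.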
4.2 Robustness against adversarial attack
We are also interested in whether neural regularization provides robustness to adversarial attacks. Since adversarial examples and their clean counterparts elicit the same percept by definition, it is plausible that their neural representations are also close to each other. Hence a model with a more neural representation may be more invariant to adversarial noise. We evaluated model robustness following a recently published guideline [26], using the well-tested attack implementations provided by Foolbox [27].
Our evaluation metric follows [28]. In a nutshell, we strive to find adversarial perturbations (i.e. perturbations that flip the label to any class other than the ground-truth class) with the minimum norm (either $L_2$ or $L_\infty$) for each of 1000 test samples. We then compute the median perturbation size across all samples as the final robustness score (higher is better).
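The scoring step can be sketched as follows, assuming the attacks have already produced, for each test sample, the smallest adversarial example they found. Images are flattened to lists, and treating samples for which every attack failed as infinitely far away is our assumption; [28] may handle them differently.

```python
import math
from statistics import median


def lp_norm(delta, p):
    """L_p norm of a flattened perturbation; p may be math.inf."""
    if p == math.inf:
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)


def robustness_score(originals, adversarials, p):
    """Median minimal perturbation size over samples (higher is more robust).

    adversarials[n] is the smallest adversarial example found for sample n,
    or None if every attack failed (counted as infinitely robust: an assumption).
    """
    dists = []
    for x, x_adv in zip(originals, adversarials):
        if x_adv is None:
            dists.append(math.inf)
        else:
            dists.append(lp_norm([a - b for a, b in zip(x_adv, x)], p))
    return median(dists)


clean = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
found = [[0.3, 0.4], [0.1, 0.0], None]
score_l2 = robustness_score(clean, found, p=2)            # median of 0.5, 0.1, inf
score_linf = robustness_score(clean, found, p=math.inf)   # median of 0.4, 0.1, inf
```

Using the median rather than the mean keeps the score well defined even when a few samples resist all attacks.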
Besides the current state-of-the-art attacks for $L_2$ [26] and $L_\infty$ [29], we also deployed a recently developed gradient-based version of the decision-based boundary attack [30], which surpasses [26] in terms of query efficiency and the size of the minimal adversarial perturbations. In short, [30] starts from a natural input sample that is classified differently from the original image (for which we aim to generate an adversarial example). The algorithm then performs a line search between the two images to find the decision boundary of the model. The gradient of the difference between the two largest logits allows us to estimate the local geometry of the decision boundary. Using this geometry, we can compute the optimal adversarial perturbation that (a) takes us exactly to the boundary (in case we have drifted slightly away from it), (b) stays within the valid pixel bounds, (c) minimizes the distance to the original image, and (d) is not too far from the current perturbation (to make sure we stay within the region in which the linear approximation of the boundary is valid). Our gradient-based version of the decision boundary attack therefore provides a stringent test of the adversarial robustness of our neurally regularized models.
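Only the first step of this attack, the line search towards the decision boundary, is simple enough to sketch here; the gradient-based steps (a) to (d) are omitted. The sketch uses a toy linear "classifier" and a hypothetical function name; it is not the implementation of [30] or of Foolbox.

```python
def boundary_line_search(x_orig, x_start, is_adversarial, steps=30):
    """Binary search for the decision boundary on the segment x_orig -> x_start.

    x_start must already be adversarial; the returned point stays on the
    adversarial side, within 2**-steps of the boundary (as a segment fraction).
    """
    lo, hi = 0.0, 1.0  # fraction of the way from x_orig towards x_start

    def point(t):
        return [o + t * (s - o) for o, s in zip(x_orig, x_start)]

    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if is_adversarial(point(mid)):
            hi = mid  # boundary lies closer to the original image
        else:
            lo = mid
    return point(hi)


# Toy classifier: the label flips once x[0] + x[1] exceeds 1
def flipped(x):
    return x[0] + x[1] > 1.0


x_b = boundary_line_search([0.0, 0.0], [1.0, 1.0], flipped)
```

Returning the upper end of the bracket guarantees that the point handed to the subsequent gradient steps is still misclassified.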
To ensure that we evaluate and compare all models fairly, we performed an extensive hyperparameter search and always selected the optimal combination. Since our gradient-based boundary attack proved more effective than [26] on all models tested here, in our final evaluation we deployed only the gradient-based boundary attack for $L_2$, and Projected Gradient Descent (PGD) [29] for $L_\infty$. For our gradient-based boundary attack, we tested step sizes of {0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.03}, and for PGD we tested step sizes of {$10^{-6}$, $10^{-5}$, $10^{-4}$, $10^{-3}$, $10^{-2}$, $10^{-1}$, 1} with {10, 30, 50, 100, 200} iterations.
Fig. 5 shows that regularizing models with neural representational similarity improves robustness against adversarial attacks. The model with the smallest adversarial perturbations (i.e. the most fragile) is the vanilla model trained without any regularization (median perturbation of $0.0025$ ($L_\infty$) and $0.09$ ($L_2$)). Regularizing with a random similarity matrix (median perturbation of $0.003$ ($L_\infty$) and $0.11$ ($L_2$)) or with the similarity of VGG features (median perturbation of $0.0028$ ($L_\infty$) and $0.11$ ($L_2$)) increases robustness. The strongest increase in robustness, in both metrics, is provided by regularization with the brain's representations learned from neural data (median perturbation of $0.0034$ ($L_\infty$) and $0.13$ ($L_2$)).
We additionally performed a more thorough experiment with a few more types of attacks, and also examined the $L_0$ and $L_1$ metrics. More details are included in the supplementary materials.
We observed increased robustness across several metrics, which is quite remarkable given that current defense methods, in particular adversarial training, tend to overfit strongly to the metrics on which they are trained [28] and are often less robust on other $L_p$ metrics than undefended models.
5 Conclusion and discussion
Neuroscience has often provided inspiration to machine learning, but it has lacked methods to directly translate neurophysiological recordings into improvements of artificial neural networks. Here, we have shown that regularization with neural data is a promising tool to create an inductive bias towards more robust inference. In particular, the mouse brain has evolved an image representation that is capable of solving difficult machine learning tasks, yet is more robust than conventional models. Specifically, we demonstrated that when these measured representations are incorporated into a general machine learning model by matching representational similarity, they yield more robust machine learning algorithms (robustness to random noise and adversarial attacks). Critically, our predictive computational model provided a better assessment of the representation than raw neural data, because it disentangles non-visual features from visual ones and transforms the isolated visual features into a reliable, denoised version of the neural responses. This model-based representation proved useful as a regularization target for machine learning models. We conjecture that further regularizing machine learning models to match the representational similarities of higher-order visual areas beyond V1 will further enhance robustness and generalization outside the training set. Such brain-like representations may help machine learning algorithms ultimately reach human-like performance.
There are at least two ways to regularize CNN models to favor neural representations. One is to match the similarity for every pair of images, as in our approach here. The other is to jointly train a linear readout from intermediate layers of the CNN to predict neural responses directly. However, we argue that the former is a tighter constraint, since a wide range of affine transformations of the CNN representation could be compensated by the linear readout, producing identical predictions of the neural responses while substantially altering the underlying representational similarity in the CNN. For this reason, we chose to regularize our machine learning models to match representational similarity.
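The argument can be checked on a toy example with hypothetical numbers. An invertible linear map $A$ (a special case of the affine transformations above) applied to the CNN features is undone by the readout $A^{-1}\bm{w}$, so the predicted neural responses are identical, yet the representational similarity between two images changes from 0 to about 0.71.

```python
import math


def centered_cosine(F, i, j):
    """Cosine similarity of rows i and j of F after subtracting the mean row."""
    mean = [sum(col) / len(F) for col in zip(*F)]
    ci = [a - m for a, m in zip(F[i], mean)]
    cj = [b - m for b, m in zip(F[j], mean)]
    num = sum(a * b for a, b in zip(ci, cj))
    return num / (math.sqrt(sum(a * a for a in ci)) * math.sqrt(sum(b * b for b in cj)))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


F = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy features of 3 images
w = [2.0, 3.0]                               # linear readout to one "neuron"
A = [[1.0, 1.0], [0.0, 1.0]]                 # invertible transform of the features
A_inv = [[1.0, -1.0], [0.0, 1.0]]

F_t = [matvec(list(zip(*A)), f) for f in F]  # each row f becomes f A
w_t = matvec(A_inv, w)                       # compensating readout A^{-1} w

preds = [sum(a * b for a, b in zip(f, w)) for f in F]
preds_t = [sum(a * b for a, b in zip(f, w_t)) for f in F_t]
```

A readout-based regularizer cannot distinguish `F` from `F_t`, whereas the similarity loss of Eq. 8 can.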
Though the improvement in adversarial robustness from neural regularization is substantial and significant, the current state of the art in $L_\infty$ robustness [29] unsurprisingly remains far more robust than our neurally regularized but otherwise undefended model ($0.029$ vs. $0.0034$). That said, [29] employs an expensive adversarial training procedure that, in contrast to our method, specifically optimizes robustness against $L_\infty$ perturbations. As a side effect, [29] performs significantly worse on metrics it has not been trained on, such as $L_2$ or $L_0$ [28], whereas our method does not overfit to one specific metric. Combining neural regularization with the adversarial training procedure of [29] could potentially lead to even stronger defenses.
Neural regularization is not designed specifically to improve model robustness, but rather to bias any model towards neural features. We expect such an inductive bias to bring other benefits as well, such as improved generalization in domain transfer and lower sample complexity in few-shot learning. A more systematic analysis is ongoing, but preliminary results indicate improvements from neural regularization in these respects as well.
To bias CNN features towards a more brain-like representation, in this study we matched the pairwise cosine similarity for a given set of inputs. This is just one approach to manifold matching in a more general sense [31]. We will explore other metrics and higher-order dependencies in the future.
While our results show the benefit of adopting a more brain-like representation in visual processing, it remains unclear which aspects of the neural representation make it work. We consider this the most important open question, and understanding the principle behind it is essential. We are currently pursuing two approaches. The first is to directly compare the regularized models with the vanilla ones by investigating the features they use: we will examine the tuning properties of model units by finding the input patterns that maximally excite them, and study how neural regularization changes these. The second is to identify which neurons are useful for a more robust representation. We can either find the subset of neurons that are most important for the similarity metric and look for their common properties, or propose a criterion for selecting a particular set of neurons and check whether those neurons alone yield the same robustness gain. If we manage to understand why neural regularization works, we will be able to design or train machine learning models using the underlying principles alone, without actually performing large-scale neural recordings.
A Docker image containing all code and trained models has been prepared (zheli18/neuralreg:neurips19), with JupyterLab as its entry point.
Acknowledgements
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government. FS is supported by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, ZUK 63), the Carl-Zeiss-Stiftung, and Amazon AWS through a Machine Learning Research Award. FS and MB acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A) and the DFG Cluster of Excellence “Machine Learning – New Perspectives for Science”, EXC 2064/1, project number 390727645.
Supplementary materials
Robustness dependence on regularization strength
The training objective combines the task loss and the similarity loss with a relative weight $\alpha$: $L = L_{\mathrm{task}} + \alpha L_{\mathrm{similarity}}$. We tested a range of $\alpha$ values and observed a continuous change in model performance (Fig. 6). Here the similarity matrix is estimated from a single scan, whereas the results in the main text use a similarity matrix averaged over eight scans. The main text uses $\alpha = 20$, which behaves qualitatively the same as $\alpha = 16$ shown here. For each $\alpha$, 2 or 3 random seeds were used.
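To make the role of $\alpha$ concrete, the sketch below computes the combined objective for a toy batch. It assumes, for illustration only, that $L_{\mathrm{similarity}}$ is the mean squared difference between the model and neural cosine-similarity matrices over the off-diagonal image pairs of a batch; the exact definition is the one given in the main text, and all function names here are hypothetical.

```python
import math

def cosine_similarity_matrix(responses):
    """Pairwise cosine similarities between the response vectors of a batch of images.

    Assumes no response vector is all zeros.
    """
    norms = [math.sqrt(sum(v * v for v in r)) for r in responses]
    n = len(responses)
    return [[sum(a * b for a, b in zip(responses[i], responses[j])) / (norms[i] * norms[j])
             for j in range(n)]
            for i in range(n)]

def similarity_loss(model_sim, neural_sim):
    """Assumed form: mean squared difference over the off-diagonal image pairs."""
    n = len(model_sim)
    diffs = [(model_sim[i][j] - neural_sim[i][j]) ** 2
             for i in range(n) for j in range(n) if i != j]
    return sum(diffs) / len(diffs)

def total_loss(task_loss, model_sim, neural_sim, alpha):
    """L = L_task + alpha * L_similarity; alpha = 0 recovers the unregularized baseline."""
    return task_loss + alpha * similarity_loss(model_sim, neural_sim)

# Hypothetical batch of three images. Model features and neural responses may have
# different dimensions: only the 3 x 3 similarity matrices are compared.
model_features = [[1.0, 0.2], [0.9, 0.3], [0.1, 1.0]]
neural_responses = [[2.0, 0.1, 0.0], [1.8, 0.3, 0.1], [0.2, 0.1, 1.5]]
model_sim = cosine_similarity_matrix(model_features)
neural_sim = cosine_similarity_matrix(neural_responses)
for alpha in (0.0, 2.0, 20.0):
    print(alpha, total_loss(0.7, model_sim, neural_sim, alpha))
```

Because the comparison happens between similarity matrices rather than raw responses, the model layer and the neural population do not need the same number of units.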
Combination weights for CNN model similarity
We used trainable weights $\gamma_k$ (Eq. 10 in the main text) to combine the feature similarities of different convolutional layers into the overall similarity of the full model. The $\gamma_k$ are the outputs of a softmax function and share the same initial value. In our simulations $K = 5$ layers are selected, so $\gamma_k = 0.2$ for $k = 1, 5, 9, 13, 17$ at the beginning of training.
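A minimal sketch of this weighting, assuming the $\gamma_k$ are parametrized as the softmax of trainable logits (the logit parametrization and the function names are illustrative assumptions; Eq. 10 in the main text is authoritative). With equal logits, each of the $K = 5$ layers starts at $\gamma_k = 1/5 = 0.2$.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the largest logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layer_similarities(layer_sims, logits):
    """Weighted sum of per-layer similarity matrices with gamma = softmax(logits)."""
    gammas = softmax(logits)
    n = len(layer_sims[0])
    return [[sum(g * s[i][j] for g, s in zip(gammas, layer_sims)) for j in range(n)]
            for i in range(n)]

LAYERS = [1, 5, 9, 13, 17]               # selected convolutional layers
initial_logits = [0.0] * len(LAYERS)     # equal logits give equal weights
gammas = softmax(initial_logits)
print(dict(zip(LAYERS, gammas)))         # every layer gets weight 0.2
```

Because the weights must sum to one, increasing one layer's weight necessarily decreases the others, which is the competitive property discussed next.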
We observed that after joint training the weights usually collapse onto a single layer: $\gamma_k \approx 1$ for one layer and close to 0 for the others. We think this follows directly from the competitive nature of our weight design: once one layer is selected to resemble the neural feature space, joint training keeps pushing it towards the target. The identity of the selected layer, usually the one that is easiest to adapt to the neural feature space, is not deterministic. We examined the final weights of the models in Fig. 6; the weights averaged over random seeds are listed in Tab. 1.
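The collapse can be reproduced in a deliberately simplified toy setting, which is our own illustration and not the paper's training procedure: the per-layer mismatches $\ell_k$ to the neural similarity are held fixed at hypothetical values, and gradient descent is run on the softmax logits of the linear mixture $\sum_k \gamma_k \ell_k$. A linear function over the simplex is minimized at a vertex, so the weights concentrate on a single layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def descend_on_mixture(layer_losses, lr=0.5, steps=4000):
    """Gradient descent on sum_k gamma_k * loss_k with gamma = softmax(theta).

    The gradient with respect to theta_j is gamma_j * (loss_j - mean_loss),
    where mean_loss = sum_k gamma_k * loss_k.
    """
    theta = [0.0] * len(layer_losses)  # equal initial weights, as in the models above
    for _ in range(steps):
        gamma = softmax(theta)
        mean_loss = sum(g * l for g, l in zip(gamma, layer_losses))
        theta = [t - lr * g * (l - mean_loss)
                 for t, g, l in zip(theta, gamma, layer_losses)]
    return softmax(theta)

# Hypothetical fixed mismatches for layers 1, 5, 9, 13, 17.
final_gamma = descend_on_mixture([1.0, 0.2, 0.5, 0.9, 1.2])
print([round(g, 3) for g in final_gamma])  # nearly all weight on layer 5 (index 1)
```

In this toy the winner is always the layer with the smallest fixed mismatch. In joint training the mismatches themselves change as the network weights adapt, which may be one reason the selected layer varies across random seeds.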
| $\alpha$ | $\gamma_1$ | $\gamma_5$ | $\gamma_9$ | $\gamma_{13}$ | $\gamma_{17}$ |
|---|---|---|---|---|---|
| 0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0.33 | 0.67 | 0 | 0 |
| 16 | 0 | 0.67 | 0.33 | 0 | 0 |
| 32 | 0.5 | 0.5 | 0 | 0 | 0 |
For example, $\gamma_5 = 0.67$ for $\alpha = 16$ means that 2 out of 3 random seeds yielded a trained model with $\gamma_5 = 1$. For $\alpha = 0$ the similarity loss contributes no gradient, so the weights stay at their initial value of 0.2. Although the choice of layer is stochastic, the candidate layers are usually close to each other in the depth of the network. Admittedly, more simulations are needed to draw firm conclusions.
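Assuming each seed's weights collapse to an exact one-hot vector, the entries of Tab. 1 are simply per-seed averages. The sketch below reproduces the $\alpha = 16$ row from three one-hot vectors consistent with the text (two seeds selecting layer 5, one selecting layer 9); the helper name is hypothetical.

```python
def average_over_seeds(per_seed_gammas):
    """Element-wise mean of the final gamma vectors from several random seeds."""
    n_seeds = len(per_seed_gammas)
    return [sum(column) / n_seeds for column in zip(*per_seed_gammas)]

# Columns: gamma_1, gamma_5, gamma_9, gamma_13, gamma_17 for alpha = 16.
alpha16_seeds = [
    [0, 1, 0, 0, 0],  # seed 1 selects layer 5
    [0, 1, 0, 0, 0],  # seed 2 selects layer 5
    [0, 0, 1, 0, 0],  # seed 3 selects layer 9
]
row = [round(g, 2) for g in average_over_seeds(alpha16_seeds)]
print(row)  # [0.0, 0.67, 0.33, 0.0, 0.0]
```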
More extensive tests on adversarial robustness
After the submission, we tested our trained models much more thoroughly, using two additional metrics and six additional attacks. The models tested here (‘None’, ‘Shuffle’ and ‘Neural’) are also newer versions, because we have since improved the neural predictive model. In short, more reliably measured neurons now receive even greater weight, which in principle makes the neural similarity matrix less noisy.
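The text does not specify how reliability enters the similarity computation. One plausible scheme, shown purely as a hypothetical sketch and not necessarily the one used for these models, scales each neuron's contribution to the cosine similarity by a non-negative reliability weight; the responses and weights below are made up.

```python
import math

def weighted_cosine_similarity(x, y, weights):
    """Cosine similarity where each neuron's contribution is scaled by a non-negative weight.

    With all weights equal this reduces to the ordinary cosine similarity.
    """
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    return dot / (norm_x * norm_y)

# Hypothetical responses of three neurons to two images; the third neuron is
# treated as unreliable (weight 0), so its strong disagreement is ignored.
image_a = [1.0, 2.0, 5.0]
image_b = [1.0, 2.0, -5.0]
reliability = [1.0, 1.0, 0.0]
print(weighted_cosine_similarity(image_a, image_b, reliability))      # close to 1.0
print(weighted_cosine_similarity(image_a, image_b, [1.0, 1.0, 1.0]))  # negative
```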
The evaluation of the models follows the scheme of [32]. We tested all models on four $L_p$ metrics ($L_0$, $L_1$, $L_2$ and $L_\infty$) with different state-of-the-art attacks (see below). Every model/attack combination was evaluated on the same subset of 1000 samples from the CIFAR-10 validation set. For each sample and each model/attack combination, every attack was run five times per tested hyperparameter setting, in an untargeted attack scenario; for each attack we tested a range of hyperparameters to ensure optimal performance. We used the attack implementations in Foolbox [27]. To obtain the final distortion sizes shown in Fig. 7, we took, for each sample and each model/attack combination, the smallest $L_p$ distance across all tested hyperparameters and repetitions. This scheme is intended to approximate the true minimal adversarial distance as closely as possible. We then averaged this minimal adversarial distance over all 1000 samples to quantify model robustness.
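The aggregation step can be sketched for a single model/attack combination as follows. The data layout (a dictionary from sample index to all distances found across hyperparameters and repetitions) and the use of `math.inf` for failed runs are our assumptions, not Foolbox conventions.

```python
import math

def minimal_adversarial_distance(runs):
    """Aggregate one model/attack combination.

    `runs` maps each sample index to every L_p distance found for that sample across
    all hyperparameter settings and repetitions. Returns the per-sample minimum and
    the mean of those minima over samples.
    """
    per_sample_min = {sample: min(distances) for sample, distances in runs.items()}
    mean_min = sum(per_sample_min.values()) / len(per_sample_min)
    return per_sample_min, mean_min

# Hypothetical distances for three samples (three runs each). A failed run is recorded
# as math.inf, which min() ignores as long as at least one run on that sample succeeded;
# if every run on a sample fails, the mean becomes infinite.
runs = {
    0: [0.5, 0.3, 0.4],
    1: [1.2, 0.9, math.inf],
    2: [0.6, 0.6, 0.7],
}
per_sample, mean_min = minimal_adversarial_distance(runs)
print(per_sample, mean_min)
```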
Across $L_1$, $L_2$ and $L_\infty$ we observe a marked increase in robustness compared to the baseline and control networks. This increase is unlikely to be caused by gradient masking, given that adversarial attacks work equally well on all models under the $L_0$ norm. At the same time, $L_0$ is a special metric: minimal $L_0$ perturbations concentrate large changes on a few pixels, producing strong deviations between the original and adversarial images that are also the most noticeable to humans.
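The contrast between $L_0$ and the other metrics can be illustrated on a hypothetical 100-component image: a perturbation that changes two components strongly is tiny in $L_0$ but large in $L_\infty$, whereas one that changes every component slightly is the opposite. Here $L_0$ counts changed components; counting changed pixels across color channels is a common alternative, so this convention is an assumption.

```python
import math

def lp_distances(original, adversarial):
    """L0, L1, L2 and L-infinity sizes of the perturbation between two flattened images."""
    delta = [a - o for o, a in zip(original, adversarial)]
    return {
        "L0": sum(1 for d in delta if d != 0),      # number of changed components
        "L1": sum(abs(d) for d in delta),            # total absolute change
        "L2": math.sqrt(sum(d * d for d in delta)),  # Euclidean length
        "Linf": max(abs(d) for d in delta),          # largest single change
    }

image = [0.5] * 100                # hypothetical flattened image with values in [0, 1]
sparse = list(image)               # two components changed strongly
sparse[0] = 1.0
sparse[1] = 0.0
dense = [v + 0.01 for v in image]  # every component changed slightly

print(lp_distances(image, sparse))  # small L0, large Linf
print(lp_distances(image, dense))   # L0 = 100, small Linf
```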
The attacks that we applied to the models are as follows:

• Projected Gradient Descent (PGD) [29]. $L_\infty$ iterative gradient attack that maximizes the cross-entropy loss of the true class (equivalently, minimizes its negative) under a fixed $L_\infty$ norm constraint, enforced by projection in each step.
• Projected Gradient Descent with Adam (AdamPGD) [33]. Same as PGD, but using the Adam optimizer for the update steps.
• C&W [26]. $L_2$ iterative gradient attack that relies on the Adam optimizer, a tanh nonlinearity to respect pixel constraints, and a loss function that weighs a classification loss against the distance metric to be minimized.
• Decoupling Direction and Norm Attack (DDN) [34]. $L_2$ iterative gradient attack, proposed as a query-efficient alternative to the C&W attack that requires less hyperparameter tuning.
• Jacobian-based Saliency Map Attack (JSMA) [35]. $L_0$/$L_1$ attack that iterates over saliency maps to find the pixels with the highest potential to change the decision of the classifier.
• SparseFool [36]. A sparse version of DeepFool, which uses a local linear approximation of the geometry of the decision boundary to estimate the optimal step towards the boundary.
• Brendel & Bethge [32]. A novel family of $L_0$/$L_1$/$L_2$/$L_\infty$ attacks that follow the boundary between the adversarial and non-adversarial regions and have been shown to be state-of-the-art on all tested $L_p$ norms.
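PGD, the first attack listed, is simple enough to sketch in full. The example below is a minimal illustration under assumptions of our own: a toy linear binary classifier with logistic loss (so the input gradient is analytic), a fixed $\epsilon$ and a fixed step size. It is not the Foolbox implementation used for the evaluation, and it omits the random start used by Madry et al.

```python
import math

def sign(v):
    """Sign of a number as -1, 0 or +1."""
    return (v > 0) - (v < 0)

def pgd_linf(x0, y, w, b, epsilon, step_size, steps):
    """Untargeted L-infinity PGD against a toy linear binary classifier.

    The classifier scores s(x) = w.x + b and predicts sign(s); the label y is +1 or -1.
    The attack ascends the logistic loss log(1 + exp(-y * s(x))) along the sign of its
    input gradient and, after every step, projects back onto the epsilon-ball around x0
    intersected with the valid pixel range [0, 1].
    """
    x = list(x0)  # deterministic start at the clean input (no random start)
    for _ in range(steps):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # d loss / d x_i = -y * w_i * sigmoid(-margin), with sigmoid(-m) = 1 / (1 + exp(m)).
        coeff = -y / (1.0 + math.exp(margin))
        x = [xi + step_size * sign(coeff * wi) for xi, wi in zip(x, w)]
        # Projection: clip to [x0 - epsilon, x0 + epsilon] and to [0, 1].
        x = [min(max(xi, x0i - epsilon, 0.0), x0i + epsilon, 1.0) for xi, x0i in zip(x, x0)]
    return x

# Hypothetical 3-pixel input that the classifier labels correctly (margin 0.45 > 0).
w, b = [1.0, -2.0, 0.5], -0.2
x_clean, label = [0.6, 0.1, 0.5], 1
x_adv = pgd_linf(x_clean, label, w, b, epsilon=0.2, step_size=0.05, steps=10)
adv_score = sum(wi * xi for wi, xi in zip(w, x_adv)) + b
print(x_adv, adv_score)  # the score changes sign, so the prediction flips
```

Because PGD works at a fixed $\epsilon$, a minimal-distance evaluation like the one described above presumably has to treat $\epsilon$ as one of the tuned hyperparameters.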
References
 [1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [2] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
 [3] Robert Geirhos, Carlos R. M. Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7538–7550. Curran Associates, Inc., 2018.
 [4] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR, pages 1–21, 2019.
 [5] Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In International Conference on Learning Representations, 2019.
 [6] Fabio Anselmi, Lorenzo Rosasco, and Tomaso Poggio. On invariance and selectivity in representation learning. Information and Inference: A Journal of the IMA, 5(2):134–158, 05 2016.
 [7] Tomaso A. Poggio and Fabio Anselmi. Visual Cortex and Deep Networks: Learning Invariant Representations. The MIT Press, 1st edition, 2016.
 [8] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
 [9] Andrea Tacchetti, Leyla Isik, and Tomaso A. Poggio. Invariant recognition shapes neural representations of visual input. Annual Review of Vision Science, 4(1):403–422, 2018.
 [10] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv e-prints, 2013.
 [11] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. arXiv e-prints, May 2019.
 [12] Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
 [13] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLOS Computational Biology, 10(11):1–29, 2014.
 [14] Umut Güçlü and Marcel A. J. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
 [15] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19:356, 2016.
 [16] Santiago A. Cadena, George H. Denfield, Edgar Y. Walker, Leon A. Gatys, Andreas S. Tolias, Matthias Bethge, and Alexander S. Ecker. Deep convolutional models improve predictions of macaque V1 responses to natural images. PLOS Computational Biology, 15(4):e1006897, 2019.
 [17] Fabian Sinz, Alexander S Ecker, Paul Fahey, Edgar Walker, Erick Cobos, Emmanouil Froudarakis, Dimitri Yatsenko, Zachary Pitkow, Jacob Reimer, and Andreas Tolias. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7199–7210. Curran Associates, Inc., 2018.
 [18] Edgar Y. Walker, Fabian H. Sinz, Emmanouil Froudarakis, Paul G. Fahey, Taliah Muhammad, Alexander S. Ecker, Erick Cobos, Jacob Reimer, Xaq Pitkow, and Andreas S. Tolias. Inception in visual cortex: in vivo-silico loops reveal most exciting images. bioRxiv, 2018.
 [19] Nikolaus Kriegeskorte, Marieke Mur, and Peter Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008.
 [20] Nathaniel Blanchard, Jeffery Kinnison, Brandon RichardWebster, Pouya Bashivan, and Walter J. Scheirer. A neurobiological evaluation metric for neural network model search. arXiv e-prints, page arXiv:1805.10726, 2018.
 [21] Patrick McClure and Nikolaus Kriegeskorte. Representational distance learning for deep neural networks. Frontiers in Computational Neuroscience, 10:131, 2016.
 [22] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 367–373. MIT Press, 2002.
 [23] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. J. Mach. Learn. Res., 13(1):795–828, 2012.
 [24] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. arXiv e-prints, page arXiv:1905.00414, 2019.
 [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 [26] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57, 2017.
 [27] Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox: A Python toolbox to benchmark the robustness of machine learning models. arXiv e-prints, page arXiv:1707.04131, 2017.
 [28] Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. arXiv e-prints, page arXiv:1805.09190, 2018.
 [29] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
 [30] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv e-prints, page arXiv:1712.04248, 2017.
 [31] A. Talwalkar, S. Kumar, and H. Rowley. Large-scale manifold learning. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2008.
 [32] Wieland Brendel, Jonas Rauber, Matthias Kümmerer, Ivan Ustyuzhaninov, and Matthias Bethge. Accurate, reliable and fast robustness evaluation. arXiv e-prints, 2019.
 [33] Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv e-prints, page arXiv:1802.05666, Feb 2018.
 [34] Jerome Rony, Luiz G. Hafemann, Luiz S. Oliveira, Ismail Ben Ayed, Robert Sabourin, and Eric Granger. Decoupling direction and norm for efficient gradient-based L2 adversarial attacks and defenses. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
 [35] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387, 2016.
 [36] Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. SparseFool: A few pixels make a big difference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.