Information Robust Dirichlet Networks for Predictive Uncertainty Estimation
Abstract
Precise estimation of uncertainty in predictions for AI systems is a critical factor in ensuring trust and safety. Deep neural networks trained with a conventional method are prone to over-confident predictions. In contrast to Bayesian neural networks that learn approximate distributions on weights to infer prediction confidence, we propose a novel method, Information Robust Dirichlet networks, that learn an explicit Dirichlet prior distribution on predictive distributions by minimizing the expected ${L}_{p}$ norm of the prediction error and penalizing information flow associated with incorrect outcomes. Properties of the new cost function are derived to indicate how improved uncertainty estimation is achieved. Experiments using real datasets show that our technique outperforms by a large margin state-of-the-art neural networks for estimating within-distribution and out-of-distribution uncertainty, and detecting adversarial examples.
I Introduction
Deep learning systems have achieved state-of-the-art performance in various domains [1]. The first successful applications of deep learning include large-scale object recognition [2] and machine translation [3, 4]. While further advances have achieved strong performance and often surpass human-level ability in computer vision [5, 6, 7], speech recognition [8, 9], medicine [10], and bioinformatics [11], other aspects of deep learning are less well understood. Conventional neural networks (NNs) are overconfident in their predictions [12] and provide inaccurate predictive uncertainty [13]. NNs must not only be accurate but also indicate when an error is likely to be made. Interpretability, robustness, and safety are becoming increasingly important as deep learning is deployed across various industries including healthcare, autonomous driving and cybersecurity.
Uncertainty modeling in deep learning is a crucial aspect that has been the topic of various Bayesian neural network (BNN) research studies [14, 15, 16, 17]. BNNs capture parameter uncertainty of the network by learning distributions on weights and estimate a posterior predictive distribution by approximate integration over these parameters. The non-linearities embedded in deep neural networks make the weight posterior intractable, and several tractable approximations have been proposed and trained using variational inference [14, 15, 17, 16, 18], the Laplace approximation [19, 20], expectation propagation [21, 22], and Hamiltonian Monte Carlo [23]. The success of approximate BNN methods depends on how well the approximate weight distributions match their true counterparts, and their computational complexity is determined by the degree of approximation. Most BNNs take more effort to implement and are harder to train in comparison to conventional NNs. Furthermore, approximate integration over the parameter uncertainties increases the test time due to posterior sampling, and yields an approximate predictive distribution using stochastic averaging. Thus, it is of interest to develop methods that provide good uncertainty estimates while reusing the training pipeline and maintaining scalability. To this end, a simple approach was proposed that combines NN ensembles with adversarial training to improve predictive uncertainty estimates in a non-Bayesian manner [24], but it is computationally expensive. It is also known that deterministic NNs are brittle to adversarial attacks [25, 26]. Predictive uncertainty can be used to reason about neural network predictions and detect when a network is likely to make an error, identify anomalous examples, and detect adversarial attacks.
In this paper, we propose Information Robust Dirichlet (IRD) networks that deliver more accurate predictive uncertainty than other state-of-the-art methods by learning how likely class probability assignments are. Our method modifies the output layer of neural networks and the training loss, therefore maintaining computational efficiency and ease of implementation. The contributions are as follows. First, a new training loss based on minimizing the expected ${L}_{p}$ norm of the prediction error is proposed under which the prediction probabilities follow a Dirichlet distribution. A closed-form approximation to this loss is derived, under which a neural network is trained to infer the parameters of a Dirichlet distribution, effectively teaching neural networks to learn distributions over class probability vectors. Second, a regularization loss is used to align the Dirichlet distribution parameters to an information direction that minimizes information flow towards incorrect classes. Third, an analysis is provided that shows how properties of the new loss improve uncertainty estimation. Finally, we demonstrate on real datasets that our technique obtains unmatched success in terms of uncertainty estimation for correct and incorrect predictions, detection of out-of-distribution queries and adversarial attacks.
I-A Related Work
Recently, in [27, 28] the Dirichlet distribution was used to model distributions of class compositions and its parameters were learned by training deterministic neural networks. This approach yields closed-form predictive distributions and outperforms BNNs in uncertainty quantification for out-of-distribution (OOD) and adversarial queries. However, uncertainty estimation performance for within-distribution queries was not studied, and uncertainty estimation for OOD and adversarial queries can be improved. The authors in [27] provide a limited analysis of their loss, while [28] lacks an analysis relating the Dirichlet concentration parameters to its loss and further proposes to use OOD data for learning what is anomalous, which biases the predictive uncertainty of the models.
In contrast, we provide a more thorough analysis of our loss function that yields insights into how neural networks shape Dirichlet distributions on the simplex. Furthermore, our method assigns higher uncertainties to errors while maintaining high confidence for correct predictions, and improves upon uncertainty quantification for OOD and adversarial data.
II Learning Distributions on the Probability Simplex
II-A Probabilistic Framework
Given dataset $\mathcal{D}=\{({\mathbf{x}}_{i},{\mathbf{y}}_{i})\}$, we model the class probability vector ${\mathbf{p}}_{i}$ of sample $i$ as a random vector drawn from a Dirichlet distribution conditioned on the input ${\mathbf{x}}_{i}$ and weights $\bm{\theta}$. A neural network with input ${\mathbf{x}}_{i}$ and output ${\bm{\alpha}}_{i}$ is trained to learn multinomial opinions using the Dirichlet distribution $f({\mathbf{p}}_{i}|{\mathbf{x}}_{i};\bm{\theta})=f({\mathbf{p}}_{i};{\bm{\alpha}}_{i})$ (see (1)). This model can also be interpreted as an explicit prior over class probability distributions [28].
The predictive uncertainty of a classification model trained over this dataset can be expressed as:
$$\begin{aligned}P(y=j|{\mathbf{x}}^{*},\mathcal{D})&=\int P(y=j|{\mathbf{x}}^{*},\bm{\theta})\,p(\bm{\theta}|\mathcal{D})\,d\bm{\theta}\\ &=\int\int P(y=j|\mathbf{p})\,f(\mathbf{p}|{\mathbf{x}}^{*},\bm{\theta})\,d\mathbf{p}\cdot p(\bm{\theta}|\mathcal{D})\,d\bm{\theta}\\ &=\int P(y=j|\mathbf{p})\,f(\mathbf{p}|{\mathbf{x}}^{*},\mathcal{D})\,d\mathbf{p}\end{aligned}$$
The terms above represent data uncertainty, $P(y=j|\mathbf{p})$, distributional uncertainty, $f(\mathbf{p}|{\mathbf{x}}^{*},\bm{\theta})$, and model uncertainty, $p(\bm{\theta}|\mathcal{D})$. The Bayesian hierarchy implies that model uncertainty affects distributional uncertainty, which in turn influences the data uncertainty estimates. In our framework, the additional level of distributional uncertainty is incorporated to control the information spread over the simplex by learning $f(\mathbf{p}|{\mathbf{x}}^{*},\bm{\theta})$ in a robust manner during the training procedure. This in turn regularizes the density $f(\mathbf{p}|{\mathbf{x}}^{*},\mathcal{D})$ to produce improved predictive uncertainty estimates.
Since the posterior $p(\bm{\theta}|\mathcal{D})$ is intractable, approximate variational inference methods may be used in similar spirit to [14, 16] to estimate it. In addition, ensemble approaches are computationally expensive. For clarity in this paper, we assume a point estimate of the weight parameters is sufficient given a large training set and proper regularization control, which yields $f(\mathbf{p}|{\mathbf{x}}^{*},\mathcal{D})\approx f(\mathbf{p}|{\mathbf{x}}^{*},\overline{\bm{\theta}})$. This simplifying approximation was also made in recent works [27, 28].
Conventional NNs for classification trained with a cross-entropy loss and a softmax output layer provide a point estimate of the predictive class probabilities of each example and do not have a handle on the underlying uncertainty. Cross-entropy training can be probabilistically interpreted as maximum likelihood estimation, which cannot infer predictive distribution variance. The softmax layer also tends to inflate the predicted class likelihood due to the exponentiation involved, and this type of training tends to produce overconfident wrong predictions.
II-B Dirichlet Distribution
Outputs of neural networks for classification tasks are probability vectors over classes. The basis of our approach lies in an explicit model of distributional uncertainty that controls the distribution of such probability vectors using the Dirichlet distribution [29, 30]. Given the probability simplex as $\mathcal{S}=\{({p}_{1},\dots,{p}_{K}):{p}_{i}\ge 0,{\sum}_{i}{p}_{i}=1\}$, the Dirichlet distribution is a probability density function on vectors $\mathbf{p}\in \mathcal{S}$ given by
$$f(\mathbf{p};\bm{\alpha})=\frac{1}{B(\bm{\alpha})}\prod _{j=1}^{K}{p}_{j}^{{\alpha}_{j}-1}\tag{1}$$
where $B(\bm{\alpha})={\prod}_{j=1}^{K}\Gamma({\alpha}_{j})/\Gamma({\alpha}_{0})$ is the multivariate Beta function. It is characterized by concentration parameters $\bm{\alpha}=({\alpha}_{1},\dots,{\alpha}_{K})$, here assumed to be larger than unity (for ${\alpha}_{j}<1$ the Dirichlet density becomes inverted, concentrating in the corners of the simplex and along its boundaries). The concentration parameter may be interpreted as how likely a class is relative to others. In the special case of the all-ones $\bm{\alpha}$ vector, the distribution becomes uniform over the probability simplex (see Fig. 1(d)). The mean of the proportions is given by ${\widehat{p}}_{j}={\alpha}_{j}/{\alpha}_{0}$, where ${\alpha}_{0}={\sum}_{j}{\alpha}_{j}$ is the Dirichlet strength.
The Dirichlet distribution is conjugate to the multinomial distribution with posterior parameters updated as ${\alpha}_{j}^{\prime}={\alpha}_{j}+{y}_{j}$ for a multinomial sample $\mathbf{y}=({y}_{1},\mathrm{\dots},{y}_{K})$. For a single sample, ${y}_{j}={I}_{\{j=c\}}$, where $c$ is the index of the correct class. Marginals of the Dirichlet distribution are Beta random variables, ${p}_{j}\sim \text{Beta}({\alpha}_{j},{\alpha}_{0}-{\alpha}_{j})$ with support on $[0,1]$. The $q$-th moment of the Beta distribution $\text{Beta}(a,b)$ is given by
$$\mathbb{E}[{p}^{q}]={\int}_{0}^{1}{p}^{q}\frac{{p}^{a-1}{(1-p)}^{b-1}}{{B}_{u}(a,b)}\,dp=\frac{{B}_{u}(a+q,b)}{{B}_{u}(a,b)}\tag{2}$$
where ${B}_{u}(a,b)=\mathrm{\Gamma}(a)\mathrm{\Gamma}(b)/\mathrm{\Gamma}(a+b)$ is the univariate Beta function. A Dirichlet neural network’s output layer parametrizes the simplex distribution representing the spread of class assignment probabilities. The softmax classification layer is replaced by a softplus activation layer that outputs non-negative continuous values, obtaining
$$\bm{\alpha}={g}_{\alpha}({\mathbf{x}}^{*};\overline{\bm{\theta}})+1$$
that parametrize the density $f(\mathbf{p}|{\mathbf{x}}^{*},\overline{\bm{\theta}})=f(\mathbf{p};\bm{\alpha})$. The posterior distribution $P(y|{\mathbf{x}}^{*},\overline{\bm{\theta}})$ is given by:
$$P(y=j|{\mathbf{x}}^{*};\overline{\bm{\theta}})={\mathbb{E}}_{f(\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})}[P(y=j|\mathbf{p})]=\frac{{\alpha}_{j}}{{\alpha}_{0}}$$
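The output-layer construction above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' TensorFlow implementation; the function names and the example pre-activation values are assumptions made for the sketch.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); the result is always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dirichlet_head(logits):
    """Map raw network outputs to concentrations alpha = softplus(logits) + 1."""
    return [softplus(z) + 1.0 for z in logits]

def predictive_probs(alpha):
    """Posterior predictive P(y = j | x) = alpha_j / alpha_0."""
    alpha0 = sum(alpha)  # Dirichlet strength
    return [a / alpha0 for a in alpha]

# Hypothetical pre-activation outputs for a K = 3 problem.
logits = [4.0, -2.0, 0.5]
alpha = dirichlet_head(logits)
probs = predictive_probs(alpha)
print(alpha, probs)
```

Adding one after the softplus keeps every concentration parameter above unity, which matches the constraint assumed for the Dirichlet density in (1).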
The concentration parameters determine the shape of the Dirichlet distribution on the probability simplex, as is visualized in Fig. 1 for $K=3$. Fig. 1(a) shows a confident prediction characterized by low entropy, (b) shows a more challenging prediction that has higher uncertainty, (c) shows a prediction characterized by high data uncertainty due to class overlap, and (d) shows a flat Dirichlet distribution that arises for an out-of-distribution example.
Predictive entropy measures total uncertainty and may be decomposed into epistemic (or knowledge) uncertainty (arises due to model’s difficulty in understanding inputs) and aleatoric (or data) uncertainty (arises due to class-overlap and noise) [28], given by:
$$H(P(y|{\mathbf{x}}^{*},\overline{\bm{\theta}}))=H({\mathbb{E}}_{f(\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})}[P(y|\mathbf{p})])=-\sum _{j}\frac{{\alpha}_{j}}{{\alpha}_{0}}\log\frac{{\alpha}_{j}}{{\alpha}_{0}}$$
The mutual information between the labels $y$ and the class probability vector $\mathbf{p}$, $I(y,\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})$, captures epistemic uncertainty, and can be calculated by subtracting the expected data uncertainty from the total uncertainty:
$$\begin{aligned}I(y,\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})&=H({\mathbb{E}}_{f(\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})}[P(y|\mathbf{p})])-{\mathbb{E}}_{f(\mathbf{p}|{\mathbf{x}}^{*};\overline{\bm{\theta}})}[H(P(y|\mathbf{p}))]\\ &=-\sum _{j}\frac{{\alpha}_{j}}{{\alpha}_{0}}\left(\log\frac{{\alpha}_{j}}{{\alpha}_{0}}-\psi ({\alpha}_{j}+1)+\psi ({\alpha}_{0}+1)\right)\end{aligned}$$
This metric explicitly captures the spread due to distributional uncertainty and is particularly useful for detection of out-of-distribution and adversarial examples. A variation of it was used in the context of active learning [31].
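Both uncertainty measures are simple functions of the concentration parameters. The sketch below evaluates them with the Python standard library; the `digamma` helper is a hand-written approximation (recurrence plus asymptotic series) introduced here for self-containment, and the three example concentration vectors are hypothetical.

```python
import math

def digamma(x):
    """Digamma psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x        # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def predictive_entropy(alpha):
    """Total uncertainty: entropy of the mean class probabilities alpha_j / alpha_0."""
    alpha0 = sum(alpha)
    return -sum((a / alpha0) * math.log(a / alpha0) for a in alpha)

def mutual_information(alpha):
    """Epistemic uncertainty: total entropy minus expected data entropy."""
    alpha0 = sum(alpha)
    return -sum((a / alpha0) * (math.log(a / alpha0)
                                - digamma(a + 1.0) + digamma(alpha0 + 1.0))
                for a in alpha)

flat = [1.0, 1.0, 1.0]         # uniform Dirichlet, as for an OOD input
overlap = [50.0, 50.0, 1.0]    # two classes overlap: data uncertainty
confident = [100.0, 1.0, 1.0]  # sharp, confident prediction
for name, alpha in [("flat", flat), ("overlap", overlap), ("confident", confident)]:
    print(name, predictive_entropy(alpha), mutual_information(alpha))
```

In this example the class-overlap case has high predictive entropy but low mutual information, while the flat Dirichlet has the highest mutual information of the three, which illustrates why mutual information is suited to flagging out-of-distribution inputs.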
II-C Classification Loss
Available are one-hot encoded labels ${\mathbf{y}}_{i}$ of examples ${\mathbf{x}}_{i}$ with correct class ${c}_{i}$. Treating the Dirichlet distribution ${f}_{{\bm{\alpha}}_{i}}({\mathbf{p}}_{i})$ as a prior on the multinomial likelihood function ${\prod}_{k}{p}_{ik}^{{y}_{ik}}$, one can minimize the negated log-marginal likelihood:
$$-\log\left(\mathbb{E}\left[\prod _{k}{p}_{ik}^{{y}_{ik}}\right]\right)=-\left(\log({\alpha}_{i,{c}_{i}})-\log\Big(\sum _{j}{\alpha}_{ij}\Big)\right)$$
or the Bayes risk of the cross-entropy loss:
$$\begin{aligned}\mathbb{E}\left[-\sum _{k}{y}_{ik}\log{p}_{ik}\right]&=-\sum _{k}{y}_{ik}{\int}_{\mathcal{S}}\log{p}_{ik}\,f({\mathbf{p}}_{i};{\bm{\alpha}}_{i})\,d{\mathbf{p}}_{i}\\ &=-\left(\psi ({\alpha}_{i,{c}_{i}})-\psi \Big(\sum _{j}{\alpha}_{ij}\Big)\right)\end{aligned}$$
where $\psi (\cdot )$ is the digamma function. It was observed in [27] that these loss functions generate excessively high belief masses for classes, which hurts the quantification of uncertainty, and that they are less stable than minimizing the sum of squares of prediction errors. This can be attributed to the fact that these loss functions encourage the maximization of correct class likelihoods.
Unlike conventional cross-entropy training that only seeks to maximize the correct class likelihood, we propose a distance-based objective that minimizes the expected prediction error capturing errors across all classes simultaneously by learning the appropriate Dirichlet concentration parameters that govern the spread of class probability vectors. We propose to minimize the Bayes risk of the prediction error in ${L}_{p}$ space for $p\ge 1$, which is approximated using Jensen’s inequality as
$$\begin{aligned}\mathbb{E}{\parallel {\mathbf{y}}_{i}-{\mathbf{p}}_{i}\parallel}_{p}&\le {\left(\mathbb{E}[{\parallel {\mathbf{y}}_{i}-{\mathbf{p}}_{i}\parallel}_{p}^{p}]\right)}^{1/p}\\ &={\Big(\mathbb{E}[{(1-{p}_{i,{c}_{i}})}^{p}]+\sum _{j\ne {c}_{i}}\mathbb{E}[{p}_{ij}^{p}]\Big)}^{1/p}=:{\mathcal{F}}_{i}(w)\end{aligned}$$
This loss can interpolate between the ${L}_{1}$ and ${L}_{\infty}$ norms, and as $p$ grows large we minimize an approximation to the maximum prediction error, i.e., $\mathbb{E}[{\max}_{k}|{y}_{ik}-{p}_{ik}|]$, which is difficult to optimize directly. Jensen’s inequality yields a tractable upper bound for all values of $p$, and the ${L}_{p}$ loss encompasses higher-order moments of the Dirichlet experiment generated by the NN as opposed to just the bias and variance for the ${L}_{2}$ case. In practice, $p$ is chosen to strike a balance between the correct prediction confidence and uncertainties of errors/out-of-distribution queries.
To calculate each term in ${\mathcal{F}}_{i}(w)$, we note $1-{p}_{i,{c}_{i}}$ has a distribution $\text{Beta}({\alpha}_{i,0}-{\alpha}_{i,{c}_{i}},{\alpha}_{i,{c}_{i}})$ due to mirror symmetry, and ${p}_{ij}$ has distribution $\text{Beta}({\alpha}_{i,j},{\alpha}_{i,0}-{\alpha}_{i,j})$. Using the moment expression (2) for Beta random variables:
$$\begin{aligned}{\mathcal{F}}_{i}(w)&={\left(\frac{{B}_{u}({\alpha}_{i,0}-{\alpha}_{i,{c}_{i}}+p,{\alpha}_{i,{c}_{i}})}{{B}_{u}({\alpha}_{i,0}-{\alpha}_{i,{c}_{i}},{\alpha}_{i,{c}_{i}})}+\sum _{j\ne {c}_{i}}\frac{{B}_{u}({\alpha}_{i,j}+p,{\alpha}_{i,0}-{\alpha}_{i,j})}{{B}_{u}({\alpha}_{i,j},{\alpha}_{i,0}-{\alpha}_{i,j})}\right)}^{\frac{1}{p}}\\ &={\left(\frac{\Gamma({\alpha}_{0})}{\Gamma({\alpha}_{0}+p)}\right)}^{\frac{1}{p}}{\left(\frac{\Gamma\left(\sum _{k\ne c}{\alpha}_{k}+p\right)}{\Gamma\left(\sum _{k\ne c}{\alpha}_{k}\right)}+\sum _{k\ne c}\frac{\Gamma({\alpha}_{k}+p)}{\Gamma({\alpha}_{k})}\right)}^{\frac{1}{p}}\end{aligned}$$
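The Gamma-function form of the loss lends itself to evaluation in log space. The following sketch computes it for a single example with `math.lgamma`; it is an illustrative standalone function rather than the authors' training code, and the name `lp_loss` and the example concentration vectors are assumptions.

```python
import math

def lp_loss(alpha, c, p):
    """Closed-form bound F_i on E||y - p||_p for a Dirichlet(alpha) prediction.

    alpha: concentration parameters, c: index of the correct class, p >= 1.
    """
    alpha0 = sum(alpha)
    rest = alpha0 - alpha[c]  # sum of the incorrect-class concentrations
    # log[Gamma(alpha0) / Gamma(alpha0 + p)], the common normalizer
    log_norm = math.lgamma(alpha0) - math.lgamma(alpha0 + p)
    # log[Gamma(rest + p) / Gamma(rest)] comes from E[(1 - p_c)^p]
    terms = [math.lgamma(rest + p) - math.lgamma(rest)]
    # log[Gamma(alpha_k + p) / Gamma(alpha_k)] comes from E[p_k^p], k != c
    terms += [math.lgamma(a + p) - math.lgamma(a)
              for k, a in enumerate(alpha) if k != c]
    # Log-sum-exp keeps the sum stable when concentrations are large.
    m = max(terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in terms))
    return math.exp((log_norm + log_sum) / p)

# Raising the correct-class concentration lowers the loss (cf. Theorem 1);
# raising an incorrect-class concentration raises it (cf. Theorem 2).
print(lp_loss([2.0, 1.0, 1.0], c=0, p=4), lp_loss([10.0, 1.0, 1.0], c=0, p=4))
print(lp_loss([5.0, 2.0, 1.0], c=0, p=4), lp_loss([5.0, 8.0, 1.0], c=0, p=4))
```

Working with log-Gamma values rather than Gamma values directly avoids overflow, since $\Gamma(\alpha_0+p)$ grows very quickly with the Dirichlet strength.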
The following theorem shows that the loss function ${\mathcal{F}}_{i}$ behaves correctly as information flow towards the correct class increases. This is consistent with a Bayesian Dirichlet experiment in which observing a sample of that class increments the corresponding hyperparameter (see Sec. II-B).
Theorem 1.
For a given sample ${\mathbf{x}}_{i}$ with correct label $c$, the loss function ${\mathcal{F}}_{i}$ is strictly convex and decreases as ${\alpha}_{c}$ increases (and increases when ${\alpha}_{c}$ decreases).
Theorem 1 shows that our objective function encourages the learned distribution of probability vectors to concentrate towards the correct class, consistent with Dirichlet sampling experiments. While increasing information flow towards the correct class reduces the loss, it is also important for the loss to capture elements of incorrect classes. It is expected that increasing information flow towards incorrect classes increases uncertainty. The next result shows that through our loss function the model avoids assigning high concentration parameters to incorrect classes as the model cannot explain observations that are assigned incorrect outcomes.
Theorem 2.
For a given sample ${\mathbf{x}}_{i}$ with correct label $c$, the loss function ${\mathcal{F}}_{i}$ is increasing in ${\alpha}_{j}$ for any $j\ne c$ as ${\alpha}_{j}$ grows.
Theorem 2 implies that our loss function leads the model to push the distribution of class probability vectors away from incorrect classes.
II-D Information Regularization Loss
The classification loss can discover interesting patterns in the data to achieve high classification accuracy. However, the network may learn that certain patterns lead to strong information flow towards incorrect classes, e.g., a common pattern of one correct class might contribute to a large ${\alpha}_{j}$ associated with an incorrect class. While for accuracy this might not be an issue as long as ${\alpha}_{c}$ is larger than the incorrect ${\alpha}_{j}$, it does affect its predictive uncertainty. Thus, it is of interest to minimize the contributions of concentration parameters associated with incorrect outcomes.
Given the auxiliary vector ${\tilde{\bm{\alpha}}}_{i}=(1-{\mathbf{y}}_{i})\odot {\bm{\alpha}}_{i}+{\mathbf{y}}_{i}$, formed by resetting the correct-class concentration parameter ${\alpha}_{{c}_{i}}$ to one, we minimize the following distance function that aligns the concentration parameter vector $\tilde{\bm{\alpha}}$ towards unity:
$$\begin{aligned}{\mathcal{R}}_{i}&\stackrel{\mathrm{def}}{=}\frac{1}{2}{({\tilde{\bm{\alpha}}}_{i}-\mathbf{1})}^{T}\text{diag}(J({\tilde{\bm{\alpha}}}_{i}))({\tilde{\bm{\alpha}}}_{i}-\mathbf{1})\\ &=\frac{1}{2}\sum _{j\ne {c}_{i}}{({\alpha}_{ij}-1)}^{2}\left({\psi}^{(1)}({\alpha}_{ij})-{\psi}^{(1)}({\tilde{\alpha}}_{i0})\right)\end{aligned}\tag{3}$$
where ${\psi}^{(1)}(z)=\frac{d}{dz}\psi (z)$ is the polygamma function of order $1$, and $J(\tilde{\bm{\alpha}})$ denotes the Fisher information matrix $\mathbb{E}[\nabla \log f(\mathbf{p};\tilde{\bm{\alpha}})\nabla \log f{(\mathbf{p};\tilde{\bm{\alpha}})}^{T}]=-\mathbb{E}[{\nabla}^{2}\log f(\mathbf{p};\tilde{\bm{\alpha}})]$. We remark that (3) is not a quadratic function in ${\alpha}_{ij}$ due to the nonlinearity of the polygamma functions and the fact that terms are tied together through the constraint ${\tilde{\alpha}}_{i0}=1+{\sum}_{j\ne c}{\alpha}_{ij}$. This regularization is related to a local approximation of the Rényi information divergence [32, 33] of the Dirichlet distribution $f(\mathbf{p};\tilde{\bm{\alpha}})$ from the uniform Dirichlet $f(\mathbf{p};\mathbf{1})$ given by
$$\begin{aligned}{D}_{u}^{R}(f(\mathbf{p};\tilde{\bm{\alpha}})\parallel f(\mathbf{p};\mathbf{1}))&\cong \frac{u}{2}{(\tilde{\bm{\alpha}}-\mathbf{1})}^{T}J(\tilde{\bm{\alpha}})(\tilde{\bm{\alpha}}-\mathbf{1})\\ &=\frac{u}{2}\Big[\sum _{j\ne c}{({\alpha}_{j}-1)}^{2}\left({\psi}^{(1)}({\alpha}_{j})-{\psi}^{(1)}({\tilde{\alpha}}_{0})\right)\\ &\quad -{\psi}^{(1)}({\tilde{\alpha}}_{0})\sum _{i\ne j,i\ne c,j\ne c}({\alpha}_{i}-1)({\alpha}_{j}-1)\Big]\end{aligned}$$
in the local regime ${\parallel \tilde{\bm{\alpha}}-\mathbf{1}\parallel}_{2}^{2}={\sum}_{j\ne c}{({\alpha}_{j}-1)}^{2}\to 0$. This approximation follows from [34] (p. 2472) after using the second-order Taylor expansion and substituting the Fisher information matrix $J(\tilde{\bm{\alpha}})=\text{diag}({\{{\psi}^{(1)}({\tilde{\alpha}}_{i})\}}_{i=1}^{K})-{\psi}^{(1)}({\tilde{\alpha}}_{0}){\mathbf{1}}_{K\times K}$. The next theorem shows a desirable monotonicity property of the information regularization loss (3).
Theorem 3.
The information regularization loss $\mathcal{R}(\bm{\alpha})$ given in (3) is increasing in ${\alpha}_{j}$ for $j\ne c$.
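The regularizer in (3) can be evaluated directly once a trigamma routine is available. The sketch below uses a hand-written `trigamma` approximation (recurrence plus asymptotic series) so that it runs with the standard library alone; the helper names and example vectors are assumptions made for illustration.

```python
import math

def trigamma(x):
    """Trigamma psi^(1)(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)  # psi1(x) = psi1(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))

def info_reg_loss(alpha, c):
    """Information regularization loss (3) for one example with correct class c."""
    wrong = [a for j, a in enumerate(alpha) if j != c]
    alpha_tilde0 = 1.0 + sum(wrong)  # strength after resetting alpha_c to one
    psi1_total = trigamma(alpha_tilde0)
    return 0.5 * sum((a - 1.0) ** 2 * (trigamma(a) - psi1_total) for a in wrong)

# The penalty vanishes when every incorrect-class concentration equals one
# and grows as an incorrect-class concentration increases.
print(info_reg_loss([7.0, 1.0, 1.0], c=0))
print(info_reg_loss([7.0, 2.0, 1.0], c=0))
print(info_reg_loss([7.0, 3.0, 1.0], c=0))
```

Because the correct-class entry is reset to one before the penalty is computed, the value does not depend on ${\alpha}_{c}$, so the regularizer leaves confidence in the correct class untouched.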
The total loss to be minimized, per example, is:
$${\mathcal{G}}_{i}={\mathcal{F}}_{i}+\lambda {\mathcal{R}}_{i}$$
where $\lambda $ is a nonnegative parameter controlling the tradeoff between minimizing the approximate Bayes risk and the information regularization penalty. The total loss is summed over a batch of training samples, $\mathcal{G}(\bm{\theta})={\sum}_{i=1}^{N}{\mathcal{G}}_{i}(\bm{\theta})$. Training is performed using minibatches with $\lambda $ increasing according to an annealing schedule, e.g., ${\lambda}_{t}=\lambda \min\{\frac{t-{T}_{0}}{T},1\}$ for $t>{T}_{0}$ with rate parameter $T$ (e.g., $T=60$), and ${\lambda}_{t}=0$ for $t\le {T}_{0}$. The parameter ${T}_{0}$ should be chosen large enough to allow the network to learn interesting features useful for classification and to avoid incorporating the regularization effect too early, which may lead to learning difficulties.
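The annealing schedule is easy to state in code. In this sketch the function names are illustrative, the value $T_0=10$ is a hypothetical choice (the text gives only $T=60$ as an example), and the unit of $t$ (step or epoch) is left open because the text does not specify it.

```python
def annealed_lambda(t, lam, t0, T):
    """lambda_t = 0 for t <= t0, then a linear ramp up to lam over T units of t."""
    if t <= t0:
        return 0.0
    return lam * min((t - t0) / T, 1.0)

def total_loss(f_i, r_i, lam_t):
    """Per-example objective G_i = F_i + lambda_t * R_i."""
    return f_i + lam_t * r_i

# Hypothetical settings: lam = 0.5, warm-up T0 = 10, ramp length T = 60.
for t in (5, 10, 40, 70, 200):
    print(t, annealed_lambda(t, lam=0.5, t0=10, T=60))
```

During the warm-up the objective reduces to the classification loss alone, so the regularizer only starts shaping the incorrect-class concentrations after the network has had a chance to learn useful features.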
Theorem 3 combined with Theorem 2 implies that the strength of concentration parameters associated with misleading outcomes is expected to decrease during training. This preferable behavior of our objective function leads to higher uncertainties for misclassifications, because all incorrect-class concentration parameters are driven down rather than one being allowed to grow much larger than the others.
III Experimental Results
All experiments are implemented in TensorFlow [35] and the Adam [36] optimizer was used for training. As recent prior works [27, 28] have shown that Dirichlet NNs outperform BNNs on several benchmark image datasets, we mainly focus on comparing our method with these Dirichlet NNs trained with different loss functions. Comparisons are made with the following methods: (a) L2 corresponds to a deterministic neural network with softmax output and weight decay, (b) Dropout is the uncertainty estimation method of [16], (c) EDL is the evidential approach of [27], (d) RKLPN is the reverse KL divergence-based prior network method of [28], and (e) IRD is our proposed technique.
III-A Fashion-MNIST Dataset
The LeNet CNN architecture with $20$ and $50$ filters of size $5\times 5$ is used for the Fashion-MNIST dataset [37] with $500$ hidden units at the dense layer. The training set contains $60,000$ images and the testing set contains $10,000$. The results were generated with $\lambda =0.5,p=4$. Table I shows the test accuracy on Fashion-MNIST for these methods; IRD is shown to be competitive, assigning low uncertainty to correct predictions and high uncertainty to errors. In general, a small accuracy loss is expected as the NN is trained so that data examples near the decision boundary (likely errors) lie in a high-uncertainty region that might affect predictions of nearby data; this can be mitigated by adjusting $\lambda $ or $p$. However, our results show that accuracy loss is not significant and OOD/adversarial uncertainty quantification improves upon prior methods while maintaining low uncertainty on correct predictions.
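Table I: Test accuracy and median predictive entropy (as a percentage of the maximum entropy) on Fashion-MNIST.
Method | Accuracy (%) | Median % of max-entropy: correct predictions | Median % of max-entropy: errors |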
L2 | 91.4 | 1 | 29 |
Dropout | 91.4 | 7 | 40 |
RKLPN | 92.5 | 21 | 52 |
EDL | 91.6 | 25 | 65 |
IRD | 90.1 | 9 | 100 |
To measure within-distribution uncertainty, Fig. 2 shows the distribution of entropies of predictive distributions for correct and misclassified examples across competing methods. The overconfidence of conventional L2 NNs is evident since the distribution mass of correct and wrong predictions is concentrated on lower uncertainties. The Dirichlet-based methods, EDL and RKLPN, tend to sacrifice correct class confidence for providing higher uncertainties on misclassified examples. IRD offers a drastic improvement over all methods with $63\%$ of the misclassified samples falling within $95\%$ of the max-entropy ($\mathrm{log}10\approx 2.3$), as opposed to $3\%$ and $4\%$ of the misclassified samples of the RKLPN and EDL methods respectively.
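The entropy statistics quoted here (percentage of maximum entropy, and the fraction of samples above a $95\%$ threshold) can be computed as in the sketch below. The helper names and the three predictive distributions are hypothetical and serve only to illustrate the metric, not to reproduce the reported numbers.

```python
import math
from statistics import median

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)

def pct_of_max_entropy(probs):
    """Entropy as a percentage of the maximum log(K) for K classes."""
    return 100.0 * entropy(probs) / math.log(len(probs))

def fraction_near_max(batch, threshold=95.0):
    """Fraction of predictions whose entropy is at least threshold % of log(K)."""
    hits = sum(1 for probs in batch if pct_of_max_entropy(probs) >= threshold)
    return hits / len(batch)

K = 10  # number of classes, so the maximum entropy is log(10)
uniform = [1.0 / K] * K                # maximally uncertain prediction
confident = [1.0] + [0.0] * (K - 1)    # fully confident prediction
hedged = [0.55] + [0.05] * (K - 1)     # intermediate case
batch = [uniform, confident, hedged]
scores = [pct_of_max_entropy(q) for q in batch]
print(scores, median(scores), fraction_near_max(batch))
```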
To evaluate out-of-distribution uncertainty quantification, the model trained on Fashion-MNIST is tested with image data from different datasets. Specifically, IRD is tested on notMNIST [38], which contains only English letters, and OmniGlot [39], which contains characters from multiple alphabets, serving as out-of-distribution data. The uncertainty is expected to be high for all such images as they do not fit into any trained category. Figures 3 and 4 show the empirical CDF of the predictive entropy and mutual information. CDF curves close to the bottom right are more desirable as higher entropy is desired for all predictions. IRD is much more tightly concentrated towards higher entropy values; for notMNIST/OmniGlot, $60\%$/$72\%$ of images have entropy larger than $95\%$ of the max-entropy, while EDL and RKLPN have approximately $5\%$/$10\%$ and $9\%$/$14\%$, respectively.
Adversarial uncertainty quantification on Fashion-MNIST was also evaluated. Fig. 5 shows the adversarial performance when each model is evaluated using adversarial examples generated with the Fast Gradient Sign method (FGSM) [25] for different noise values $\epsilon$, i.e., ${\mathbf{x}}_{adv}=\mathbf{x}+\epsilon\,\text{sgn}({\nabla}_{\mathbf{x}}\mathcal{F}(\mathbf{x},y,w))$. We observe that IRD achieves higher entropy on adversarial examples than other methods as $\epsilon$ increases, while achieving a lower average predictive entropy for $\epsilon=0$ due to the higher confidence of correct predictions. Interestingly, a large entropy is assigned to misclassified samples, as Fig. 2 shows.
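The FGSM update itself is a single signed-gradient step. The sketch below applies it to a toy two-dimensional quadratic loss and estimates the gradient by central differences, both of which are assumptions made to keep the example self-contained; in practice the gradient of the network loss would come from automatic differentiation.

```python
def numerical_grad(loss_fn, x, h=1e-5):
    """Central-difference estimate of the gradient of loss_fn at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((loss_fn(x_plus) - loss_fn(x_minus)) / (2.0 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, loss_fn, eps):
    """x_adv = x + eps * sgn(grad_x loss): one step that increases the loss."""
    grad = numerical_grad(loss_fn, x)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Toy stand-in for the network loss as a function of the input (hypothetical).
def toy_loss(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x = [0.0, 0.0]
x_adv = fgsm(x, toy_loss, eps=0.1)
print(x_adv, toy_loss(x), toy_loss(x_adv))
```

Each input coordinate moves by exactly $\epsilon$ in the direction that increases the loss, so $\epsilon$ directly bounds the perturbation in the ${L}_{\infty}$ norm.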
III-B CIFAR-10 Dataset
A VGG-based CNN architecture consisting of three filter blocks with $64,128,256$ filters respectively and filter sizes $3\times 3$ was used for the CIFAR-10 dataset [40] with $256$ hidden units at the dense layer. The training and testing sets contain $50,000$ and $10,000$ examples, respectively. Regularization parameter $\lambda =0.3$ was adopted with $p=4$. Data augmentation, dropout and batch normalization were used for all methods to mitigate overfitting. Table II shows the test accuracy on CIFAR-10 for these methods; IRD is shown to be competitive, assigning low uncertainty to correct predictions and high uncertainty to errors.
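Table II: Test accuracy and median predictive entropy (as a percentage of the maximum entropy) on CIFAR-10.
Method | Accuracy (%) | Median % of max-entropy: correct predictions | Median % of max-entropy: errors |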
L2 | 85.2 | 1 | 36 |
Dropout | 86.7 | 6 | 48 |
RKLPN | 85.1 | 18 | 51 |
EDL | 87.8 | 24 | 55 |
IRD | 85.6 | 12 | 66 |
Within-distribution uncertainty quantification is evaluated in Fig. 6 which shows the distribution of entropies of predictive distributions for correct and misclassified examples across competing methods. Similar to the previous set of results, conventional L2 NNs yield overconfident predictions and EDL and RKLPN sacrifice correct class confidence for providing higher uncertainties on misclassified examples. IRD offers an improvement over all methods as the tail of the distribution of predictive entropies associated with misclassified examples is more heavily concentrated on higher values, while maintaining an improved correct prediction confidence over other Dirichlet neural networks.
For out-of-distribution testing, IRD is tested on Tiny-ImageNet [41] which contains a small subset of ILSVRC spanning 200 image classes, and SVHN [42] which contains street view house numbers. The uncertainty is expected to be high for all such images as they do not fit into any trained category. Figures 7 and 8 show the empirical CDF of the predictive entropy and mutual information. IRD is shown to improve upon competing methods as it concentrates more heavily towards higher entropy and mutual information values. The benefit is observed for both uncertainty metrics.
The adversarial performance for CIFAR-10 is shown in Fig. 9 under FGSM adversarial attacks as a function of noise $\epsilon$. It is observed that IRD starts at low predictive entropy/mutual information and quickly increases its uncertainty as more adversarial noise is added in the system and the image moves farther away from the data manifold.
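For readers unfamiliar with FGSM, the sketch below shows the perturbation rule $x_{adv}=x+\epsilon\,\mathrm{sign}(\nabla_x \mathrm{loss})$ on a toy logistic-regression model with an analytic input gradient. The model, weights and input are hypothetical stand-ins; the experiments above apply the attack to the CNN, and image pipelines usually also clip the result to the valid pixel range, which is omitted here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def bce_loss(x, y, w, b):
    """Binary cross-entropy of the model on a single example."""
    p = predict(x, w, b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def input_gradient(x, y, w, b):
    """Gradient of the cross-entropy with respect to the input: (p - y) * w."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """One-step attack: move each coordinate by eps along the gradient sign."""
    grad = input_gradient(x, y, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy (hypothetical) model and input.
w, b = [2.0, -1.0], 0.0
x, y = [1.0, 1.0], 1
x_adv = fgsm(x, y, w, b, eps=0.5)
print(x_adv, bce_loss(x, y, w, b), bce_loss(x_adv, y, w, b))
```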
IV Conclusion
In this work, we presented a new method for training Dirichlet neural networks that are aware of the uncertainty associated with predictions. Our training objective fits predictive distributions to data using a classification loss that minimizes the expected prediction error measured in ${L}_{p}$ space, and an information regularization loss that penalizes information flow towards incorrect classes. We derived closed-form expressions for our training loss and desirable properties on how improved uncertainty estimation is achieved. Experimental results were shown on image classification tasks, highlighting improvements in predictive uncertainty estimation for within-distribution, out-of-distribution and adversarial queries made by our method over conventional neural networks with weight decay, Bayesian neural networks, and other recent Dirichlet networks trained with different loss functions.
Appendix
We make use of the following lemmas in the proofs.
Lemma 1.
Consider the digamma function $\psi$. Assuming $x_1>x_2>1$ and $p>0$, the following inequality strictly holds:
$$0<\psi(x_1+p)-\psi(x_1)<\psi(x_2+p)-\psi(x_2)$$
Furthermore, we have $\lim_{x\to\infty}\psi(x+p)-\psi(x)=0$.
Proof.
Since ${x}_{1}>{x}_{2}>1$, we can write ${x}_{1}={s}_{1}+1$ and ${x}_{2}={s}_{2}+1$ for some ${s}_{1}>{s}_{2}$. Upon substitution of the Gauss integral representation $\psi (z+1)=-\gamma +{\int}_{0}^{1}\left(\frac{1-{t}^{z}}{1-t}\right)\mathit{d}t$ (here $\gamma $ is the Euler-Mascheroni constant), we have:
$$\psi ({x}_{1})-\psi ({x}_{2})={\int}_{0}^{1}\left(\frac{{t}^{{s}_{2}}-{t}^{{s}_{1}}}{1-t}\right)\mathit{d}t$$
which is strictly positive since the integrand is positive for $t\in (0,1)$. Using the integral representation again, the inequality $\psi(x_1+p)-\psi(x_1)<\psi(x_2+p)-\psi(x_2)$ is equivalent to:
$${\int}_{0}^{1}\left(\frac{(1-{t}^{p})({t}^{{s}_{2}}-{t}^{{s}_{1}})}{1-t}\right)\mathit{d}t>0$$
which holds since the integrand is positive due to $t^{p}<1$ and $t^{s_1}<t^{s_2}$ for $t\in (0,1)$. The limit of $\psi (x+p)-\psi (x)$ follows from the asymptotic expansion $\psi (x)=\mathrm{log}(x)-\frac{1}{2x}+O\left(\frac{1}{{x}^{2}}\right)$, which yields $\psi (x+p)-\psi (x)\sim \mathrm{log}(1+p/x)-\frac{1}{2(x+p)}+\frac{1}{2x}\to 0$ as $x\to \mathrm{\infty}$. This concludes the proof. ∎
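The property established in this proof, namely that $\psi (x+p)-\psi (x)$ is positive, decreasing in $x$, and vanishing as $x\to \mathrm{\infty}$, can be spot-checked numerically. The sketch below uses a hand-written digamma approximation (recurrence plus asymptotic series) and an arbitrary grid of test points; it is a sanity check under those assumptions, not a proof.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series

def nu(x, p):
    """Digamma increment psi(x + p) - psi(x)."""
    return digamma(x + p) - digamma(x)

def increment_is_positive_and_decreasing(x1, x2, p):
    """Check 0 < nu(x1) < nu(x2) for x1 > x2 > 1."""
    return 0.0 < nu(x1, p) < nu(x2, p)

# Arbitrary test grid of (x1, x2, p) with x1 > x2 > 1 and p > 0.
grid = [
    (x2 + gap, x2, p)
    for x2 in (1.1, 2.0, 5.0, 20.0)
    for gap in (0.5, 3.0)
    for p in (0.5, 2.0, 4.0)
]
all_ok = all(increment_is_positive_and_decreasing(*point) for point in grid)
print(all_ok, nu(1e6, 4.0))
```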
Lemma 2.
Consider the polygamma function of order 1, $\psi^{(1)}(z)=\frac{d}{dz}\psi(z)$. Assuming $x_1>x_2>1$ and $p>0$, the following inequality strictly holds:
$$0<\psi^{(1)}(x_1)-\psi^{(1)}(x_1+p)<\psi^{(1)}(x_2)-\psi^{(1)}(x_2+p)$$
Proof.
Proceeding similarly as in the Proof of Lemma 1, we write ${x}_{1}={s}_{1}+1$ and ${x}_{2}={s}_{2}+1$ for some ${s}_{1}>{s}_{2}$. Upon substitution of the integral representation ${\psi}^{(1)}(z+1)={\int}_{0}^{1}\left(\frac{{t}^{z}}{1-t}\mathrm{ln}\left(\frac{1}{t}\right)\right)\mathit{d}t$, we have:
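$${\psi}^{(1)}({x}_{1})-{\psi}^{(1)}({x}_{2})={\int}_{0}^{1}\left(\frac{{t}^{{s}_{1}}-{t}^{{s}_{2}}}{1-t}\mathrm{ln}\left(\frac{1}{t}\right)\right)\mathit{d}t$$
which is strictly negative since the integrand is negative for $t\in (0,1)$. Using the integral representation again, the inequality $\psi^{(1)}(x_1)-\psi^{(1)}(x_1+p)<\psi^{(1)}(x_2)-\psi^{(1)}(x_2+p)$ is equivalent to:
$${\int}_{0}^{1}\left(\frac{(1-{t}^{p})({t}^{{s}_{2}}-{t}^{{s}_{1}})}{1-t}\mathrm{ln}\left(\frac{1}{t}\right)\right)\mathit{d}t>0$$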
which holds true since $\mathrm{ln}(1/t)>0$ for $t\in (0,1)$. This concludes the proof. ∎
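The integral representation used in this proof can be compared against a direct evaluation of the trigamma function. The sketch below does so with a midpoint rule and a hand-written trigamma approximation, and also spot-checks that ${\psi}^{(1)}$ is strictly decreasing. The step count, tolerances and grid are arbitrary choices made for this example.

```python
import math

def trigamma(x):
    """Trigamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    tail = inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))
    return result + inv + 0.5 * inv2 + tail

def trigamma_integral(z, steps=20000):
    """Midpoint estimate of int_0^1 t^z ln(1/t) / (1 - t) dt = psi1(z + 1)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h  # midpoints avoid the endpoints t = 0 and t = 1
        total += (t ** z) * math.log(1.0 / t) / (1.0 - t)
    return total * h

xs = [1.1 + 0.5 * i for i in range(40)]
is_decreasing = all(trigamma(x + 0.5) < trigamma(x) for x in xs)
gap_positive = all(trigamma(x) - trigamma(x + 2.0) > 0.0 for x in xs)
print(trigamma_integral(1.0), trigamma(2.0), is_decreasing, gap_positive)
```

The positive gap ${\psi}^{(1)}(x)-{\psi}^{(1)}(x+p)$ is the quantity that appears as ${g}^{\prime \prime}$ in the proof of Theorem 1.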
Proof of Theorem 1
Proof.
Taking the logarithm of ${\mathcal{F}}_{i}$, we have:
$$\begin{aligned}
\log\mathcal{F}_i &= \frac{1}{p}\log\left(\frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0+p)}\right)\\
&\quad+\frac{1}{p}\log\left(\frac{\Gamma\left(\sum_{k\ne c}\alpha_k+p\right)}{\Gamma\left(\sum_{k\ne c}\alpha_k\right)}+\sum_{j\ne c}\frac{\Gamma(\alpha_j+p)}{\Gamma(\alpha_j)}\right)
\end{aligned}$$
where the second term is independent of ${\alpha}_{c}$. Letting the first term be denoted as $g({\alpha}_{c}):=\frac{1}{p}\mathrm{log}\left(\frac{\mathrm{\Gamma}({\alpha}_{0})}{\mathrm{\Gamma}({\alpha}_{0}+p)}\right)$, it suffices to show $f({\alpha}_{c}):=\mathrm{exp}(g({\alpha}_{c}))$ is strictly convex and decreasing in ${\alpha}_{c}$.
Differentiating $g({\alpha}_{c})$ twice we obtain:
$$\begin{aligned}
g^{\prime}(\alpha_c) &= \frac{1}{p}\left(\psi(\alpha_0)-\psi(\alpha_0+p)\right)\\
g^{\prime\prime}(\alpha_c) &= \frac{1}{p}\left(\psi^{(1)}(\alpha_0)-\psi^{(1)}(\alpha_0+p)\right)
\end{aligned}$$
Lemmas 1 and 2 then yield that $g^{\prime}(\alpha_c)<0$ and ${g}^{\prime \prime}({\alpha}_{c})>0$ respectively. Differentiating $f({\alpha}_{c})$ twice, we have:
$$\begin{aligned}
f^{\prime}(\alpha_c) &= e^{g(\alpha_c)}g^{\prime}(\alpha_c)\\
f^{\prime\prime}(\alpha_c) &= e^{g(\alpha_c)}\left(g^{\prime\prime}(\alpha_c)+\left(g^{\prime}(\alpha_c)\right)^{2}\right)
\end{aligned}$$
Using the inequalities above and the positivity of ${e}^{g({\alpha}_{c})}$, it follows that $f^{\prime}(\alpha_c)<0$ and ${f}^{\prime \prime}({\alpha}_{c})>0$. Thus, $f({\alpha}_{c})$ is a strictly convex decreasing function in ${\alpha}_{c}$. This concludes the proof. ∎
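The conclusion can be checked numerically by evaluating $f({\alpha}_{c})={\left(\mathrm{\Gamma}({\alpha}_{0})/\mathrm{\Gamma}({\alpha}_{0}+p)\right)}^{1/p}$ on a grid and testing first and second differences. The sketch below works in log space with `math.lgamma`. The value $p=4$ matches the experiments, while the sum of the remaining concentration parameters (nine incorrect classes at $1$) and the grid are assumptions made for this example.

```python
import math

def f(alpha_c, rest_sum, p):
    """(Gamma(a0) / Gamma(a0 + p))^(1/p) with a0 = alpha_c + rest_sum."""
    a0 = alpha_c + rest_sum
    return math.exp((math.lgamma(a0) - math.lgamma(a0 + p)) / p)

P = 4.0      # norm order used in the experiments
REST = 9.0   # assumed: nine incorrect classes, each with alpha = 1
STEP = 0.5

grid = [1.0 + STEP * i for i in range(1, 100)]
values = [f(a, REST, P) for a in grid]

# Strictly decreasing: each value is smaller than its predecessor.
decreasing = all(b < a for a, b in zip(values, values[1:]))
# Strictly convex: every centred second difference is positive.
convex = all(
    values[i - 1] - 2.0 * values[i] + values[i + 1] > 0.0
    for i in range(1, len(values) - 1)
)
print(decreasing, convex)
```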
Proof of Theorem 2
Proof.
Consider a concentration parameter ${\alpha}_{j}$ corresponding to an incorrect class, i.e., $j\ne c$. Define the ratio of Gamma functions as:
$$\mu (\alpha )\stackrel{\mathrm{def}}{=}\frac{\mathrm{\Gamma}(\alpha +p)}{\mathrm{\Gamma}(\alpha )}$$
This function is positive, increasing and convex with derivative given by:
$$\begin{aligned}
\mu^{\prime}(\alpha) &= -\frac{\Gamma(\alpha+p)\Gamma^{\prime}(\alpha)}{\Gamma(\alpha)^{2}}+\frac{\Gamma^{\prime}(\alpha+p)}{\Gamma(\alpha)}\\
&= -\frac{\Gamma(\alpha+p)\psi(\alpha)}{\Gamma(\alpha)}+\frac{\Gamma(\alpha+p)\psi(\alpha+p)}{\Gamma(\alpha)}\\
&= \mu(\alpha)\left(\psi(\alpha+p)-\psi(\alpha)\right)\\
&= \mu(\alpha)\nu(\alpha)
\end{aligned}\tag{4}$$
where we used the relation ${\mathrm{\Gamma}}^{\prime}(z)=\mathrm{\Gamma}(z)\psi (z)$ and defined
$$\nu (\alpha )\stackrel{\mathrm{def}}{=}\psi (\alpha +p)-\psi (\alpha ).$$
From Lemma 1, it follows that $\nu (\alpha )>0$ which implies $\mu (\alpha )$ is increasing.
Since ${(\cdot )}^{1/p}$ is a continuous increasing function, it suffices to show the objective $\mathcal{G}={\mathcal{F}}_{i}^{p}$ is increasing, given by $\mathcal{G}({\alpha}_{j})=\left(\mu \left(\sum _{l\ne c}{\alpha}_{l}\right)+\sum _{l\ne c}\mu ({\alpha}_{l})\right)/\mu ({\alpha}_{0})$. The derivative is then calculated as:
$$\begin{aligned}
\mathcal{G}^{\prime}(\alpha_j) &= \frac{\mu^{\prime}\left(\sum_{l\ne c}\alpha_l\right)+\mu^{\prime}(\alpha_j)}{\mu(\alpha_0)}\\
&\quad-\frac{\mu^{\prime}(\alpha_0)\cdot\left[\mu\left(\sum_{l\ne c}\alpha_l\right)+\sum_{l\ne c}\mu(\alpha_l)\right]}{\mu(\alpha_0)^{2}}
\end{aligned}$$
The condition ${\mathcal{G}}^{\prime}({\alpha}_{j})>0$ is equivalent to:
$$\frac{{\mu}^{\prime}\left(\sum _{l\ne c}{\alpha}_{l}\right)+{\mu}^{\prime}({\alpha}_{j})}{{\mu}^{\prime}({\alpha}_{0})}>\frac{\mu \left(\sum _{l\ne c}{\alpha}_{l}\right)+\sum _{l\ne c}\mu ({\alpha}_{l})}{\mu ({\alpha}_{0})}=\mathcal{G}$$
Upon substituting the expression (4), this condition becomes:
$$\begin{aligned}
&\mu\left(\sum_{l\ne c}\alpha_l\right)\nu\left(\sum_{l\ne c}\alpha_l\right)+\mu(\alpha_j)\nu(\alpha_j)\\
&\quad>\left[\mu\left(\sum_{l\ne c}\alpha_l\right)+\sum_{l\ne c}\mu(\alpha_l)\right]\nu(\alpha_0)
\end{aligned}\tag{5}$$
From Lemma 1, it follows that $\nu \left(\sum _{l\ne c}{\alpha}_{l}\right)>\nu ({\alpha}_{0})$ and $\nu ({\alpha}_{j})>\nu ({\alpha}_{0})$. In addition, the functions $\mu \left(\sum _{l\ne c}{\alpha}_{l}\right)\nu \left(\sum _{l\ne c}{\alpha}_{l}\right)$ and $\mu ({\alpha}_{j})\nu ({\alpha}_{j})$ are both increasing as ${\alpha}_{j}$ grows. Using these results and the fact that $\left[\sum _{l\ne c,j}\mu ({\alpha}_{l})\right]\nu ({\alpha}_{0})\to 0$ as ${\alpha}_{j}$ grows (due to Lemma 1), it follows that the inequality (5) holds true for large ${\alpha}_{j}$. Thus, we conclude that the loss function is increasing as ${\alpha}_{j}$ gets large. The proof is complete. ∎
An illustration of Theorem 2 is shown in Fig. 10 below. An approximate loss function is also shown, based on the limit $\lim_{\alpha \to \infty}\frac{\Gamma(\alpha +p)}{\Gamma(\alpha ){\alpha}^{p}}=1$, from which we obtain the approximation $\mu (\alpha )\sim {\alpha}^{p}$. This approximation to the loss behaves similarly. Despite the initial dip, the loss is increasing as ${\alpha}_{j}$ increases. We remark that the loss is neither convex nor concave in ${\alpha}_{j}$.
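The behaviour for large ${\alpha}_{j}$ can be reproduced with a few lines of Python. The sketch below evaluates the loss in the form used in the proof, ${\mathcal{F}}_{i}={\mathcal{G}}^{1/p}$, with both the exact ratio $\mu$ and the approximation ${\alpha}^{p}$. The three-class setup (correct-class concentration $10$, one other incorrect class at $1$) is a hypothetical configuration and is not the one used to produce the figure.

```python
import math

def mu(alpha, p):
    """Exact Gamma ratio Gamma(alpha + p) / Gamma(alpha), computed in log space."""
    return math.exp(math.lgamma(alpha + p) - math.lgamma(alpha))

def mu_approx(alpha, p):
    """Large-alpha approximation mu(alpha) ~ alpha^p."""
    return alpha ** p

def loss(alpha_j, alpha_c, others, p, ratio=mu):
    """F_i = [(ratio(S) + sum_{l != c} ratio(alpha_l)) / ratio(alpha_0)]^(1/p),
    where S is the sum of the incorrect-class concentrations."""
    incorrect = [alpha_j] + list(others)
    s = sum(incorrect)
    numerator = ratio(s, p) + sum(ratio(a, p) for a in incorrect)
    return (numerator / ratio(s + alpha_c, p)) ** (1.0 / p)

P = 4.0
ALPHA_C = 10.0   # assumed concentration of the correct class
OTHERS = [1.0]   # assumed concentrations of the remaining incorrect classes

alphas_j = [10.0, 20.0, 40.0, 80.0, 160.0]
exact = [loss(a, ALPHA_C, OTHERS, P) for a in alphas_j]
approx = [loss(a, ALPHA_C, OTHERS, P, ratio=mu_approx) for a in alphas_j]
for a, e, x in zip(alphas_j, exact, approx):
    print(f"alpha_j={a:6.1f}  exact={e:.4f}  approx={x:.4f}")
```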
Proof of Theorem 3
Proof.
Consider $\mathcal{R}({\alpha}_{k})$ as a function of ${\alpha}_{k}$ for some $k\ne c$. Then, it may be decomposed as $\mathcal{R}({\alpha}_{k})={\mathcal{R}}_{k}({\alpha}_{k})+{\mathcal{R}}_{\ne k}({\alpha}_{k})$ where
$$\begin{aligned}
\mathcal{R}_k(\alpha_k) &= \frac{1}{2}(\alpha_k-1)^{2}\left(\psi^{(1)}(\alpha_k)-\psi^{(1)}(\tilde{\alpha}_0)\right)\\
\mathcal{R}_{\ne k}(\alpha_k) &= \frac{1}{2}\sum_{j\ne c,j\ne k}(\alpha_j-1)^{2}\left(\psi^{(1)}(\alpha_j)-\psi^{(1)}(\tilde{\alpha}_0)\right)
\end{aligned}$$
The first term is an increasing function since $q(\alpha )={(\alpha -1)}^{2}({\psi}^{(1)}(\alpha )-{\psi}^{(1)}(\alpha +z))$ is increasing for any $z>1$. The second term is also increasing since
$$\frac{\partial \mathcal{R}_{\ne k}(\alpha_k)}{\partial \alpha_k}=\frac{-\psi^{(2)}(\tilde{\alpha}_0)}{2}\sum_{j\ne c,j\ne k}(\alpha_j-1)^{2}\ge 0$$
which follows from the integral representation ${\psi}^{(2)}(x)=-{\int}_{0}^{\mathrm{\infty}}\frac{{t}^{2}{e}^{-tx}}{1-{e}^{-t}}\mathit{d}t\le 0$. ∎
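The monotonicity of $q(\alpha )$ and of the regularizer can be spot-checked on a grid. The sketch below uses a hand-written trigamma approximation and restricts attention to $\alpha \ge 1$. It also assumes ${\stackrel{~}{\alpha}}_{0}=1+\sum _{j\ne c}{\alpha}_{j}$, that is, the concentration sum with the correct-class entry replaced by $1$; that definition is not restated in this appendix, so treat it as an assumption of the example.

```python
import math

def trigamma(x):
    """Trigamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    tail = inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))
    return result + inv + 0.5 * inv2 + tail

def q(alpha, z):
    """q(alpha) = (alpha - 1)^2 (psi1(alpha) - psi1(alpha + z))."""
    return (alpha - 1.0) ** 2 * (trigamma(alpha) - trigamma(alpha + z))

def regularizer(incorrect_alphas):
    """R = 0.5 * sum_j (a_j - 1)^2 (psi1(a_j) - psi1(a0_tilde)).
    ASSUMPTION: a0_tilde = 1 + sum of the incorrect-class concentrations."""
    a0_tilde = 1.0 + sum(incorrect_alphas)
    return 0.5 * sum(
        (a - 1.0) ** 2 * (trigamma(a) - trigamma(a0_tilde))
        for a in incorrect_alphas
    )

grid = [1.0 + 0.5 * i for i in range(60)]
q_increasing = all(
    q(b, z) > q(a, z)
    for z in (1.5, 2.0, 5.0)
    for a, b in zip(grid, grid[1:])
)
# Vary alpha_k while two other incorrect classes stay fixed (hypothetical values).
r_values = [regularizer([a, 2.0, 3.0]) for a in grid]
r_increasing = all(b > a for a, b in zip(r_values, r_values[1:]))
print(q_increasing, r_increasing)
```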
References
- [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7533, pp. 436–444, 2015.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
- [3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014.
- [4] Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” Tech. Rep., 2016, arXiv:1609.08144.
- [5] R. Geirhos, C. R. M. Temme, J. Rauber, M. Bethge, and F. A. Wichmann, “Generalization in humans and deep neural networks,” in Advances in Neural Information Processing Systems, 2018.
- [6] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-level Performance on ImageNet classification,” in IEEE International Conference on Computer Vision (ICCV), December 2015.
- [7] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber, “Multi-column deep neural network for traffic sign classification,” Neural Networks, vol. 32, pp. 333–338, 2012.
- [8] W. Xiong, J. Droppo, X. Huang, F. Seide, M. L. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “Toward Human Parity in Conversational Speech Recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2410–2423, December 2017.
- [9] G. Hinton, L. Deng et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
- [10] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep Learning for Identifying Metastatic Breast Cancer,” Tech. Rep., June 2016, arXiv:1606.05718.
- [11] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, “Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning,” Nature biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
- [12] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On Calibration of Modern Neural Networks,” in International Conference on Machine Learning, 2017.
- [13] C. Louizos and M. Welling, “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks,” in International Conference on Machine Learning (ICML), 2017.
- [14] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight Uncertainty in Neural Networks,” in International Conference on Machine Learning (ICML), 2015.
- [15] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems (NIPS), 2015.
- [16] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning,” in International Conference on Machine Learning (ICML), 2016.
- [17] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning (ICML), 2017.
- [18] Y. Li and Y. Gal, “Dropout inference in Bayesian neural networks with alpha-divergences,” in International Conference on Machine Learning, 2017.
- [19] D. J. MacKay, “A practical Bayesian framework for backpropagation networks,” Neural Computation, vol. 4, no. 3, pp. 448–472, 1992.
- [20] H. Ritter, A. Botev, and D. Barber, “A Scalable Laplace Approximation for Neural Networks,” in International Conference on Learning Representations, 2018.
- [21] J. M. Hernandez-Lobato and R. P. Adams, “Probabilistic backpropagation for scalable learning of bayesian neural networks,” in International Conference on Machine Learning, 2015.
- [22] S. Sun, C. Chen, and L. Carin, “Learning Structured Weight Uncertainty in Bayesian Neural Networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
- [23] T. Chen, E. Fox, and C. Guestrin, “Stochastic Gradient Hamiltonian Monte Carlo,” in International Conference on Machine Learning, 2014.
- [24] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles,” in Advances in Neural Information Processing Systems, 2017.
- [25] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” in International Conference on Learning Representations, 2015.
- [26] A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial Machine Learning at Scale,” in International Conference on Learning Representations, 2017.
- [27] M. Sensoy, L. Kaplan, and M. Kandemir, “Evidential Deep Learning to Quantify Classification Uncertainty,” in Advances in Neural Information Processing Systems (NIPS) 31, 2018.
- [28] A. Malinin and M. Gales, “Reverse KL-Divergence Training of Prior Networks: Improved Uncertainty and Adversarial Robustness,” Tech. Rep., 2019, arXiv:1905.13472.
- [29] J. G. Mauldon, “A generalization of the Beta-distributions,” Annals of Mathematical Statistics, vol. 30, pp. 502–520, 1959.
- [30] J. E. Mosimann, “On the compound multinomial distribution, the multivariate beta-distribution, and correlations among proportions,” Biometrika, vol. 49, pp. 65–82, 1962.
- [31] N. Houlsby, F. Huszar, Z. Ghahramani, and M. Lengyel, “Bayesian Active Learning for Classification and Preference Learning,” Tech. Rep., 2011, arXiv:1112.5745.
- [32] A. Rényi, “On measures of entropy and information,” in Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961, pp. 547–561.
- [33] T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
- [34] D. Haussler and M. Opper, “Mutual Information, Metric Entropy and Cumulative Relative Entropy Risk,” The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
- [35] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’16. Berkeley, CA, USA: USENIX Association, 2016, pp. 265–283. [Online]. Available: http://dl.acm.org/citation.cfm?id=3026877.3026899
- [36] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations, 2015.
- [37] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.
- [38] Y. Bulatov. (2011) notMNIST dataset. [Online]. Available: http://yaroslavvb.com/upload/notMNIST/
- [39] B. Lake, R. Salakhutdinov, and J. B. Tenenbaum. (2015) OmniGlot dataset. [Online]. Available: https://github.com/brendenlake/omniglot
- [40] A. Krizhevsky. The CIFAR-10 Dataset. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
- [41] F.-F. Li, J. Johnson, and S. Yeung. (2017) Tiny ImageNet. [Online]. Available: http://cs231n.stanford.edu/tiny-imagenet-200.zip
- [42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. (2011) Reading Digits in Natural Images with Unsupervised Feature Learning. [Online]. Available: http://ufldl.stanford.edu/housenumbers/