Attention-based neural networks have achieved state-of-the-art results on awide range of tasks. Most such models use deterministic attention whilestochastic attention is less explored due to the optimization difficulties orcomplicated model design. This paper introduces Bayesian attention beliefnetworks, which construct a decoder network by modeling unnormalized attentionweights with a hierarchy of gamma distributions, and an encoder network bystacking Weibull distributions with a deterministic-upward-stochastic-downwardstructure to approximate the posterior. The resulting auto-encoding networkscan be optimized in a differentiable way with a variational lower bound. It issimple to convert any models with deterministic attention, including pretrainedones, to the proposed Bayesian attention belief networks. On a variety oflanguage understanding tasks, we show that our method outperforms deterministicattention and state-of-the-art stochastic attention in accuracy, uncertaintyestimation, generalization across domains, and robustness to adversarialattacks. We further demonstrate the general applicability of our method onneural machine translation and visual question answering, showing greatpotential of incorporating our method into various attention-related tasks.