Quantum Statistics-Inspired Neural Attention

Abstract

Sequence-to-sequence (encoder-decoder) models with attention constitute a cornerstone of deep learning research, as they have enabled unprecedented sequential data modeling capabilities. This effectiveness largely stems from the capacity of these models to infer salient temporal dynamics over long horizons; these are encoded into the obtained neural attention (NA) distributions. However, existing NA formulations essentially constitute point-wise selection mechanisms over the observed source sequences; that is, attention weight computation relies on the assumption that each source sequence element is independent of the rest. Although convenient, this assumption fails to account for higher-order dependencies which might be prevalent in real-world data. This paper addresses these limitations by leveraging quantum-statistical modeling arguments. Specifically, our work broadens the notion of NA by attempting to account for the case in which the NA model is inherently incapable of distinguishing between individual source elements; this is assumed to be the case due to higher-order temporal dynamics. Instead, we postulate that in some cases selection may be feasible only at the level of pairs of source sequence elements. To this end, we cast NA as inference of an attention density matrix (ADM) approximation. We derive effective training and inference algorithms, and evaluate our approach in the context of a machine translation (MT) application. We perform experiments with challenging benchmark datasets. As we show, our approach yields favorable outcomes in terms of several evaluation metrics.
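To make the point-wise selection assumption concrete, the following is a minimal sketch of conventional soft attention, not of the ADM approach proposed here. The dot-product scoring rule and the names `pointwise_attention` and `context_vector` are illustrative assumptions; the abstract does not specify a scoring function. Note that each score is computed from the query and a single source element, which is exactly the independence assumption discussed above.

```python
import math

def pointwise_attention(query, keys):
    """Soft attention weights: one score per source element, then a softmax.

    Assumption: dot-product scoring, chosen only for illustration.
    """
    # Each score depends on the query and one source element only,
    # so no interaction between source elements enters the scores.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax over the per-element scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, values):
    """Weighted sum of source vectors under the attention distribution."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical toy inputs: one decoder query and three source elements.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = pointwise_attention([1.0, 0.0], source)
context = context_vector(weights, source)
```

The weights form a distribution over individual source elements. Selection at the level of pairs of elements, as postulated above, would require a representation that this per-element softmax cannot express.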
