Language Through a Prism: A Spectral Approach for Multiscale Language Representations

Abstract

Language exhibits structure at different scales, ranging from subwords towords, sentences, paragraphs, and documents. To what extent do deep modelscapture information at these scales, and can we force them to better capturestructure across this hierarchy? We approach this question by focusing onindividual neurons, analyzing the behavior of their activations at differenttimescales. We show that signal processing provides a natural framework forseparating structure across scales, enabling us to 1) disentanglescale-specific information in existing embeddings and 2) train models to learnmore about particular scales. Concretely, we apply spectral filters to theactivations of a neuron across an input, producing filtered embeddings thatperform well on part of speech tagging (word-level), dialog speech actsclassification (utterance-level), or topic classification (document-level),while performing poorly on the other tasks. We also present a prism layer fortraining models, which uses spectral filters to constrain different neurons tomodel structure at different scales. Our proposed BERT + Prism model can betterpredict masked tokens using long-range context and produces multiscalerepresentations that perform better at utterance- and document-level tasks. Ourmethods are general and readily applicable to other domains besides language,such as images, audio, and video.

Quick Read (beta)

loading the full paper ...