textTOvec: Deep Contextualized Neural Autoregressive Models of Language with Distributed Compositional Prior

Abstract

We address two challenges of probabilistic topic modelling in order to betterestimate the probability of a word in a given context, i.e., P(word|context):(1) No Language Structure in Context: Probabilistic topic models ignore wordorder by summarizing a given context as a "bag-of-word" and consequently thesemantics of words in the context is lost. The LSTM-LM learns a vector-spacerepresentation of each word by accounting for word order in local collocationpatterns and models complex characteristics of language (e.g., syntax andsemantics), while the TM simultaneously learns a latent representation from theentire document and discovers the underlying thematic structure. We unite twocomplementary paradigms of learning the meaning of word occurrences bycombining a TM and a LM in a unified probabilistic framework, named asctx-DocNADE. (2) Limited Context and/or Smaller training corpus of documents:In settings with a small number of word occurrences (i.e., lack of context) inshort text or data sparsity in a corpus of few documents, the application ofTMs is challenging. We address this challenge by incorporating externalknowledge into neural autoregressive topic models via a language modellingapproach: we use word embeddings as input of a LSTM-LM with the aim to improvethe word-topic mapping on a smaller and/or short-text corpus. The proposedDocNADE extension is named as ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled withneural LMs and embeddings priors that consistently outperform state-of-the-artgenerative TMs in terms of generalization (perplexity), interpretability (topiccoherence) and applicability (retrieval and classification) over 6 long-textand 8 short-text datasets from diverse domains.

Quick Read (beta)

loading the full paper ...