textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

Abstract

We address two challenges of probabilistic topic modelling in order to betterestimate the probability of a word in a given context, i.e., P(word|context):(1) No Language Structure in Context: Probabilistic topic models ignore wordorder by summarizing a given context as a "bag-of-word" and consequently thesemantics of words in the context is lost. The LSTM-LM learns a vector-spacerepresentation of each word by accounting for word order in local collocationpatterns and models complex characteristics of language (e.g., syntax andsemantics), while the TM simultaneously learns a latent representation from theentire document and discovers the underlying thematic structure. We unite twocomplementary paradigms of learning the meaning of word occurrences bycombining a TM (e.g., DocNADE) and a LM in a unified probabilistic framework,named as ctx-DocNADE. (2) Limited Context and/or Smaller training corpus ofdocuments: In settings with a small number of word occurrences (i.e., lack ofcontext) in short text or data sparsity in a corpus of few documents, theapplication of TMs is challenging. We address this challenge by incorporatingexternal knowledge into neural autoregressive topic models via a languagemodelling approach: we use word embeddings as input of a LSTM-LM with the aimto improve the word-topic mapping on a smaller and/or short-text corpus. Theproposed DocNADE extension is named as ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled withneural LMs and embeddings priors that consistently outperform state-of-the-artgenerative TMs in terms of generalization (perplexity), interpretability (topiccoherence) and applicability (retrieval and classification) over 6 long-textand 8 short-text datasets from diverse domains.

Quick Read (beta)

loading the full paper ...