Revisiting Simple Neural Probabilistic Language Models

Abstract

Recent progress in language modeling has been driven not only by advances in neural architectures, but also by hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM) of Bengio et al. (2003), which simply concatenates word embeddings within a fixed window and passes the result through a feed-forward network to predict the next word. When scaled up to modern hardware, this model (despite its many limitations) performs much better than expected on word-level language model benchmarks. Our analysis reveals that the NPLM achieves lower perplexity than a baseline Transformer with short input contexts but struggles to handle long-term dependencies. Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer, which results in small but consistent perplexity decreases across three word-level language modeling datasets.
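
The concatenate-then-feed-forward idea described in the abstract can be illustrated with a minimal sketch. This is a toy forward pass only, not the scaled-up model evaluated in the paper: the function names (`make_nplm`, `next_word_probs`), the sizes, the random initialization, the single tanh hidden layer, and the omission of bias terms are all assumptions made for illustration.

```python
import math
import random


def make_nplm(vocab_size, embed_dim, window, hidden_dim, seed=0):
    """Randomly initialise parameters for a toy NPLM (all sizes are illustrative)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": matrix(vocab_size, embed_dim),              # one vector per word
        "w_hidden": matrix(hidden_dim, window * embed_dim),  # acts on the concatenation
        "w_out": matrix(vocab_size, hidden_dim),             # hidden state -> vocabulary logits
        "window": window,
    }


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def next_word_probs(params, context_ids):
    """Return a distribution over the next word given exactly `window` previous word ids."""
    assert len(context_ids) == params["window"]
    # Local concatenation: the window's embeddings are laid end to end,
    # so each position occupies its own slice of the input vector.
    concat = [x for i in context_ids for x in params["embed"][i]]
    hidden = [math.tanh(h) for h in matvec(params["w_hidden"], concat)]
    logits = matvec(params["w_out"], hidden)
    # Numerically stable softmax over the vocabulary.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


params = make_nplm(vocab_size=10, embed_dim=4, window=3, hidden_dim=8)
probs = next_word_probs(params, [1, 5, 2])
```

Because the embeddings are concatenated rather than summed or averaged, word order inside the window is preserved without any positional encoding. The fixed `window` argument also makes the limitation named in the abstract concrete: words outside the window cannot influence the prediction.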
