Abstract
We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68, respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.