Abstract
Although neural language models are effective at capturing statistics of natural language, their representations are challenging to interpret. In particular, it is unclear how these models retain information over multiple timescales. In this work, we construct explicitly multi-timescale language models by manipulating the input and forget gate biases in a long short-term memory (LSTM) network. The distribution of timescales is selected to approximate power law statistics of natural language through a combination of exponentially decaying memory cells. We then empirically analyze the timescale of information routed through each part of the model using word ablation experiments and forget gate visualizations. These experiments show that the multi-timescale model successfully learns representations at the desired timescales, and that the distribution includes longer timescales than a standard LSTM. Further, information about high-, mid-, and low-frequency words is routed preferentially through units with the appropriate timescales. Thus we show how to construct language models with interpretable representations of different information timescales.
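The link between a forget gate bias and a timescale, and between a mix of exponential decays and a power law, can be illustrated with a minimal sketch. It assumes that a cell whose forget gate sits at a fixed value f retains a fraction f^t of its content after t tokens, so its timescale is T = -1/ln(f). It also assumes that the timescales are drawn from an inverse gamma distribution, a choice made here only because the resulting mixture has a closed-form power law. The abstract does not specify the distribution, and all function names and parameter values below are hypothetical.

```python
import math
import random


def forget_bias_for_timescale(timescale):
    """Return the bias b for which sigmoid(b) = exp(-1 / timescale)."""
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    # sigmoid(b) = exp(-1/T)  <=>  b = -log(exp(1/T) - 1)
    return -math.log(math.expm1(1.0 / timescale))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def timescale_of_forget_gate(f):
    """Timescale T (in tokens) of a cell whose forget gate is fixed at f in (0, 1)."""
    return -1.0 / math.log(f)


def sample_decay_rates(n_units, shape, scale, seed=0):
    """Sample 1/T for n_units cells, with T ~ InverseGamma(shape, scale).

    The inverse gamma choice is an assumption of this sketch.
    """
    rng = random.Random(seed)
    # If 1/T ~ Gamma(shape, rate=scale), then T ~ InverseGamma(shape, scale).
    # random.gammavariate takes (shape, scale), so the gamma scale is 1/scale.
    return [rng.gammavariate(shape, 1.0 / scale) for _ in range(n_units)]


def mixture_memory(rates, lag):
    """Average retained fraction f**lag = exp(-lag / T) across all cells."""
    return sum(math.exp(-lag * r) for r in rates) / len(rates)


def power_law(lag, shape, scale):
    """Closed-form expectation of exp(-lag / T) under the inverse gamma above."""
    return (1.0 + lag / scale) ** (-shape)


if __name__ == "__main__":
    shape, scale = 1.5, 2.0  # hypothetical values, chosen only for illustration
    rates = sample_decay_rates(20000, shape, scale)

    # Forget gate biases that would pin the first few cells to their timescales.
    for r in rates[:3]:
        T = 1.0 / r
        print(f"T = {T:8.2f} tokens -> forget bias = {forget_bias_for_timescale(T):+.3f}")

    # The average of many exponential decays tracks a power law in the lag.
    for lag in (1, 5, 20, 100):
        print(lag, round(mixture_memory(rates, lag), 4), round(power_law(lag, shape, scale), 4))
```

Each individual cell still forgets exponentially; only the population average decays as a power law, which is why the distribution of timescales across units matters. How the input gate bias is tied to the forget gate bias is not covered here; one plausible option, stated as an assumption, is to set it to the negative of the forget gate bias so that the two gates are roughly complementary.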