Abstract
Despite recent advances in training recurrent neural networks (RNNs), capturing long-term dependencies in sequences remains a fundamental challenge. Most approaches use backpropagation through time (BPTT), which is difficult to scale to very long sequences. This paper proposes a simple method that improves the ability to capture long-term dependencies in RNNs by adding an unsupervised auxiliary loss to the original objective. This auxiliary loss forces RNNs to either reconstruct previous events or predict next events in a sequence, making truncated backpropagation feasible for long sequences and also improving full BPTT. We evaluate our method on a variety of settings, including pixel-by-pixel image classification with sequence lengths up to 16,000, and a real document classification benchmark. Our results highlight the good performance and resource efficiency of this approach relative to competitive baselines, including other recurrent models and a comparably sized Transformer. Further analyses reveal beneficial effects of the auxiliary loss on optimization and regularization, as well as extreme cases where there is little to no backpropagation.
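
To make the idea of an auxiliary objective concrete, the following is a minimal sketch, not the implementation used in this paper. It assumes that a short segment ending at a randomly chosen anchor position is selected, that the auxiliary task either reconstructs that segment or predicts each next item within it, and that the auxiliary term is added to the main loss with a weight. All function names, the `aux_weight` parameter, the squared-error loss, and the placeholder "decoder" are hypothetical choices made for illustration only.

```python
import random

def sample_segment(seq_len, seg_len, rng):
    """Return (start, anchor): the segment is the seg_len items ending at a random anchor."""
    anchor = rng.randrange(seg_len, seq_len + 1)  # anchor lies in [seg_len, seq_len]
    return anchor - seg_len, anchor

def aux_pairs(segment, mode):
    """Build (inputs, targets) for the auxiliary task on one segment."""
    if mode == "reconstruct":
        # Reconstruct the previous events: the target is the segment itself.
        return segment, segment
    if mode == "predict":
        # Predict next events: each input is paired with the item that follows it.
        return segment[:-1], segment[1:]
    raise ValueError(f"unknown mode: {mode}")

def mean_squared_error(pred, target):
    """Illustrative loss only; the choice of loss here is an assumption."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def total_loss(main_loss, aux_loss, aux_weight=1.0):
    """Original objective plus a weighted auxiliary term (the weight is an assumption)."""
    return main_loss + aux_weight * aux_loss

# Toy sequence 0.0, 1.0, ..., 15.0 and a segment of length 4.
sequence = [float(i) for i in range(16)]
rng = random.Random(0)
start, anchor = sample_segment(len(sequence), 4, rng)
segment = sequence[start:anchor]

inputs, targets = aux_pairs(segment, "predict")
predictions = inputs  # placeholder for a decoder's output: it just echoes its input
aux = mean_squared_error(predictions, targets)
loss = total_loss(main_loss=0.5, aux_loss=aux)  # 0.5 stands in for the supervised loss
```

In a real model, `predictions` would come from a recurrent decoder, and gradients of the auxiliary term would only need to flow through the short segment. That locality is what the abstract suggests makes truncated backpropagation workable on long sequences.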