R-Transformer: Recurrent Neural Network Enhanced Transformer

Abstract

Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, they suffer severely from two issues: they are weak at capturing very long-term dependencies, and their sequential computation cannot be parallelized. Therefore, many non-recurrent sequence models built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack the components necessary to model local structures in sequences and rely heavily on position embeddings, which have limited effects and require a considerable amount of design effort. In this paper, we propose the R-Transformer, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains, and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at https://github.com/DSE-MSU/R-transformer.
