Abstract
We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we are able to understand (to some extent) some of the attention heads as well as how the information flows in the network. Based on these observations, we propose a hypothesis that here pretraining helps merely because it provides a smart initialization, rather than because of some deep knowledge stored in the network. We also observe that in some data regimes the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's ability to generalize to simple variants of the main task; moreover, we find that one can prevent such shortcuts with appropriate architecture modification or careful data preparation. Motivated by our findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures in the key/query/value maps, shows an encouraging edge.