On the Ability of Self-Attention Networks to Recognize Counter Languages

Abstract

Transformers have supplanted recurrent models in a large number of NLP tasks. However, the differences in their abilities to model different syntactic properties remain largely unknown. Past works suggest that LSTMs generalize very well on regular languages and have close connections with counter languages. In this work, we systematically study the ability of Transformers to model such languages as well as the role of their individual components in doing so. We first provide a construction of Transformers for a subclass of counter languages, including well-studied languages such as n-ary Boolean Expressions, Dyck-1, and its generalizations. In experiments, we find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction. Perhaps surprisingly, in contrast to LSTMs, Transformers do well only on a subset of regular languages, with degrading performance as we make languages more complex according to a well-known measure of complexity. Our analysis also provides insights on the role of the self-attention mechanism in modeling certain behavior and the influence of positional encoding schemes on the learning and generalization ability of the model.
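
For readers unfamiliar with the term, a counter language is one that can be recognized by a machine that keeps only integer counters. Dyck-1, the language of well-balanced parentheses, is the standard example and needs a single counter. The following is a minimal illustrative sketch of such a recognizer, not the Transformer construction described in the paper; the function name `is_dyck1` and the choice of `(` and `)` as the alphabet are assumptions made for illustration.

```python
def is_dyck1(s: str) -> bool:
    """Return True if s is a well-balanced string over '(' and ')'."""
    depth = 0  # the single counter: number of currently unmatched '('
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no matching '(' before it.
                return False
        else:
            # Any other symbol is outside the assumed alphabet.
            return False
    # Accept only if every '(' has been closed.
    return depth == 0
```

For example, `is_dyck1("(()())")` is accepted while `is_dyck1("())(")` is rejected, because the counter drops below zero at the third symbol.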
