How transformers learn structured data: insights from hierarchical filtering

Abstract

We introduce a hierarchical filtering procedure for generative models of sequences on trees, enabling control over the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformer architectures can implement the optimal Belief Propagation algorithm on both root classification and masked language modeling tasks. Correlations at larger distances corresponding to increasing layers of the hierarchy are sequentially included as the network is trained. We analyze how the transformer layers succeed by focusing on attention maps from models trained with varying degrees of filtering. These attention maps show clear evidence for iterative hierarchical reconstruction of correlations, and we can relate these observations to a plausible implementation of the exact inference algorithm for the network sizes considered.
