Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Abstract

Transformer-based language models are effective but complex, andunderstanding their inner workings is a significant challenge. Previousresearch has primarily explored how these models handle simple tasks like namecopying or selection, and we extend this by investigating how these modelsgrasp complex, recursive language structures defined by context-free grammars(CFGs). We introduce a family of synthetic CFGs that produce hierarchicalrules, capable of generating lengthy sentences (e.g., hundreds of tokens) thatare locally ambiguous and require dynamic programming to parse. Despite thiscomplexity, we demonstrate that generative models like GPT can accurately learnthis CFG language and generate sentences based on it. We explore the model'sinternals, revealing that its hidden states precisely capture the structure ofCFGs, and its attention patterns resemble the information passing in a dynamicprogramming algorithm. This paper also presents several corollaries, including showing whypositional embedding is inferior to relative attention or rotary embedding;demonstrating that encoder-based models (e.g., BERT, deBERTa) cannot learn verydeeply nested CFGs as effectively as generative models (e.g., GPT); andhighlighting the necessity of adding structural and syntactic errors to thepretraining data to make the model more robust to corrupted language prefixes.

Quick Read (beta)

loading the full paper ...