We identify "stable regions" in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions, where similar prompts cluster within regions, and activations from the same region lead to similar next token predictions. This work provides a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
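To make the notion of a stable region concrete, the following is a minimal sketch on a toy function, not the procedure used in this work. It assumes one simple way to probe sensitivity: interpolate linearly between two activation vectors and record how far the output moves from its starting value. The names `toy_model` and `interpolation_profile`, the two-dimensional activations, and the `sharpness` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def toy_model(x, sharpness=50.0):
    """Hypothetical stand-in for the map from an activation to output probabilities.

    A softmax with a large scale factor is nearly constant inside each region
    and changes quickly near the boundary between regions.
    """
    logits = sharpness * x
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def interpolation_profile(a, b, steps=101):
    """Output distance from the start point along the segment from a to b."""
    alphas = np.linspace(0.0, 1.0, steps)
    base = toy_model(a)
    distances = np.array(
        [np.linalg.norm(toy_model((1.0 - t) * a + t * b) - base) for t in alphas]
    )
    return alphas, distances


# Two illustrative activations that lie in different regions of the toy model.
act_a = np.array([1.0, 0.0])
act_b = np.array([0.0, 1.0])

alphas, distances = interpolation_profile(act_a, act_b)

# The largest change between neighbouring steps marks the region boundary.
jumps = np.diff(distances)
boundary_alpha = alphas[np.argmax(jumps)]

print("output distance at alpha=0.25:", distances[25])
print("output distance at alpha=1.00:", distances[-1])
print("sharpest change near alpha =", boundary_alpha)
```

In this toy setting the output barely moves over the first quarter of the path and then changes abruptly around the midpoint, which is the qualitative pattern the abstract describes. Applying the same probe to a real Transformer would require replacing `toy_model` with a forward pass from a chosen residual stream layer.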