Language Models use Lookbacks to Track Beliefs

Abstract

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
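
The two-step retrieval described in the abstract can be pictured with a small symbolic sketch. This is only a toy analogy under stated assumptions, not the authors' method or the model's actual computation: the real mechanism is described as operating on low-rank subspaces of the residual stream, whereas here Ordering IDs are plain integers assigned by order of appearance. All names (`StateToken`, `build_bindings`, `binding_lookback`, `answer_lookback`, `query_belief`) and the example story are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StateToken:
    """Toy stand-in for a state token that co-locates the OIs of its triple."""
    token: str         # the state word itself, e.g. "coffee"
    character_oi: int  # Ordering ID of the character bound to this state
    object_oi: int     # Ordering ID of the object bound to this state
    state_oi: int      # Ordering ID of the state


def build_bindings(triples):
    """Assign integer OIs by order of first appearance (an assumption of this
    sketch) and store each triple's OIs together at its state token."""
    char_ids, obj_ids, bindings = {}, {}, []
    for state_oi, (character, obj, state) in enumerate(triples):
        c = char_ids.setdefault(character, len(char_ids))
        o = obj_ids.setdefault(obj, len(obj_ids))
        bindings.append(StateToken(state, c, o, state_oi))
    return char_ids, obj_ids, bindings


def binding_lookback(bindings, character_oi, object_oi):
    """Step 1: from the queried character and object OIs, retrieve the state OI."""
    for b in bindings:
        if b.character_oi == character_oi and b.object_oi == object_oi:
            return b.state_oi
    return None  # no matching binding in this toy model


def answer_lookback(bindings, state_oi):
    """Step 2: from the state OI, retrieve the state token itself."""
    for b in bindings:
        if b.state_oi == state_oi:
            return b.token
    return None


def query_belief(triples, character, obj):
    """Chain the two lookbacks to answer a belief question."""
    char_ids, obj_ids, bindings = build_bindings(triples)
    if character not in char_ids or obj not in obj_ids:
        return None
    state_oi = binding_lookback(bindings, char_ids[character], obj_ids[obj])
    if state_oi is None:
        return None
    return answer_lookback(bindings, state_oi)


# Hypothetical story: two characters each change the state of one object.
story = [("Bob", "cup", "coffee"), ("Carla", "bottle", "tea")]
print(query_belief(story, "Bob", "cup"))
print(query_belief(story, "Carla", "bottle"))
```

The sketch separates the reference (the state OI) from the payload (the state token), which is the distinction the abstract draws between the binding lookback and the answer lookback. It does not model the visibility lookback, and returning `None` when no binding exists is a simplification of this sketch only.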
