Abstract
Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations, which we term modal difference vectors, that discriminate between modal categories within a variety of LMs. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.