Abstract
Detecting when a neural sequence model does "interesting" computation is an open problem. The next-token prediction loss is a poor indicator: low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that the model can ignore. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric, in contrast to the next-token prediction loss, correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer, which serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the new information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.