Dissociating model architectures from inference computations

Abstract

Parr et al. (2025) examine how autoregressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that autoregressive models mimic deep temporal computations when context access is structured during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.
