Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

Abstract

Transformer-based models generate hidden states that are difficult to interpret. In this work, we analyze hidden states and modify them at inference, with a focus on motion forecasting. We use linear probing to analyze whether interpretable features are embedded in hidden states. Our experiments reveal high probing accuracy, indicating latent space regularities with functionally important directions. Building on this, we use the directions between hidden states with opposing features to fit control vectors. At inference, we add our control vectors to hidden states and evaluate their impact on predictions. Remarkably, such modifications preserve the feasibility of predictions. We further refine our control vectors using sparse autoencoders (SAEs). This leads to more linear changes in predictions when scaling control vectors. Our approach enables mechanistic interpretation as well as zero-shot generalization to unseen dataset characteristics with negligible computational overhead.
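As a rough illustration of fitting a control vector from hidden states with opposing features and adding it at inference, the following minimal sketch uses synthetic data. The difference-of-means fit, the function names `fit_control_vector` and `apply_control`, the "fast"/"slow" feature labels, and the toy hidden states are all assumptions made for illustration; the abstract does not specify the exact fitting procedure, and the SAE refinement is not shown.

```python
import numpy as np


def fit_control_vector(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Assumed fit: difference of mean hidden states of two opposing feature classes."""
    return h_pos.mean(axis=0) - h_neg.mean(axis=0)


def apply_control(hidden: np.ndarray, control: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add a scaled control vector to every hidden state (broadcast over rows)."""
    return hidden + scale * control


# Synthetic stand-ins for hidden states (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 1.0  # the feature is encoded along the first axis in this toy setup

h_fast = rng.normal(size=(200, dim)) + 2.0 * direction  # hidden states labeled "fast"
h_slow = rng.normal(size=(200, dim)) - 2.0 * direction  # hidden states labeled "slow"

control = fit_control_vector(h_fast, h_slow)

# Steering the "slow" states with scale 1.0 moves their mean onto the "fast" mean.
steered = apply_control(h_slow, control, scale=1.0)
```

In this toy setup the scale parameter moves hidden states linearly along the fitted direction; whether predictions change linearly with the scale depends on the downstream model, which is the property the SAE refinement is reported to improve.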
