Abstract
This work investigates whether modern speech models are sensitive to prosodic emphasis, that is, whether they encode emphasized and neutral words in systematically different ways. Prior work typically relies on isolated acoustic correlates (e.g., pitch, duration) or on label prediction, both of which miss the relational structure of emphasis. This paper proposes a residual-based framework that defines emphasis as the difference between the representations of a paired neutral and emphasized word. Analysis of self-supervised speech models shows that these residuals correlate strongly with duration changes and perform poorly at word identity prediction, indicating a structured, relational encoding of prosodic emphasis. In ASR fine-tuned models, residuals occupy a subspace up to 50% more compact than in pre-trained models, further suggesting that emphasis is encoded as a consistent, low-dimensional transformation that becomes more structured with task-specific learning.
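To make the residual definition concrete, the following is a minimal sketch under stated assumptions rather than the paper's implementation. It assumes each word is already summarized as a fixed-size vector (for example, by pooling a model's frame-level features), and it measures subspace compactness as the number of principal components needed to explain 90% of the residual variance. That measure, the threshold, the function names `emphasis_residuals` and `effective_dim`, and the synthetic data are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def emphasis_residuals(neutral, emphasized):
    """Return r_i = emphasized_i - neutral_i for each paired word vector."""
    neutral = np.asarray(neutral, dtype=float)
    emphasized = np.asarray(emphasized, dtype=float)
    if neutral.shape != emphasized.shape:
        raise ValueError("paired arrays must have the same shape")
    return emphasized - neutral


def effective_dim(vectors, var_threshold=0.9):
    """Number of principal components needed to reach var_threshold of variance.

    Assumed proxy for subspace compactness; smaller means more compact.
    """
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to
    # the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index whose cumulative share reaches the threshold, as a count.
    return int(np.searchsorted(cumulative, var_threshold) + 1)


# Synthetic stand-in for paired word representations (hypothetical data):
# emphasis is simulated as a shift confined to 3 directions plus small noise.
rng = np.random.default_rng(0)
n_pairs, dim = 200, 32
neutral = rng.normal(size=(n_pairs, dim))
shift = rng.normal(size=(n_pairs, 3)) @ rng.normal(size=(3, dim))
emphasized = neutral + shift + 0.01 * rng.normal(size=(n_pairs, dim))

residuals = emphasis_residuals(neutral, emphasized)
print("effective dim of residuals:", effective_dim(residuals))
print("effective dim of neutral vectors:", effective_dim(neutral))
```

Under this proxy, comparing `effective_dim` for residuals taken from a pre-trained model with those from its ASR fine-tuned counterpart would be one way to quantify the relative compactness described above.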