When Meaning Stays the Same, but Models Drift: Evaluating Quality of Service under Token-Level Behavioral Instability in LLMs

Abstract

We investigate how large language models respond to prompts that differ only in their token-level realization but preserve the same semantic intent, a phenomenon we call prompt variance. We propose Prompt-Based Semantic Shift (PBSS), a diagnostic framework for measuring behavioral drift in LLMs under semantically equivalent prompt rewordings. Applied to ten constrained tasks, PBSS reveals consistent, model-specific response shifts, suggesting statistical regularities linked to tokenization and decoding. These results highlight an overlooked dimension of model evaluation stability under rephrasing and suggest that tokenization strategies and decoding dynamics may contribute to post-training quality of service instability.
