Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

Abstract

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it -- they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
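
To make the incentive concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a toy setting in which the same output string can be reported under two different token sequences. The function names, the integer prices, and the two token splits are hypothetical and are not taken from any real tokenizer or provider. The sketch only illustrates why a fixed price per token rewards reporting more tokens, while a price that is linear in character count does not.

```python
# Toy illustration (all names, prices, and token splits are hypothetical).

def pay_per_token(tokens, price_per_token):
    """Charge a fixed price for every reported token."""
    return len(tokens) * price_per_token


def pay_per_character(tokens, price_per_char):
    """Charge each token linearly in its character count."""
    return sum(len(token) for token in tokens) * price_per_char


# Two reported tokenizations of the same 12-character output string.
honest = ["token", "ization"]             # 2 tokens
inflated = ["tok", "en", "iz", "ation"]   # 4 tokens
assert "".join(honest) == "".join(inflated) == "tokenization"

PRICE_PER_TOKEN = 10  # arbitrary integer price units (assumption)
PRICE_PER_CHAR = 2    # arbitrary integer price units (assumption)

# Under pay-per-token, the longer report doubles the charge.
token_charges = (pay_per_token(honest, PRICE_PER_TOKEN),
                 pay_per_token(inflated, PRICE_PER_TOKEN))

# Under per-character pricing, the charge depends only on the output string.
char_charges = (pay_per_character(honest, PRICE_PER_CHAR),
                pay_per_character(inflated, PRICE_PER_CHAR))

print(token_charges)  # (20, 40)
print(char_charges)   # (24, 24)
```

The toy ignores whether the inflated token sequence would look plausible to a user, which is the constraint that makes optimal misreporting hard when the provider must be transparent about the generative process.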
