Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

Abstract

Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent, with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API's pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers tend to also come from regions where the APIs are less affordable to begin with. Through these analyses, we aim to increase transparency around language model APIs' pricing policies and encourage the vendors to make them more equitable.
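To make the pricing mechanism concrete, the following is a minimal sketch of how a per-token price turns into a per-language cost premium. It is not the paper's method and does not use any vendor's tokenizer: `count_tokens` is a hypothetical stand-in that counts one token per UTF-8 byte, which is the worst case for a byte-level tokenizer before any merges are applied. Real tokenizers merge frequent byte sequences, and those merges depend on the training data. The price passed to `cost` is likewise an illustrative parameter, not a real price.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte.

    A real API tokenizer would merge frequent byte sequences, so actual
    counts are lower; this only illustrates the accounting.
    """
    return len(text.encode("utf-8"))


def cost(text: str, price_per_1k_tokens: float) -> float:
    """Usage-based cost: token count times a hypothetical price per 1,000 tokens."""
    return count_tokens(text) / 1000 * price_per_1k_tokens


def premium(text: str, reference: str) -> float:
    """Ratio of tokens needed for `text` versus a reference rendering of the same content."""
    return count_tokens(text) / count_tokens(reference)


# The same greeting in English and Hindi.
greetings = {"en": "Hello", "hi": "नमस्ते"}

for lang, text in greetings.items():
    print(lang, count_tokens(text), round(premium(text, greetings["en"]), 2))
```

Under this stand-in, the Devanagari greeting needs several times as many tokens as the Latin-script one, because each Devanagari code point occupies three bytes in UTF-8. With a flat per-token price, that ratio carries over directly to the bill.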
