Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs)

Abstract

This paper presents a comprehensive study on the tokenization techniques employed by state-of-the-art large language models (LLMs) and their implications on the cost and availability of services across different languages, especially low resource languages. The analysis considers multiple LLMs, including GPT-4 (using cl100k_base embeddings), GPT-3 (with p50k_base embeddings), and DaVinci (employing r50k_base embeddings), as well as the widely used BERT base tokenizer. The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. The research underscores the importance of fostering linguistically-aware development practices, especially for languages that are traditionally under-resourced. Moreover, this paper introduces case studies that highlight the real-world implications of tokenization choices, particularly in the context of electronic health record (EHR) systems. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond, with a strong emphasis on inclusivity, particularly for languages traditionally underrepresented in AI applications.
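As a rough illustration of why tokenization cost can vary across languages, the sketch below compares UTF-8 byte counts for a few sample words. It is not the paper's method and calls no real tokenizer. It assumes only that a byte-level BPE tokenizer (the family to which cl100k_base, p50k_base, and r50k_base belong) can never emit more tokens than the input has UTF-8 bytes, and approaches that bound when few learned merges cover a script. The function name and the sample words are hypothetical choices made for this example. BERT's WordPiece tokenizer works differently and is not modeled here.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words (assumption: chosen only for illustration).
samples = {
    "English": "patient",
    "Russian": "пациент",
    "Hindi": "रोगी",
}

for language, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    # n_bytes is an upper bound on the token count under byte-level BPE;
    # the real count depends on the merges each tokenizer has learned.
    print(f"{language}: {len(word)} code points, {n_bytes} UTF-8 bytes, "
          f"{utf8_bytes_per_char(word):.1f} bytes per code point")
```

Actual token counts depend on each tokenizer's learned vocabulary, so this byte-level view only indicates the worst case that an under-resourced script may approach.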
