Overcoming Vocabulary Constraints with Pixel-level Fallback

  • 2025-04-02 21:50:31
  • Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva

Abstract

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.

 
