Abstract
Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.