Generating bilingual example sentences with large language models as lexicography assistants

Abstract

We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
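As an illustration of the perplexity-based rating mentioned above, the following minimal sketch computes sentence perplexity from per-token log-probabilities and ranks candidate examples by it. It assumes the standard definition (the exponential of the mean negative log-likelihood per token); the abstract does not state which model or exact formulation the authors used, and the function names and the toy log-probabilities here are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def sentence_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs holds natural-log probabilities that some language
    model assigned to each token of the sentence (assumed input).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def rank_by_perplexity(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return candidate sentences ordered from lowest to highest perplexity."""
    return sorted(candidates, key=lambda s: sentence_perplexity(candidates[s]))


# Toy values for illustration only; real scores would come from a model.
candidates = {
    "sentence A": [math.log(0.5)] * 4,
    "sentence B": [math.log(0.1)] * 4,
}
ranking = rank_by_perplexity(candidates)
```

Under this reading, a lower perplexity would suggest a more typical and intelligible sentence; the abstract reports that this proxy holds for the higher-resourced languages.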
