A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

Abstract

The scarcity or absence of parallel and comparable corpora makes bilingual lexicon extraction difficult for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This paper utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to obtain the highest F-score. Our proposed methods achieve statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
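To make the evaluation metrics concrete, the following is a minimal sketch of how precision, recall, and F-score could be computed over induced translation pairs. It assumes the balanced F-score (F1), since the abstract does not state a weighting. The function name `evaluate` and the toy word pairs are hypothetical and are not taken from the paper's datasets.

```python
def evaluate(induced, gold):
    """Return (precision, recall, F1) for induced translation pairs.

    Both arguments are iterables of (source_word, target_word) tuples.
    Assumption: F-score means the balanced F1 (harmonic mean).
    """
    induced, gold = set(induced), set(gold)
    true_positives = len(induced & gold)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(induced) if induced else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: placeholder word IDs, not real dictionary entries.
induced = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

p, r, f = evaluate(induced, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

In this toy case two of the three induced pairs are correct and two of the four gold pairs are recovered, so a method that proposes more pairs (such as a Cartesian-product baseline) would tend to trade precision for recall.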
