GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge

  • 2024-12-11 10:13:12
  • Daniil Gurgurov, Rishu Kumar, Simon Ostermann

Abstract

Contextualized embeddings based on large language models (LLMs) are available for various languages, but their coverage is often limited for lower resourced languages. Using LLMs for such languages is often difficult due to a high computational cost; not only during training, but also during inference. Static word embeddings are much more resource-efficient ("green"), and thus still provide value, particularly for very low-resource languages. There is, however, a notable lack of comprehensive repositories with such embeddings for diverse languages. To address this gap, we present GrEmLIn, a centralized repository of green, static baseline embeddings for 87 mid- and low-resource languages. We compute GrEmLIn embeddings with a novel method that enhances GloVe embeddings by integrating multilingual graph knowledge, which makes our static embeddings competitive with LLM representations, while being parameter-free at inference time. Our experiments demonstrate that GrEmLIn embeddings outperform state-of-the-art contextualized embeddings from E5 on the task of lexical similarity. They remain competitive in extrinsic evaluation tasks like sentiment analysis and natural language inference, with average performance gaps of just 5-10% or less compared to state-of-the-art models, given a sufficient vocabulary overlap with the target task, and underperform only on topic classification. Our code and embeddings are publicly available at https://huggingface.co/DFKI.
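
As an illustration of why static embeddings are cheap at inference time, the sketch below scores lexical similarity with a plain dictionary lookup followed by cosine similarity. It is a minimal example under stated assumptions, not the authors' evaluation code: the vectors in `toy_embeddings` are invented three-dimensional values, and the helper names `cosine_similarity` and `word_similarity` are hypothetical. Out-of-vocabulary words return `None`, which mirrors the abstract's caveat that results depend on vocabulary overlap with the target task.

```python
import math

# Invented toy vectors for illustration only; real static embeddings
# would be loaded from a file and have far more dimensions.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def word_similarity(word_a, word_b, embeddings):
    """Similarity of two words, or None if either is out of vocabulary."""
    if word_a not in embeddings or word_b not in embeddings:
        return None
    return cosine_similarity(embeddings[word_a], embeddings[word_b])
```

With these toy values, `word_similarity("cat", "dog", toy_embeddings)` is higher than `word_similarity("cat", "car", toy_embeddings)`, and any pair containing a word missing from the table yields `None`. No model forward pass is involved, only a lookup and a few arithmetic operations.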

 
