A Comparative Analysis of Static Word Embeddings for Hungarian

Abstract

This paper presents a comprehensive analysis of various static word embeddings for Hungarian, including traditional models such as Word2Vec and FastText, as well as static embeddings derived from BERT-based models using different extraction methods. We evaluate these embeddings on both intrinsic and extrinsic tasks to provide a holistic view of their performance. For intrinsic evaluation, we employ a word analogy task, which assesses the embeddings' ability to capture semantic and syntactic relationships. Our results indicate that traditional static embeddings, particularly FastText, excel in this task, achieving high accuracy and mean reciprocal rank (MRR) scores. Among the BERT-based models, the X2Static method for extracting static embeddings demonstrates superior performance compared to decontextualized and aggregate methods, approaching the effectiveness of traditional static embeddings. For extrinsic evaluation, we utilize a bidirectional LSTM model to perform Named Entity Recognition (NER) and Part-of-Speech (POS) tagging tasks. The results reveal that embeddings derived from dynamic models, especially those extracted using the X2Static method, outperform purely static embeddings. Notably, ELMo embeddings achieve the highest accuracy in both NER and POS tagging tasks, underscoring the benefits of contextualized representations even when used in a static form. Our findings highlight the continued relevance of static word embeddings in NLP applications and the potential of advanced extraction methods to enhance the utility of BERT-based models. This work contributes to the understanding of embedding performance in the Hungarian language and provides valuable insights for future developments in the field. The training scripts, evaluation code, restricted vocabulary, and extracted embeddings will be made publicly available to support further research and reproducibility.
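
As an illustration of the intrinsic evaluation mentioned above, the following is a minimal sketch of how accuracy and MRR might be computed on a word analogy task. It is not the paper's evaluation code: the scoring rule (vector offset with cosine similarity, excluding the three question words from the candidates), the function names `analogy_rank` and `evaluate`, and the toy four-dimensional vectors are all assumptions made for this example.

```python
import numpy as np


def analogy_rank(emb, a, b, c, d):
    """Rank (1 = best) of target word d for the analogy a : b :: c : ?.

    Assumed scoring rule: cosine similarity to (b - a + c) over unit-length
    vectors, with the three question words removed from the candidates.
    """
    words = list(emb)
    idx = {w: i for i, w in enumerate(words)}
    mat = np.stack([np.asarray(emb[w], dtype=float) for w in words])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows

    query = mat[idx[b]] - mat[idx[a]] + mat[idx[c]]
    scores = mat @ query
    for w in (a, b, c):  # question words cannot be the answer
        scores[idx[w]] = -np.inf

    order = np.argsort(-scores)  # candidate indices, best first
    return int(np.where(order == idx[d])[0][0]) + 1


def evaluate(emb, questions):
    """Return (accuracy, MRR) over a list of (a, b, c, d) analogy questions."""
    ranks = [analogy_rank(emb, *q) for q in questions]
    accuracy = float(np.mean([r == 1 for r in ranks]))
    mrr = float(np.mean([1.0 / r for r in ranks]))
    return accuracy, mrr


# Toy vectors, invented for illustration only (not real Hungarian embeddings).
toy = {
    "férfi":    [1, 0, 0, 0],  # man
    "nő":       [0, 1, 0, 0],  # woman
    "király":   [1, 0, 1, 0],  # king
    "királynő": [0, 1, 1, 0],  # queen
    "alma":     [0, 0, 0, 1],  # apple (distractor)
}

questions = [
    ("férfi", "nő", "király", "királynő"),
    ("király", "királynő", "férfi", "nő"),
]
accuracy, mrr = evaluate(toy, questions)
print(f"accuracy={accuracy:.2f}  MRR={mrr:.2f}")
```

Accuracy only credits questions whose target is ranked first, whereas MRR also gives partial credit when the target appears near the top of the ranking, which is why the two are often reported together.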
