Comparison of Turkish Word Representations Trained on Different Morphological Forms

Abstract

The increased popularity of different text representations has brought many improvements in Natural Language Processing (NLP) tasks. Without the need for supervised data, embeddings trained on large corpora provide meaningful relations that can be used in different NLP tasks. Even though training these vectors is relatively easy with recent methods, the information gained from the data depends heavily on the structure of the corpus language. Since the most commonly researched languages have a similar morphological structure, problems occurring for morphologically rich languages are mainly disregarded in studies. For morphologically rich languages, context-free word vectors ignore the morphological structure of the language. In this study, we prepared texts in morphologically different forms in a morphologically rich language, Turkish, and compared the results on different intrinsic and extrinsic tasks. To see the effect of morphological structure, we trained a word2vec model on texts in which lemmas and suffixes are treated differently. We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
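
To make the idea of "morphologically different forms" concrete, the following is a minimal sketch of how a single tokenized sentence might be rendered in several ways before embedding training. Everything here is an assumption for illustration: the hand-written `TOY_ANALYSES` table stands in for a real morphological analyzer, and the form names (`surface`, `lemma`, `split`, `lemma_suffix`) and the `+` suffix marker are hypothetical, not necessarily the variants used in this study.

```python
# Hypothetical toy analyses: surface form -> (lemma, list of suffixes).
# A real pipeline would obtain these from a Turkish morphological analyzer.
TOY_ANALYSES = {
    "evlerimizde": ("ev", ["ler", "imiz", "de"]),
    "kitaplar": ("kitap", ["lar"]),
    "geldi": ("gel", ["di"]),
}


def analyze(token):
    """Return (lemma, suffixes); unknown tokens are treated as bare lemmas."""
    return TOY_ANALYSES.get(token, (token, []))


def render(tokens, form):
    """Rewrite a token list in one of several illustrative morphological forms."""
    out = []
    for tok in tokens:
        lemma, suffixes = analyze(tok)
        if form == "surface":
            # Keep the inflected word as a single token.
            out.append(tok)
        elif form == "lemma":
            # Drop all suffixes and keep only the lemma.
            out.append(lemma)
        elif form == "split":
            # Lemma followed by each suffix as its own marked token.
            out.append(lemma)
            out.extend("+" + s for s in suffixes)
        elif form == "lemma_suffix":
            # Lemma followed by all suffixes merged into one marked token.
            out.append(lemma)
            if suffixes:
                out.append("+" + "".join(suffixes))
        else:
            raise ValueError(f"unknown form: {form}")
    return out


sentence = ["evlerimizde", "kitaplar"]
for form in ("surface", "lemma", "split", "lemma_suffix"):
    print(form, render(sentence, form))
```

Each rendered corpus could then be passed to an embedding trainer such as word2vec, so that the resulting vectors differ only in how lemmas and suffixes were exposed to the model.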
