UniSent: Universal Adaptable Sentiment Lexica for 1000+ Languages

Abstract

In this paper, we introduce UniSent, universal sentiment lexica for $1000+$ languages. Sentiment lexica are vital for sentiment analysis in the absence of document-level annotations, a very common scenario for low-resource languages. To the best of our knowledge, UniSent is the largest sentiment resource to date in terms of the number of covered languages, including many low-resource ones. In this work, we use a massively parallel Bible corpus to project sentiment information from English to other languages for sentiment analysis on Twitter data. We introduce a method called DomDrift to mitigate the large domain mismatch between Bible and Twitter text. DomDrift is a confidence weighting scheme that uses domain-specific embeddings to compare the nearest neighbors of a candidate sentiment word in the source (Bible) and target (Twitter) domains. We evaluate the quality of UniSent on the subset of languages for which manually created ground truth was available: Macedonian, Czech, German, Spanish, and French. We show that the quality of UniSent is comparable to that of manually created sentiment resources when it is used as the sentiment seed for the task of word sentiment prediction on top of embedding representations. In addition, we show that emoticon sentiments can be reliably predicted in the Twitter domain using only UniSent and monolingual embeddings in German, Spanish, French, and Italian. With the publication of this paper, we release the UniSent sentiment lexica.
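
To make the idea of comparing nearest neighbors across domains concrete, here is a minimal sketch. It is not the paper's DomDrift formulation: the abstract does not give the weighting formula, so the Jaccard overlap of neighbor sets used below is an assumption chosen for illustration. The function names (`cosine`, `nearest_neighbors`, `drift_confidence`) and the toy two-dimensional embeddings are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbors(word, emb, k):
    """Return the set of the k words most similar to `word` in one embedding space."""
    target = emb[word]
    scored = [(cosine(target, vec), other) for other, vec in emb.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def drift_confidence(word, source_emb, target_emb, k=2):
    """Assumed confidence score: Jaccard overlap of the word's neighbor sets
    in the source and target domains (1.0 = same neighbors, 0.0 = disjoint)."""
    if word not in source_emb or word not in target_emb:
        return 0.0
    src = nearest_neighbors(word, source_emb, k)
    tgt = nearest_neighbors(word, target_emb, k)
    union = src | tgt
    return len(src & tgt) / len(union) if union else 0.0


# Toy embeddings, invented for illustration only. "sick" sits near "bad" in the
# source domain but moves toward the positive words in the target domain.
source_emb = {
    "good": [1.0, 0.0], "blessed": [0.9, 0.1], "holy": [0.8, 0.2],
    "bad": [-1.0, 0.0], "sick": [-0.9, 0.1],
}
target_emb = {
    "good": [1.0, 0.0], "blessed": [0.9, 0.1], "holy": [0.8, 0.2],
    "bad": [-1.0, 0.0], "sick": [0.5, 0.85],
}

stable = drift_confidence("good", source_emb, target_emb)
shifted = drift_confidence("sick", source_emb, target_emb)
```

In this toy setup, a word whose neighborhood is the same in both domains receives a higher score than one whose neighborhood changes, so a projected sentiment label for the latter would be trusted less in the target domain.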
