Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm

Abstract

NLP tasks are often limited by scarcity of manually annotated data. In social media sentiment analysis and related tasks, researchers have therefore used binarized emoticons and specific hashtags as forms of distant supervision. Our paper shows that by extending the distant supervision to a more diverse set of noisy labels, the models can learn richer representations. Through emoji prediction on a dataset of 1246 million tweets containing one of 64 common emojis, we obtain state-of-the-art performance on 8 benchmark datasets within sentiment, emotion and sarcasm detection using a single pretrained model. Our analyses confirm that the diversity of our emotional labels yields a performance improvement over previous distant supervision approaches.
