BERTuit: Understanding Spanish language in Twitter through a native transformer

Abstract

The appearance of complex attention-based language models such as BERT, RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case of social networks such as Twitter, an ever-changing stream of information written in informal and complex language, where each message requires careful evaluation to be understood even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges. When powerful state-of-the-art multilingual language models are applied to this scenario, language-specific nuances tend to get lost in translation. To face these challenges we present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used in applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, very competitive multilingual transformers. The utility of our approach is shown with two applications: a zero-shot methodology to visualize groups of hoaxes, and the profiling of authors spreading disinformation. Misinformation spreads wildly on platforms such as Twitter in languages other than English, meaning that the performance of transformers may suffer when transferred outside English-speaking communities.
