Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages

Abstract

Analysis of the informative content and sentiments of social media users has been attempted quite intensively in the recent past. Most systems are usable only for monolingual data and fail or give poor results when applied to code-mixed data. To draw attention to this problem and encourage researchers to work on it, we prepared gold standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis purposes. In this paper, we discuss the systems we built to collect and filter raw Twitter data. To reduce manual work during annotation, hybrid systems combining rule-based and supervised models were developed for both language and sentiment tagging. The final corpus was annotated by a group of annotators following a few guidelines. The resulting gold standard corpus shows impressive inter-annotator agreement in terms of Kappa values. Various metrics such as the Code-Mixed Index (CMI) and Code-Mixed Factor (CF), along with various aspects (language and emotion), were also used to qualitatively assess the code-mixed and sentiment properties of the corpus.
