UnibucKernel Reloaded: First Place in Arabic Dialect Identification for the Second Year in a Row

  • 2018-06-21 16:24:46
  • Andrei M. Butnaru, Radu Tudor Ionescu

Abstract

We present a machine learning approach that ranked first in the Arabic Dialect Identification (ADI) Closed Shared Tasks of the 2018 VarDial Evaluation Campaign. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech or phonetic transcripts, we also use a kernel based on dialectal embeddings generated from audio recordings by the organizers. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Preliminary experiments indicate that KRR provides better classification results. Our approach is shallow and simple, but the empirical results obtained in the 2018 ADI Closed Shared Task show that it achieves the best performance. Furthermore, our top macro-F1 score (58.92%) is significantly better than the second best score (57.59%) in the 2018 ADI Shared Task, according to the statistical significance test performed by the organizers. Nevertheless, we obtain even better post-competition results (a macro-F1 score of 62.28%) using the audio embeddings released by the organizers after the competition. With a very similar approach (that did not include phonetic features), we also ranked first in the ADI Closed Shared Tasks of the 2017 VarDial Evaluation Campaign, surpassing the second best method by 4.62%. We therefore conclude that our multiple kernel learning method is the best approach to date for Arabic dialect identification.
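The pipeline outlined in the abstract (character p-gram kernels, a combination of several kernels, and a KRR classifier) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the count-based spectrum kernel, the unit-diagonal normalization, the unweighted sum used to combine kernels, the one-vs-all targets, the regularization value `lam=1.0`, the toy strings, and all function names.

```python
from collections import Counter

def pgram_counts(text, p):
    """Count every contiguous character p-gram in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(samples, p):
    """Gram matrix of dot products between p-gram count vectors (assumed kernel)."""
    feats = [pgram_counts(s, p) for s in samples]
    return [[sum(c * b[g] for g, c in a.items()) for b in feats] for a in feats]

def normalize(K):
    """Scale K so each sample has unit self-similarity (0 if a sample has no p-grams)."""
    n = len(K)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = (K[i][i] * K[j][j]) ** 0.5
            out[i][j] = K[i][j] / d if d else 0.0
    return out

def combine(kernels):
    """Unweighted sum of kernel matrices: one simple way to combine kernels."""
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) for j in range(n)] for i in range(n)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def krr_fit(K, labels, classes, lam=1.0):
    """One-vs-all KRR: solve (K + lam * I) alpha = Y with +1/-1 targets."""
    n = len(K)
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    Y = [[1.0 if y == c else -1.0 for c in classes] for y in labels]
    return solve(A, Y)

def krr_predict(K_test, alpha, classes):
    """K_test[t][i] is the similarity between test sample t and training sample i."""
    preds = []
    for row in K_test:
        scores = [sum(k * a[c] for k, a in zip(row, alpha))
                  for c in range(len(classes))]
        preds.append(classes[scores.index(max(scores))])
    return preds

# Toy data standing in for transcripts of two dialects (hypothetical).
train = ["aaaa aaab", "aaba aaaa", "zzzz zzzy", "zzyz zzzz"]
labels = ["A", "A", "B", "B"]
test = ["aaab aaaa", "zzzy zzzz"]
classes = ["A", "B"]

# Compute kernels over all samples, then slice out the train and test blocks.
samples = train + test
K = combine([normalize(spectrum_kernel(samples, p)) for p in (2, 3)])
n = len(train)
K_train = [row[:n] for row in K[:n]]
K_test = [row[:n] for row in K[n:]]

alpha = krr_fit(K_train, labels, classes)
predictions = krr_predict(K_test, alpha, classes)
print(predictions)
```

Summing normalized kernel matrices keeps each p-gram length on a comparable scale before the classifier sees them; a kernel computed from audio embeddings could be added to the same sum, provided it is defined over the same set of samples.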

 
