Abstract
With the recent influx of bidirectional contextualized transformer language models in NLP, a systematic comparative study of these models on a variety of datasets has become necessary. Moreover, the performance of these language models has not been explored on non-GLUE datasets. The study presented in this paper compares state-of-the-art language models: BERT and its derivatives (RoBERTa, ALBERT, and DistilBERT), along with ELECTRA. We conducted experiments by fine-tuning these models on cross-domain and disparate data and present an in-depth analysis of the models' performance. In addition, we present an explainability analysis of the language models, consistent with their pretraining, that verifies their context-capturing capabilities through a model-agnostic approach. The experimental results establish a new state of the art for the Yelp 2013 rating classification task and the Financial Phrasebank sentiment detection task, with 69% and 88.2% accuracy, respectively. Finally, this study can assist industry researchers in choosing a language model effectively in terms of performance or compute efficiency.