AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding

Abstract

Advances in English language representation enabled a more sample-efficientpre-training task by Efficiently Learning an Encoder that Classifies TokenReplacements Accurately (ELECTRA). Which, instead of training a model torecover masked tokens, it trains a discriminator model to distinguish trueinput tokens from corrupted tokens that were replaced by a generator network.On the other hand, current Arabic language representation approaches rely onlyon pretraining via masked language modeling. In this paper, we develop anArabic language representation model, which we name AraELECTRA. Our model ispretrained using the replaced token detection objective on large Arabic textcorpora. We evaluate our model on two Arabic reading comprehension tasks, andwe show that AraELECTRA outperforms current state-of-the-art Arabic languagerepresentation models given the same pretraining data and with even a smallermodel size.

Quick Read (beta)

loading the full paper ...