FaBERT: Pre-training BERT on Persian Blogs

Abstract

We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse and cleaned corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications. FaBERT is openly accessible at https://huggingface.co/sbunlp/fabert
