Rethinking Masked Language Modeling for Chinese Spelling Correction

Abstract

In this paper, we study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model. Through empirical analysis, we find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns. Given that BERT is the backbone of most CSC models, this phenomenon has a significant negative impact. To address this issue, we are releasing a multi-domain benchmark, LEMON, with higher quality and diversity than existing benchmarks, to allow a comprehensive assessment of the open-domain generalization of CSC models. Then, we demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens from the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model. This technique can be applied to any model architecture and achieves new state-of-the-art results on SIGHAN, ECSpell, and LEMON.
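The masking strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes character-level source and target sequences of equal length, treats a position as non-error when the source and target characters match, and masks each such position independently with probability 0.2. The function name `mask_non_error_tokens`, the `[MASK]` string, and the per-token sampling scheme are all illustrative assumptions.

```python
import random

MASK = "[MASK]"  # assumed mask symbol (BERT-style); hypothetical choice


def mask_non_error_tokens(source, target, rate=0.2, rng=None):
    """Return a copy of `source` with some non-error positions masked.

    A position i is treated as non-error when source[i] == target[i].
    Each non-error position is masked independently with probability
    `rate` (an assumption; the abstract only states "20%").
    Error positions are never masked, so the error signal is kept.
    """
    if len(source) != len(target):
        raise ValueError("source and target must have equal length")
    if rng is None:
        rng = random.Random()
    masked = list(source)
    for i, (src_tok, tgt_tok) in enumerate(zip(source, target)):
        if src_tok == tgt_tok and rng.random() < rate:
            masked[i] = MASK
    return masked


if __name__ == "__main__":
    src = list("我喜欢吃平果")  # contains one misspelled character
    tgt = list("我喜欢吃苹果")  # corrected sentence
    print(mask_non_error_tokens(src, tgt, rate=0.2, rng=random.Random(0)))
```

In this sketch the training target stays the full corrected sequence, so a masked position has to be predicted from context alone, while the unmasked error positions still expose the model to the observed error pattern.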
