On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies

Abstract

We study how masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains. Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions for downstream tasks. While appealing, we show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone. We construct cloze-like masks using task-specific lexicons for three different classification datasets and show that the majority of pretrained performance gains come from generic masks that are not associated with the lexicon. To explain the empirical success of these generic masks, we demonstrate a correspondence between the Masked Language Model (MLM) objective and existing methods for learning statistical dependencies in graphical models. Using this, we derive a method for extracting these learned statistical dependencies in MLMs and show that these dependencies encode useful inductive biases in the form of syntactic structures. In an unsupervised parsing evaluation, simply forming a minimum spanning tree on the implied statistical dependence structure outperforms a classic method for unsupervised parsing (58.74 vs. 55.91 UUAS).
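To make the parsing evaluation concrete, the following is a minimal sketch of the last step only: building a spanning tree over a pairwise dependence matrix and scoring it with UUAS (the fraction of gold undirected edges that the predicted tree recovers). It does not show how the dependence scores are extracted from an MLM. The function names, the toy matrix, and the gold edges are illustrative assumptions, not taken from the paper. The sketch also assumes that higher scores mean stronger dependence, so it keeps the strongest edges; this is equivalent to a minimum spanning tree over negated scores.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # undirected edge stored as (smaller index, larger index)


def max_spanning_tree(scores: List[List[float]]) -> Set[Edge]:
    """Prim's algorithm on a dense symmetric matrix of dependence scores.

    Assumption: scores[i][j] is larger when tokens i and j are more dependent.
    Returns the undirected edges of a spanning tree with maximum total score.
    """
    n = len(scores)
    if n == 0:
        return set()
    in_tree = [False] * n
    in_tree[0] = True
    # best_score[k]: strongest link from node k to any node already in the tree
    best_score = [scores[0][k] for k in range(n)]
    best_from = [0] * n
    edges: Set[Edge] = set()
    for _ in range(n - 1):
        # Attach the outside node with the strongest link into the tree.
        j = max((k for k in range(n) if not in_tree[k]),
                key=lambda k: best_score[k])
        i = best_from[j]
        edges.add((min(i, j), max(i, j)))
        in_tree[j] = True
        # The new tree node may offer stronger links to the remaining nodes.
        for k in range(n):
            if not in_tree[k] and scores[j][k] > best_score[k]:
                best_score[k] = scores[j][k]
                best_from[k] = j
    return edges


def uuas(predicted: Set[Edge], gold: Set[Edge]) -> float:
    """Undirected unlabeled attachment score: share of gold edges recovered."""
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


# Toy 4-token sentence with made-up dependence scores (hypothetical values).
toy_scores = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.8, 0.3],
    [0.1, 0.8, 0.0, 0.7],
    [0.2, 0.3, 0.7, 0.0],
]
toy_gold = {(0, 1), (1, 2), (1, 3)}  # hypothetical gold parse, undirected

tree = max_spanning_tree(toy_scores)
print(sorted(tree))          # edges of the predicted tree
print(uuas(tree, toy_gold))  # fraction of gold edges recovered
```

On this toy input the predicted tree is the chain 0-1, 1-2, 2-3, which recovers two of the three gold edges. Corpus-level UUAS is conventionally computed over all gold edges in the evaluation set; whether the reported figures follow that exact protocol is not stated in the abstract.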
