Abstract
We introduce a new pretraining approach for language models geared to support multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that our CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using a significantly reduced number of training parameters relative to prior work.
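As a rough illustration of the two ideas above, the following minimal sketch concatenates related documents into one input, masks tokens across all of them, and flags each masked position for global attention while leaving every other position on local attention. The separator tokens, the mask token string, the masking probability, and the function name `build_cross_document_input` are assumptions made for this example and are not specified in the abstract; the attention computation itself is not shown.

```python
import random

# Hypothetical special tokens, chosen only for this illustration.
MASK = "<mask>"
DOC_START, DOC_END = "<doc-s>", "</doc-s>"


def build_cross_document_input(docs, mask_prob=0.15, seed=0):
    """Concatenate related documents, mask tokens across them, and
    mark masked positions for global attention (1) versus local (0).

    docs: list of token lists, one per related document.
    Returns (tokens, labels, global_attention), all of equal length.
    """
    rng = random.Random(seed)
    tokens, labels, global_attention = [], [], []
    for doc in docs:
        for tok in [DOC_START, *doc, DOC_END]:
            is_content = tok not in (DOC_START, DOC_END)
            if is_content and rng.random() < mask_prob:
                tokens.append(MASK)          # token hidden from the model
                labels.append(tok)           # prediction target
                global_attention.append(1)   # masked position attends globally
            else:
                tokens.append(tok)
                labels.append(None)          # not a prediction target
                global_attention.append(0)   # local attention only
    return tokens, labels, global_attention


if __name__ == "__main__":
    related = [
        "the storm hit the coast on monday".split(),
        "monday 's hurricane flooded coastal towns".split(),
    ]
    toks, labs, glob = build_cross_document_input(related, mask_prob=0.3)
    for t, l, g in zip(toks, labs, glob):
        print(f"{t:12s} target={l!s:10s} global={g}")
```

Because a masked token in one document can attend to the whole sequence, the model may draw on the other documents in the same input to recover it, which is the intuition behind cross-document masking as described here.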