Co-occurrence is not Factual Association in Language Models

Abstract

Pretrained language models can encode a large amount of knowledge and utilize it for various reasoning tasks, yet they can still struggle to learn novel factual knowledge effectively from finetuning on limited textual demonstrations. In this work, we show that the reason for this deficiency is that language models are biased to learn word co-occurrence statistics instead of true factual associations. We identify the differences between two forms of knowledge representation in language models: knowledge in the form of co-occurrence statistics is encoded in the middle layers of the transformer model and does not generalize well to reasoning scenarios beyond simple question answering, while true factual associations are encoded in the lower layers and can be freely utilized in various reasoning tasks. Based on these observations, we propose two strategies to improve the learning of factual associations in language models. We show that training on text with implicit rather than explicit factual associations can force the model to learn factual associations instead of co-occurrence statistics, significantly improving the generalization of newly learned knowledge. We also propose a simple training method to actively forget the learned co-occurrence statistics, which unblocks and enhances the learning of factual associations when training on plain narrative text. On both synthetic and real-world corpora, the two proposed strategies improve the generalization of the knowledge learned during finetuning to reasoning scenarios such as indirect and multi-hop question answering.
