Enhancing elusive clues in knowledge learning by contrasting attention of language models

Abstract

Causal language models acquire a vast amount of knowledge from general text corpora during pretraining, but the efficiency of knowledge learning is known to be unsatisfactory, especially when learning from knowledge-dense and small-sized corpora. The deficiency can come from long-distance dependencies, which are hard for language models to capture, and from overfitting to co-occurrence patterns and distracting clues in the training text. To address these issues, this paper proposes a method to enhance knowledge learning during language model pretraining by enhancing elusive but important clues in text discovered by the language models themselves. We find that larger language models pay more attention to non-obvious but important clues, which are often overlooked by smaller language models. Therefore, we can identify these clues by contrasting the attention weights of large and small language models. We use the identified clues as a guide to perform token-dropout data augmentation on the training text, and observe a significant boost in both small and large models' performance in fact memorization. This shows that the behavior contrast between more and less performant language models contains important clues for knowledge learning, and that it can be "amplified" for a straightforward improvement in knowledge learning efficiency.
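
The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify how attention weights are aggregated or how the contrast is turned into dropout rates, so the subtraction-based contrast, the linear mapping to drop probabilities, the `max_p` parameter, the function names, and the toy attention values below are all assumptions made for illustration. The sketch assumes each token already has one aggregated attention weight from a large model and one from a small model, and that tokens the large model attends to more strongly should be protected from dropout.

```python
import random

def contrast_scores(large_attn, small_attn):
    """Per-token contrast: large-model attention minus small-model attention (assumed form)."""
    if len(large_attn) != len(small_attn):
        raise ValueError("attention vectors must have the same length")
    return [a_large - a_small for a_large, a_small in zip(large_attn, small_attn)]

def dropout_probs(scores, max_p=0.5):
    """Map contrast linearly to drop probabilities: highest contrast -> 0, lowest -> max_p.

    max_p is a hypothetical hyperparameter, not taken from the paper.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No contrast signal: fall back to a uniform drop rate.
        return [max_p / 2] * len(scores)
    return [max_p * (hi - s) / (hi - lo) for s in scores]

def token_dropout(tokens, probs, rng):
    """Drop each token independently with its own probability."""
    return [tok for tok, p in zip(tokens, probs) if rng.random() >= p]

# Toy example with made-up attention weights (illustrative only).
tokens = ["Alice", ",", "born", "in", "Paris", ",", "studied", "physics"]
large = [0.30, 0.02, 0.05, 0.03, 0.40, 0.02, 0.08, 0.10]
small = [0.10, 0.05, 0.10, 0.05, 0.20, 0.05, 0.25, 0.20]

scores = contrast_scores(large, small)
probs = dropout_probs(scores)

# Several augmented copies of the same sentence, one per random seed.
augmented = [token_dropout(tokens, probs, random.Random(seed)) for seed in range(3)]
for variant in augmented:
    print(" ".join(variant))
```

In this sketch, the token with the largest contrast receives a drop probability of zero and therefore survives in every augmented copy, while tokens that the small model over-attends to are dropped most often. In practice the attention weights would come from real models rather than hand-written lists.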
