Studying Large Language Model Generalization with Influence Functions

Abstract

When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
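
To make the counterfactual concrete, here is a minimal sketch of the classical influence-function estimate on a toy ridge regression, where the Hessian is small enough to invert exactly. It is an illustration under stated assumptions, not the EK-FAC procedure the abstract describes: the model, the squared-error loss, the use of the L2 term as damping, and the names `fit_ridge`, `loss_grad`, and `influence` are all hypothetical choices made for this example. The estimate is the negative inner product of the query-loss gradient, the inverse Hessian, and the candidate's training-loss gradient.

```python
import numpy as np

def fit_ridge(X, y, damping, extra=None, eps=0.0):
    """Minimize the mean of 0.5*(x@theta - y)**2 plus 0.5*damping*||theta||^2.

    If `extra` is given, its loss is added to the objective with weight `eps`,
    which is the counterfactual the influence estimate approximates.
    """
    n, d = X.shape
    A = X.T @ X / n + damping * np.eye(d)
    b = X.T @ y / n
    if extra is not None:
        x_m, y_m = extra
        A = A + eps * np.outer(x_m, x_m)
        b = b + eps * y_m * x_m
    return np.linalg.solve(A, b)

def loss_grad(theta, x, y):
    """Gradient of 0.5 * (x @ theta - y)**2 with respect to theta."""
    return (x @ theta - y) * x

def influence(X, y, damping, candidate, query):
    """First-order change in the query loss per unit weight on the candidate."""
    n, d = X.shape
    theta = fit_ridge(X, y, damping)
    hessian = X.T @ X / n + damping * np.eye(d)  # exact Hessian of the toy objective
    # Inverse-Hessian-vector product (IHVP); this is the step that is
    # expensive for large models and that approximations aim to speed up.
    ihvp = np.linalg.solve(hessian, loss_grad(theta, *query))
    return -float(ihvp @ loss_grad(theta, *candidate))

# Toy data (assumed purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
candidate = (np.array([1.0, 0.0, 1.0]), 3.0)  # sequence we imagine adding
query = (np.array([0.5, 1.0, -1.0]), 0.0)     # behavior we want to explain
print(influence(X, y, 1e-2, candidate, query))
```

A positive value means that upweighting the candidate would raise the query loss to first order, and a negative value means it would lower it. In this toy setting the estimate can be checked by refitting with a small `eps` and taking a finite difference of the query loss, a check that is not practical at LLM scale.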
