Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality

Abstract

During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
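
The re-weighting idea described in the abstract can be illustrated with a small sketch. The abstract does not give the exact weighting formula, so the form used here, w(t) = 1 / (freq(t) + eps)^alpha, is an assumption, as are the names `power_law_weights`, `weighted_cross_entropy`, `alpha`, and `eps`, and the choice to normalize by the sum of weights. It is meant only to show how frequency-based weights could enter a token-level cross-entropy, not to reproduce the paper's implementation.

```python
import math
from collections import Counter


def power_law_weights(token_counts, alpha=1.0, eps=1e-7):
    """Map each token to a weight that decays as a power of its relative frequency.

    Assumed form (not taken from the paper): w(t) = 1 / (freq(t) + eps) ** alpha.
    alpha controls how strongly frequent tokens are down-weighted; alpha = 0
    gives every token weight 1, i.e. plain cross-entropy.
    """
    total = sum(token_counts.values())
    return {
        tok: 1.0 / ((cnt / total) + eps) ** alpha
        for tok, cnt in token_counts.items()
    }


def weighted_cross_entropy(log_probs, targets, weights):
    """Weighted average negative log-likelihood of the target tokens.

    log_probs: one dict per position, mapping token -> log-probability.
    targets:   the reference token at each position.
    weights:   token -> weight, e.g. from power_law_weights.
    Dividing by the sum of weights is a design choice made for this sketch.
    """
    weighted_nll = 0.0
    weight_sum = 0.0
    for position_log_probs, target in zip(log_probs, targets):
        w = weights[target]
        weighted_nll += -w * position_log_probs[target]
        weight_sum += w
    return weighted_nll / weight_sum


# Toy corpus: "the" is frequent, "cat" and "mat" are rare.
corpus = "the the the the cat sat on the mat".split()
weights = power_law_weights(Counter(corpus), alpha=1.0)

# Hypothetical model log-probabilities for two target positions.
log_probs = [
    {"the": math.log(0.9), "cat": math.log(0.1)},
    {"the": math.log(0.6), "cat": math.log(0.4)},
]
targets = ["the", "cat"]
loss = weighted_cross_entropy(log_probs, targets, weights)
```

In this toy setup the rare token "cat" receives a larger weight than "the", so an error on "cat" contributes more to the loss than an equally large error on "the". In a real finetuning setup the same per-token weights would typically be applied to the framework's unreduced cross-entropy values.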
