Abstract
To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite this progress, it remains nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, by rigorously evaluating Neural Collapse (a learning phenomenon that occurs in the last-layer representations and classifiers of deep networks) on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness across a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. Our code is available at https://github.com/Xujxyang/Fairness-NC-main.