Improving the Robustness of Large Language Models via Consistency Alignment

Abstract

Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses due to minor changes in the verbalized instructions. Recent literature has explored this inconsistency issue, highlighting the importance of continued improvement in the robustness of response generation. However, systematic analysis and solutions are still lacking. In this paper, we quantitatively define the inconsistency problem and propose a two-stage training framework consisting of instruction-augmented supervised fine-tuning and consistency alignment training. The first stage helps a model generalize on following instructions via similar instruction augmentations. In the second stage, we improve the diversity and help the model understand which responses are more aligned with human expectations by differentiating subtle differences in similar responses. The training process is accomplished by self-rewards inferred from the trained model at the first stage without referring to external human preference resources. We conduct extensive experiments on recent publicly available LLMs on instruction-following tasks and demonstrate the effectiveness of our training framework.
