Abstract
Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
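
As a rough illustration of the two components named above, the following minimal Python sketch shows one critique-then-edit step: a draft is scored and critiqued, revised in light of the critique, and scored again. Every name here (`Feedback`, `Candidate`, `critique_post_edit_step`, the `toy_*` stand-ins, and the score dimensions "helpfulness" and "personalization") is a hypothetical placeholder chosen for illustration, not the framework's actual interface. How the resulting candidates enter the policy update is not specified in this abstract and is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Feedback:
    scores: Dict[str, float]  # multi-dimensional scores; dimension names are assumed
    critique: str             # textual critique produced by the reward model


@dataclass
class Candidate:
    response: str
    feedback: Feedback
    edited: bool              # True if the response is a post-edited revision


def critique_post_edit_step(
    prompt: str,
    persona: str,
    generate: Callable[[str, str], str],
    judge: Callable[[str, str, str], Feedback],
    revise: Callable[[str, str, str, str], str],
) -> List[Candidate]:
    """Run one draft -> critique -> revision cycle and return both candidates."""
    draft = generate(prompt, persona)
    draft_feedback = judge(prompt, persona, draft)
    # The policy revises its own draft, conditioned on the textual critique.
    revised = revise(prompt, persona, draft, draft_feedback.critique)
    revised_feedback = judge(prompt, persona, revised)
    return [
        Candidate(draft, draft_feedback, edited=False),
        Candidate(revised, revised_feedback, edited=True),
    ]


# Toy stand-ins so the sketch runs without any model; real systems would call LLMs.
def toy_generate(prompt: str, persona: str) -> str:
    return f"Answer to: {prompt}"


def toy_judge(prompt: str, persona: str, response: str) -> Feedback:
    personalized = persona.lower() in response.lower()
    scores = {"helpfulness": 1.0, "personalization": 1.0 if personalized else 0.0}
    critique = "OK" if personalized else f"Does not reflect the user's persona: {persona}"
    return Feedback(scores, critique)


def toy_revise(prompt: str, persona: str, draft: str, critique: str) -> str:
    return f"{draft} (tailored for a {persona})"


if __name__ == "__main__":
    for cand in critique_post_edit_step(
        "Suggest a dinner", "vegetarian cook", toy_generate, toy_judge, toy_revise
    ):
        print(cand.edited, cand.feedback.scores, cand.response)
```

Passing the generator, judge, and reviser in as callables keeps the sketch independent of any particular model or training library; in practice the same policy model would presumably serve as both `generate` and `revise`.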