Abstract
The emergence of LM-based judging reward modeling, represented by generative reward models, has made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares a fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that slot-prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance than mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that uses an explanation-based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals than generative reward models.