J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Abstract

To keep pace with the accelerating development of large language models (LLMs), model output evaluation has shifted from time-consuming human evaluation to automatic evaluation, in which LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in relatively simple domains, like chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content. To remedy these shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.
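To make the notion of positional bias concrete, the following is a minimal sketch of how one might check whether a pairwise judge's verdict depends on the order in which two responses are presented. It is not the EIS-GRPO algorithm and not the metric used by ReasoningJudgeBench; the `Judge` signature, the `consistent_accuracy` function, and the two toy judges are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature (an assumption, not from the paper):
# (question, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def consistent_accuracy(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs judged correctly under BOTH response orderings.

    Each item in `pairs` is (question, better_response, worse_response).
    """
    if not pairs:
        return 0.0
    correct = 0
    for question, better, worse in pairs:
        first = judge(question, better, worse)   # better response in slot A
        second = judge(question, worse, better)  # better response in slot B
        # Count the pair only if the judge picks the better response both times.
        if first == "A" and second == "B":
            correct += 1
    return correct / len(pairs)


def always_first(question: str, a: str, b: str) -> str:
    """Toy judge with extreme positional bias: always picks slot A."""
    return "A"


def prefers_longer(question: str, a: str, b: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("q1", "a long and careful answer", "short"),
    ("q2", "a detailed reply", "no"),
]
biased_score = consistent_accuracy(always_first, pairs)
unbiased_score = consistent_accuracy(prefers_longer, pairs)
```

A judge that always favors the first slot is right on exactly one ordering of every pair, so it scores zero under this stricter check, while a judge whose verdict does not depend on position is unaffected by the swap.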
