Abstract
Large Language Models (LLMs) are increasingly adopted as evaluators, offering a scalable alternative to human annotation. However, existing supervised fine-tuning (SFT) approaches often fall short in domains that demand complex reasoning. Judgment is inherently reasoning-intensive: beyond surface-level scoring, it requires verifying evidence, identifying errors, and justifying decisions. Through an analysis of evaluation tasks, we find a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples, revealing the limits of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards to activate reasoning capabilities. JudgeLRM consistently outperforms SFT-tuned baselines of the same size, as well as other RL and SFT variants, and even surpasses state-of-the-art reasoning models: notably, JudgeLRM-3B/4B exceed GPT-4, while JudgeLRM-7B/8B/14B outperform DeepSeek-R1 by over 2% in F1 score, with particularly strong gains on reasoning-heavy tasks. Our findings underscore the value of RL in unlocking reasoning-aligned LLM judges.