Abstract
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite the empirical successes of judges, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.