Abstract
Synthetic verification techniques, such as generating test cases and reward modelling, are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of synthetic verifiers on the proposed benchmarks. Using this approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.