Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

Abstract

While AI agents have shown remarkable performance on a variety of tasks, they still struggle with complex multi-modal applications, structured generation, and strategic planning. Improving them through standard fine-tuning is often impractical, as solving agentic tasks usually relies on black-box API access without control over model parameters. Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative for improving performance. However, BON lacks a mechanism for integrating iterative feedback. Hence, we propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier. IAD differs in how feedback is designed and integrated, and is specifically optimized to extract maximal signal from reward scores. We conduct a detailed comparison of baselines across key metrics on Sketch2Code, Text2SQL, and Webshop, where IAD consistently outperforms baselines, achieving 3–6% absolute gains on Sketch2Code and Text2SQL (with and without LLM judges) and 8–10% gains on Webshop across multiple metrics. To better understand the source of IAD's gains, we perform controlled experiments to disentangle the effect of adaptive feedback from stochastic sampling, and find that IAD's improvements are primarily driven by verifier-guided refinement, not merely sampling diversity. We also show that both IAD and BON exhibit inference-time scaling with increased compute when guided by an optimal verifier. Our analysis highlights the critical role of verifier quality in effective inference-time optimization and examines the impact of noisy and sparse rewards on scaling behavior. Together, these findings offer key insights into the trade-offs and principles of effective inference-time optimization.
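
To make the contrast between Best-of-N sampling and verifier-guided refinement concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the IAD algorithm, so the function names (`best_of_n`, `iterative_refine`), the generator and verifier signatures, and the toy string-matching task are all hypothetical. The toy generator also ignores the feedback text and conditions only on the previous best candidate; an LLM-based generator would presumably use both.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, assumed for illustration only:
#   generate(task, previous_best, feedback) -> candidate
#   verify(task, candidate) -> score (higher is better)
Generator = Callable[[str, Optional[str], Optional[str]], str]
Verifier = Callable[[str, str], float]


def best_of_n(task: str, generate: Generator, verify: Verifier,
              n: int) -> Tuple[float, str]:
    """Draw n independent candidates and keep the highest-scoring one."""
    scored = [(verify(task, c), c)
              for c in (generate(task, None, None) for _ in range(n))]
    return max(scored, key=lambda pair: pair[0])


def iterative_refine(task: str, generate: Generator, verify: Verifier,
                     rounds: int, n_per_round: int) -> Tuple[float, Optional[str]]:
    """Each round conditions new candidates on the best one found so far."""
    best_score, best = float("-inf"), None
    feedback: Optional[str] = None
    for _ in range(rounds):
        for _ in range(n_per_round):
            cand = generate(task, best, feedback)
            score = verify(task, cand)
            if score > best_score:  # keep a candidate only if it improves
                best_score, best = score, cand
        # Feed the verifier's signal back to the generator for the next round.
        feedback = f"best score so far: {best_score:.3f}"
    return best_score, best


# --- Toy task (an assumption): recover a hidden target string. ---
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
_rng = random.Random(0)


def toy_verify(task: str, candidate: str) -> float:
    """Fraction of positions where the candidate matches the target."""
    return sum(a == b for a, b in zip(task, candidate)) / len(task)


def toy_generate(task: str, previous: Optional[str],
                 feedback: Optional[str]) -> str:
    """Random string on the first call; otherwise mutate one character."""
    if previous is None:
        return "".join(_rng.choice(ALPHABET) for _ in task)
    i = _rng.randrange(len(previous))
    return previous[:i] + _rng.choice(ALPHABET) + previous[i + 1:]


if __name__ == "__main__":
    target = "verifier"
    # Same budget of 200 generator calls for both strategies.
    print("BON:    ", best_of_n(target, toy_generate, toy_verify, n=200))
    print("Refine: ", iterative_refine(target, toy_generate, toy_verify,
                                       rounds=20, n_per_round=10))
```

In this sketch both strategies spend the same number of generator and verifier calls; the only difference is whether new candidates are conditioned on the best earlier candidate. Because the verifier decides which candidate survives each round, a noisy verifier would steer refinement toward poor candidates, which is the sensitivity to verifier quality that the abstract highlights.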
