Abstract
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations, which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving a speedup of over 4$\times$ compared to existing value-guided methods. Specifically, we show that ViMaR, trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.