Abstract
Test-time scaling offers a promising way to improve the reasoning performance of vision-language large models (VLLMs) without additional training. In this paper, we explore a simple but effective approach for applying test-time scaling to radiology report generation. Specifically, we introduce a lightweight Thought Graph Traversal (TGT) framework that guides the model to reason through organ-specific findings in a medically coherent order. This framework integrates structured medical priors into the prompt, enabling deeper and more logical analysis with no changes to the underlying model. To further enhance reasoning depth, we apply a reasoning budget forcing strategy that adjusts the model's inference depth at test time by dynamically extending its generation process. This simple yet powerful combination allows a frozen radiology VLLM to self-correct and generate more accurate, consistent chest X-ray reports. Our method outperforms baseline prompting approaches on standard benchmarks, and also reveals dataset biases through traceable reasoning paths. Code and prompts are open-sourced for reproducibility at https://github.com/glerium/Thought-Graph-Traversal.
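As a rough illustration of the two ideas summarized above, the sketch below orders a small organ graph into a step-by-step prompt and extends generation until a minimum reasoning budget is met. It is a minimal sketch under stated assumptions, not the released implementation: the graph contents, the names `THOUGHT_GRAPH`, `traversal_order`, `build_prompt`, and `generate_with_budget`, the word-count budget, and the "Wait" continuation cue are all hypothetical, and `generate` stands in for any call to a frozen VLLM.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical organ graph (an assumption, not the paper's graph):
# each key lists the regions to examine after it.
THOUGHT_GRAPH: Dict[str, List[str]] = {
    "airways": ["lungs"],
    "lungs": ["pleura", "heart"],
    "pleura": ["bones"],
    "heart": ["mediastinum"],
    "mediastinum": ["bones"],
    "bones": [],
}


def traversal_order(graph: Dict[str, List[str]]) -> List[str]:
    """Topologically order the graph (Kahn's algorithm, insertion-order ties)."""
    indegree = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            indegree[child] += 1
    queue = deque(node for node in graph if indegree[node] == 0)
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(graph):
        raise ValueError("thought graph contains a cycle")
    return order


def build_prompt(order: List[str]) -> str:
    """Turn the traversal order into numbered reasoning steps for the prompt."""
    steps = [f"{i}. Describe findings for the {organ}." for i, organ in enumerate(order, 1)]
    return "Analyze the chest X-ray in this order:\n" + "\n".join(steps)


def generate_with_budget(
    generate: Callable[[str], str],  # hypothetical wrapper around a frozen model
    prompt: str,
    min_words: int,
    max_extensions: int = 3,
    cue: str = "Wait",
) -> str:
    """Re-prompt with a continuation cue until the output reaches min_words."""
    text = generate(prompt)
    extensions = 0
    while len(text.split()) < min_words and extensions < max_extensions:
        text = text + "\n" + cue + ", "
        text += generate(prompt + "\n" + text)
        extensions += 1
    return text


if __name__ == "__main__":
    prompt = build_prompt(traversal_order(THOUGHT_GRAPH))
    # Stub generator used only to make the sketch runnable without a model.
    print(generate_with_budget(lambda p: "finding noted.", prompt, min_words=5))
```

Encoding the order as a graph rather than a fixed list keeps the medical prior explicit: a region is only visited after every region it depends on, and the same structure makes each step of the resulting report traceable.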