Abstract
Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. Although LLM-based reasoning methods are widely used in other domains, they are not directly applicable to VR because of limited training data, imperfect tools that introduce errors and reduce data collection efficiency, and the difficulty of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to clone only effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely used datasets.