VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
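
The two-stage decision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token string `<request_high_res>`, the helper names, and the toy policy are all hypothetical, images are represented only by their (width, height), and "1/4 resolution" is assumed here to mean halving each side.

```python
from typing import Callable, Tuple

Size = Tuple[int, int]  # (width, height); stands in for a real image

# Hypothetical marker: the abstract only says "a special token".
RESIZE_TOKEN = "<request_high_res>"


def downsample(size: Size, factor: int = 2) -> Size:
    """Shrink each side by `factor` (assumption: 2 per side, so ~1/4 of the pixels)."""
    width, height = size
    return max(1, width // factor), max(1, height // factor)


def answer(model: Callable[[Size, str], str], size: Size, question: str) -> Tuple[str, Size]:
    """Try the downsampled image first; retry at full size only if the model asks."""
    low = downsample(size)
    output = model(low, question)
    if RESIZE_TOKEN in output:
        # The model judged the low-resolution input insufficient.
        return model(size, question), size
    return output, low


def toy_model(size: Size, question: str) -> str:
    """Stand-in policy: pretend that reading text needs at least 1000 px of width."""
    if "read" in question and size[0] < 1000:
        return RESIZE_TOKEN
    return f"answer at {size[0]}x{size[1]}"


if __name__ == "__main__":
    print(answer(toy_model, (1024, 768), "what color is the car?"))
    print(answer(toy_model, (1024, 768), "read the sign"))
```

In this toy run, the general question is answered from the downsampled input, while the text-reading question triggers a second pass at full resolution. In the actual method that decision is learned through reinforcement learning rather than hard-coded.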
