Thyme: Think Beyond Images

Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT stage on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
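The idea of applying distinct temperatures to text and code generation can be illustrated with a minimal sketch. This is not the paper's implementation: the tag names `<code>` and `</code>`, the temperature values 1.0 and 0.0, and the helper names `temperature_schedule` and `sample_token` are all assumptions made for illustration. The sketch only shows how a sampler might switch temperature depending on whether the token being generated lies inside a code span.

```python
import math
import random

# Hypothetical values and tag names, chosen for illustration only.
TEXT_TEMPERATURE = 1.0   # higher temperature: more exploratory reasoning text
CODE_TEMPERATURE = 0.0   # zero is treated here as greedy decoding for code
CODE_OPEN, CODE_CLOSE = "<code>", "</code>"


def temperature_schedule(tokens):
    """Return the temperature that would be used to sample each token.

    A token is sampled at the code temperature if a code span is open
    when it is generated, and at the text temperature otherwise.
    """
    temps = []
    in_code = False
    for tok in tokens:
        temps.append(CODE_TEMPERATURE if in_code else TEXT_TEMPERATURE)
        # Update the state after the token has been emitted.
        if tok == CODE_OPEN:
            in_code = True
        elif tok == CODE_CLOSE:
            in_code = False
    return temps


def sample_token(logits, temperature, rng):
    """Sample a token index from logits; temperature <= 0 means argmax."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

For the token sequence `["Crop", "<code>", "img", ".crop", "</code>", "Answer"]`, `temperature_schedule` returns `[1.0, 1.0, 0.0, 0.0, 0.0, 1.0]`: the opening tag is still sampled as text, while everything up to and including the closing tag is sampled at the code temperature.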
