Abstract
Time-series reasoning remains a significant challenge for multimodal large language models (MLLMs) due to dynamic temporal patterns, ambiguous semantics, and a lack of temporal priors. In this work, we introduce TimeMaster, a reinforcement learning (RL)-based method that enables time-series MLLMs to perform structured, interpretable reasoning directly over visualized time-series inputs and task prompts. TimeMaster adopts a three-part structured output format (reasoning, classification, and domain-specific extension) and is optimized via a composite reward function that aligns format adherence, prediction accuracy, and open-ended insight quality. The model is trained using a two-stage pipeline: we first apply supervised fine-tuning (SFT) to establish a good initialization, followed by Group Relative Policy Optimization (GRPO) at the token level to enable stable and targeted reward-driven improvement in time-series reasoning. We evaluate TimeMaster, built on Qwen2.5-VL-3B-Instruct, on the TimerBed benchmark across six real-world classification tasks. TimeMaster achieves state-of-the-art performance, outperforming both classical time-series models and few-shot GPT-4o by over 14.6% and 7.3%, respectively. Notably, TimeMaster goes beyond time-series classification: it also exhibits expert-like reasoning behavior, generates context-aware explanations, and delivers domain-aligned insights. Our results highlight that reward-driven RL can be a scalable and promising path toward integrating temporal understanding into time-series MLLMs.
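
To make the composite reward and the group-relative step more concrete, the following is a minimal sketch rather than the method's actual implementation. It assumes the three reward terms are combined as a weighted sum and that each sampled response's advantage is its reward standardized within its sampling group, which is the general idea behind GRPO. The function names, the weights, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, correct, insight_score,
                     w_format=0.2, w_acc=1.0, w_insight=0.5):
    """Combine the three reward terms named in the abstract.

    The weighted-sum form and the default weights are assumptions for
    illustration only; they are not taken from the paper.
    insight_score is expected to lie in [0, 1].
    """
    return (w_format * float(format_ok)
            + w_acc * float(correct)
            + w_insight * float(insight_score))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Each response is scored relative to the other responses drawn for
    the same prompt, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses sampled for one prompt.
rewards = [
    composite_reward(True, True, 0.8),    # well formed, correct, strong insight
    composite_reward(True, False, 0.4),   # well formed, wrong class
    composite_reward(False, True, 0.9),   # correct but malformed output
    composite_reward(True, True, 0.2),    # correct, weak insight
]
advantages = group_relative_advantages(rewards)
```

In a token-level variant, each response's advantage would be applied to every token of that response when forming the policy-gradient loss; that step is omitted here because it depends on training details not given in the abstract.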