Abstract
Large vision language models exhibit notable limitations on Geometry ProblemSolving (GPS) because of their unreliable diagram interpretation and purenatural-language reasoning. A recent line of work mitigates this by usingsymbolic solvers: the model directly generates a formal program that a geometrysolver can execute. However, this direct program generation lacks intermediatereasoning, making the decision process opaque and prone to errors. In thiswork, we explore a new approach that integrates Chain-of-Thought (CoT) withformal language. The model interleaves natural language reasoning withincremental emission of solver-executable code, producing a hybrid reasoningtrace in which critical derivations are expressed in formal language. To teachthis behavior at scale, we combine (1) supervised fine-tuning on an 11K newlydeveloped synthetic dataset with interleaved natural language reasoning andautomatic formalization, and (2) solver-in-the-loop reinforcement learning thatjointly optimizes both the CoT narrative and the resulting program throughoutcome-based rewards. Built on Qwen2.5-VL-7B, our new model, namedGF-Reasoner, achieves up to 15% accuracy improvements on standard GPSbenchmarks, surpassing both 7B-scale peers and the much larger modelQwen2.5-VL-72B. By exploiting high-order geometric knowledge and offloadingsymbolic computation to the solver, the generated reasoning traces arenoticeably shorter and cleaner. Furthermore, we present a comprehensiveanalysis of method design choices (e.g., reasoning paradigms, data synthesis,training epochs, etc.), providing actionable insights for future research.