Abstract
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has largely focused on text-based mathematical problems, with limited investigation of problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as their unique geometric logical form and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.