Abstract
In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, a bilingual, open-ended, and practical usage-oriented Visual Question Answering (VQA) benchmark. Our benchmark is characterized by its financial focus and expertise: we construct charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), create questions according to the preferences in financial domain inquiries, and have the questions annotated by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is introduced into the multimodal evaluation process for the first time. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks do not perform well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in the categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare the performance of MLLMs in a Chinese context.