Abstract
FinanceBench is a first-of-its-kind test suite for evaluating the performance of LLMs on open-book financial question answering (QA). It comprises 10,231 questions about publicly traded companies, with corresponding answers and evidence strings. The questions in FinanceBench are ecologically valid and cover a diverse set of scenarios. They are intended to be clear-cut and straightforward to answer, so that they serve as a minimum performance standard. We test 16 state-of-the-art model configurations (including GPT-4-Turbo, Llama2 and Claude2, with vector stores and long-context prompts) on a sample of 150 cases from FinanceBench, and manually review their answers (n=2,400). The cases are available open-source. We show that existing LLMs have clear limitations for financial QA. Notably, GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions. While augmentation techniques such as using a longer context window to feed in relevant evidence improve performance, they are unrealistic for enterprise settings due to increased latency and cannot support larger financial documents. We find that all models examined exhibit weaknesses, such as hallucinations, that limit their suitability for use by enterprises.