Abstract
This paper considers the challenges Large Language Models (LLMs) face when reasoning over text that includes uncertainty explicitly quantified via probability values. This type of reasoning is relevant to a variety of contexts, ranging from everyday conversations to medical decision-making. Despite improvements in the mathematical reasoning capabilities of LLMs, they still exhibit significant difficulties with probabilistic reasoning. To address this problem, we introduce the Bayesian Linguistic Inference Dataset (BLInD), a new dataset specifically designed to test the probabilistic reasoning capabilities of LLMs. We use BLInD to identify the limitations of LLMs on tasks involving probabilistic reasoning. In addition, we present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming. We conclude by evaluating our methods on BLInD and on an adaptation of a causal reasoning question-answering dataset. Our empirical results highlight the effectiveness of our proposed strategies for multiple LLMs.
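
To make concrete what mapping a probabilistic question to Python code can look like, the following is a minimal sketch under stated assumptions. It is not the prompting method evaluated in this paper: the two-variable network (rain causing wet grass), the probability values, and the function names are all hypothetical and chosen only for illustration. The sketch computes a posterior by enumerating the joint distribution and applying Bayes' rule.

```python
# Hypothetical problem text: "It rains with probability 0.2. If it rains, the
# grass is wet with probability 0.9; otherwise it is wet with probability 0.1.
# What is the probability that it rained, given that the grass is wet?"

P_RAIN = 0.2  # assumed prior P(rain)
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}  # assumed P(wet | rain), P(wet | no rain)


def joint(rain: bool, wet: bool) -> float:
    """Joint probability P(rain, wet) via the chain rule P(rain) * P(wet | rain)."""
    p_rain = P_RAIN if rain else 1.0 - P_RAIN
    p_wet = P_WET_GIVEN_RAIN[rain] if wet else 1.0 - P_WET_GIVEN_RAIN[rain]
    return p_rain * p_wet


def posterior_rain_given_wet() -> float:
    """P(rain | wet) = P(rain, wet) / sum over rain of P(rain, wet)."""
    numerator = joint(True, True)
    evidence = sum(joint(rain, True) for rain in (True, False))
    return numerator / evidence


if __name__ == "__main__":
    print(round(posterior_rain_given_wet(), 4))  # 0.18 / 0.26, about 0.6923
```

Writing the computation as explicit code leaves the arithmetic to the interpreter, so the model only has to extract the variables and probabilities from the text correctly.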