FLIP Reasoning Challenge

Abstract

Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly (75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro). Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
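
As an illustration of how predictions from several models might be combined on a two-choice task like FLIP, the following is a minimal sketch using plain majority voting. The abstract does not state which ensembling method was used, so majority voting is an assumption here, and the function names (`majority_vote`, `ensemble_accuracy`) and the "left"/"right" label encoding are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one challenge's model predictions.

    Ties go to the label that was encountered first (Counter preserves
    insertion order for equal counts).
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_accuracy(model_predictions, labels):
    """Compute the accuracy of a majority-vote ensemble.

    model_predictions: one list per model, each holding that model's
        prediction for every challenge (assumed "left" or "right").
    labels: the correct ordering for every challenge.
    """
    # zip(*...) regroups the predictions so each tuple holds all
    # models' votes for a single challenge.
    votes_per_challenge = zip(*model_predictions)
    correct = sum(
        majority_vote(votes) == label
        for votes, label in zip(votes_per_challenge, labels)
    )
    return correct / len(labels)
```

With an odd number of models, such as the 15 mentioned above, a two-choice vote can never tie, so the tie-breaking rule only matters for even-sized ensembles.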
