FLIP Reasoning Challenge

  • 2025-04-16 18:07:16
  • Andreas Plesner, Turlan Kuzhagaliyev, Roger Wattenhofer

Abstract

Over the past years, advances in artificial intelligence (AI) have demonstrated how AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images, requiring them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, yielding better results than using the raw images directly: 75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.

 
