NTSEBench: Cognitive Reasoning Benchmark for Vision Language Models

Abstract

Cognitive textual and visual reasoning tasks, including puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. Due to extensive training on vast amounts of human-curated data, LLMs and VLMs excel in common-sense reasoning tasks, but still struggle with more complex reasoning that demands deeper cognitive understanding. We introduce NTSEBench, a new dataset designed to evaluate cognitive multi-modal reasoning and problem-solving skills of large models. The dataset contains 2,728 multiple-choice questions, accompanied by a total of 4,642 images, categorized into 26 different types. These questions are drawn from the nationwide NTSE examination in India and feature a mix of visual and textual general aptitude challenges, designed to assess intelligence and critical thinking skills beyond mere rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle the different modalities (text and images) in the dataset instances.
