Abstract
We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India's diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes, including festivals, attire, cuisines, art forms, and historical heritage, among many others. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models' ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.