Abstract
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.