CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Abstract

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurately represent diverse cultural contexts -- where missed cues can stereotype communities and undermine usability. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit (stated) as well as implicit (unstated, implied by the prompt's cultural context) cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we show that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, provide a concrete testbed, and outline actionable directions for developing culturally informed T2I models and metrics that improve global usability.
