Abstract
Similar to LLMs, the development of vision language models (VLMs) is mainly driven by English datasets and by models trained in English and Chinese, whereas support for other languages, even those considered high-resource languages such as German, remains significantly weaker. In this work, we present an analysis of open-weight VLMs on factual knowledge in German and English. We disentangle the image-related aspects from the textual ones by analyzing accuracy with jury-as-a-judge in both prompt languages and with images from German and international contexts. We found that for celebrities and sights, VLMs struggle because they lack visual cognition of German image contents. For animals and plants, the tested models can often correctly identify the image contents by their scientific name or English common name but fail in German. Cars and supermarket products were identified equally well in English and German images across both prompt languages.