Abstract
With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we demonstrate how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.