Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Abstract

Text images are a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and group them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills in general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.
