Symbolic Graphics Programming with Large Language Models

Abstract

Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
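
To make the reward structure concrete, the following is a minimal sketch of a validity-gated reward, not the paper's implementation. It assumes the gate can be approximated by checking that the output parses as XML with an `<svg>` root, and that an invalid program receives zero reward; the abstract states neither detail. The names `is_well_formed_svg`, `gated_reward`, and the `similarity` callable are hypothetical. In practice `similarity` would render the SVG and score it against the prompt with a vision encoder such as SigLIP.

```python
import xml.etree.ElementTree as ET
from typing import Callable

SVG_NS = "{http://www.w3.org/2000/svg}"


def is_well_formed_svg(program: str) -> bool:
    """Approximate validity gate (assumption): XML parses and the root is <svg>."""
    try:
        root = ET.fromstring(program)
    except ET.ParseError:
        return False
    # Accept both namespaced and bare <svg> roots.
    return root.tag in ("svg", SVG_NS + "svg")


def gated_reward(
    program: str,
    prompt: str,
    similarity: Callable[[str, str], float],
) -> float:
    """Return 0.0 for programs that fail the gate (assumed penalty),
    otherwise the score from the hypothetical `similarity(prompt, program)`."""
    if not is_well_formed_svg(program):
        return 0.0
    return similarity(prompt, program)
```

The gate runs before the scorer, so the (typically expensive) rendering and encoding step is only paid for programs that can actually be rendered.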
