Abstract
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structures. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
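
To make the idea of parameterizing SVG commands and coordinates into discrete tokens concrete, the following is a minimal sketch under stated assumptions. It is not OmniSVG's actual tokenizer: the command set `COMMANDS`, the grid size `GRID`, the vocabulary layout, and the function names are all hypothetical and chosen only for illustration. Each drawing command maps to one token id, and each integer (x, y) point is flattened into a single token id, so a path becomes a flat sequence that a language model could predict.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not taken from the paper):
# a few command tokens first, then one token per cell of a GRID x GRID canvas.
COMMANDS = ["M", "L", "C", "A", "Z"]
GRID = 200  # assumed canvas resolution; coordinates are integers in [0, GRID)
COORD_OFFSET = len(COMMANDS)

Path = List[Tuple[str, List[Tuple[int, int]]]]


def coord_to_token(x: int, y: int) -> int:
    """Flatten an (x, y) point into a single discrete token id."""
    if not (0 <= x < GRID and 0 <= y < GRID):
        raise ValueError(f"point ({x}, {y}) lies outside the {GRID}x{GRID} grid")
    return COORD_OFFSET + x * GRID + y


def token_to_coord(token: int) -> Tuple[int, int]:
    """Invert coord_to_token."""
    idx = token - COORD_OFFSET
    if not (0 <= idx < GRID * GRID):
        raise ValueError(f"token {token} is not a coordinate token")
    return divmod(idx, GRID)


def tokenize_path(path: Path) -> List[int]:
    """Turn [(command, [points...]), ...] into a flat list of token ids."""
    tokens: List[int] = []
    for cmd, points in path:
        tokens.append(COMMANDS.index(cmd))  # structural token
        tokens.extend(coord_to_token(x, y) for x, y in points)  # geometry tokens
    return tokens


def detokenize(tokens: List[int]) -> Path:
    """Rebuild the command/point structure from a flat token sequence."""
    path: Path = []
    for t in tokens:
        if t < 0:
            raise ValueError(f"invalid token {t}")
        if t < COORD_OFFSET:
            path.append((COMMANDS[t], []))  # a command token opens a new segment
        else:
            if not path:
                raise ValueError("coordinate token appears before any command")
            path[-1][1].append(token_to_coord(t))
    return path


example: Path = [("M", [(10, 20)]), ("L", [(30, 40)]), ("Z", [])]
example_tokens = tokenize_path(example)
```

In this sketch the command tokens carry the structure of the drawing while the coordinate tokens carry its geometry, and `detokenize(example_tokens)` recovers `example` exactly because every coordinate already lies on the integer grid.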