Abstract
Sketches are a natural and accessible medium for UI designers to conceptualize early-stage ideas. However, existing research on UI/UX automation often requires high-fidelity inputs like Figma designs or detailed screenshots, limiting accessibility and impeding efficient design iteration. To bridge this gap, we introduce Sketch2Code, a benchmark that evaluates state-of-the-art Vision Language Models (VLMs) on automating the conversion of rudimentary sketches into webpage prototypes. Beyond end-to-end benchmarking, Sketch2Code supports interactive agent evaluation that mimics real-world design workflows, where a VLM-based agent iteratively refines its generations by communicating with a simulated user, either passively receiving feedback instructions or proactively asking clarification questions. We comprehensively analyze ten commercial and open-source models, showing that Sketch2Code is challenging for existing VLMs; even the most capable models struggle to accurately interpret sketches and formulate effective questions that lead to steady improvement. Nevertheless, a user study with UI/UX experts reveals a significant preference for proactive question-asking over passive feedback reception, highlighting the need to develop more effective paradigms for multi-turn conversational agents.