Abstract
Converting webpage designs into functional UI code is a critical step in building websites, but it can be labor-intensive and time-consuming. To automate this design-to-code transformation, various methods using learning-based networks and multi-modal large language models (MLLMs) have been proposed. However, these studies were evaluated only on a narrow range of static web pages and ignored dynamic interaction elements, making them less practical for real-world website deployment. To fill this gap, we present the first systematic investigation of MLLMs in generating interactive webpages. Specifically, we first formulate the Interaction-to-Code task and build the Interaction2Code benchmark, which contains 97 unique web pages and 213 distinct interactions, spanning 15 webpage types and 30 interaction categories. We then conduct comprehensive experiments on three state-of-the-art (SOTA) MLLMs using both automatic metrics and human evaluations, and summarize six findings. Our experimental results highlight the limitations of MLLMs in generating fine-grained interactive features and in managing interactions with complex transformations and subtle visual modifications. We further analyze failure cases and their underlying causes, identifying 10 common failure types and assessing their severity. Additionally, our findings reveal three critical influencing factors, i.e., prompts, visual saliency, and textual descriptions, that can enhance the interaction generation performance of MLLMs. Based on these findings, we derive implications for researchers and developers, providing a foundation for future advancements in this field. Datasets and source code are available at https://github.com/WebPAI/Interaction2Code.