Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Abstract

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
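
To make the described sample layout concrete, the following is a minimal sketch of how one instruction-tuning record could be represented: a webpage image and an instruction as input, the page's HTML as the response, and optional QA pairs about the page content. The class names, field names, file path, and the `to_instruction_turns` helper are all hypothetical and chosen for illustration; the released dataset may use a different schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAPair:
    """One natural-language question/answer pair about a webpage (hypothetical schema)."""
    question: str
    answer: str


@dataclass
class WebpageSample:
    """One webpage-to-code record (field names are assumptions, not the released format)."""
    image_path: str   # rendered webpage screenshot (model input)
    instruction: str  # natural-language instruction (model input)
    html: str         # HTML code of the webpage (model response)
    qa_pairs: List[QAPair] = field(default_factory=list)


def to_instruction_turns(sample: WebpageSample) -> List[Dict[str, str]]:
    """Flatten a sample into prompt/response turns that share the same image."""
    turns = [{"image": sample.image_path,
              "prompt": sample.instruction,
              "response": sample.html}]
    for qa in sample.qa_pairs:
        turns.append({"image": sample.image_path,
                      "prompt": qa.question,
                      "response": qa.answer})
    return turns


# Illustrative record; the path and page content are made up.
example = WebpageSample(
    image_path="screenshots/example.png",
    instruction="Generate the HTML code for this webpage.",
    html="<html><body><h1>Hello</h1></body></html>",
    qa_pairs=[QAPair("What is the page heading?", "Hello")],
)
example_turns = to_instruction_turns(example)
```

Keeping the code-generation turn and the QA turns in one flat list is only one possible design; it simply reflects that both kinds of response are grounded in the same screenshot.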
