DOM-LM: Learning Generalizable Representations for HTML Documents

Abstract

HTML documents are an important medium for disseminating information on the Web for human consumption. An HTML document presents information in multiple text formats including unstructured text, structured key-value pairs, and tables. Effective representation of these documents is essential for machine understanding to enable a wide range of applications, such as Question Answering, Web Search, and Personalization. Existing work has either represented these documents using visual features extracted by rendering them in a browser, which is typically computationally expensive, or has simply treated them as plain text documents, thereby failing to capture useful information presented in their HTML structure. We argue that the text and HTML structure together convey important semantics of the content and therefore warrant a special treatment for their representation learning. In this paper, we introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches by encoding both text and DOM tree structure with a transformer-based encoder and learning generalizable representations for HTML documents via self-supervised pre-training. We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering. Our extensive experiments show that DOM-LM consistently outperforms all baselines designed for these tasks. In particular, DOM-LM demonstrates better generalization performance in both few-shot and zero-shot settings, making it suitable for real-world application settings with limited labeled data.
