Language-Agnostic Website Embedding and Classification

  • 2022-01-10 22:31:48
  • Sylvain Lugeon, Tiziano Piccardi, Robert West

Abstract

Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset with more than 1M websites in 92 languages with the corresponding labels collected from Curlie, the largest multilingual crowdsourced Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and can generate embedding representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources. We make publicly available the curated Curlie dataset aligned across languages, the pre-trained Homepage2Vec model, and libraries.

 
