Sato: Contextual Semantic Type Detection in Tables

  • 2019-11-14 18:51:59
  • Dan Zhang, Yoshihiko Suhara, Jinfeng Li, Madelon Hulsebos, Çağatay Demiralp, Wang-Chiew Tan
  • 43

Abstract

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes in the training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.901 and 0.973, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
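
The abstract reports two averages of per-type F1. As a minimal sketch of how they differ, the snippet below computes both on a toy set of column labels. The labels, the predictions, and the helper names (per_class_f1, macro_f1, weighted_f1) are illustrative assumptions, not the paper's data or code. Macro averaging gives every semantic type equal weight, while support-weighted averaging weights each type by how many columns carry it.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label present in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, tp + fn)  # support = number of true columns of this type
    return scores


def macro_f1(scores):
    """Unweighted mean of per-type F1: rare types count as much as common ones."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-type F1 weighted by support."""
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# Toy data (hypothetical): 8 "city" columns, 2 "isbn" columns, one isbn mislabeled.
y_true = ["city"] * 8 + ["isbn"] * 2
y_pred = ["city"] * 8 + ["city", "isbn"]

scores = per_class_f1(y_true, y_pred)
print(f"macro F1:    {macro_f1(scores):.3f}")
print(f"weighted F1: {weighted_f1(scores):.3f}")
```

On this toy input the single error falls on the rare type, so it lowers the macro average more than the support-weighted one. The two averages can diverge in either direction depending on which types are hard.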

 
