Abstract
We present Ditto, a novel entity matching (EM) system based on pre-trained Transformer-based language models. We cast EM as a sequence-pair classification problem and fine-tune such models for it, which lets us leverage them with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms the previous state-of-the-art (SOTA) by up to 29% in F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA data augmentation technique for text to EM, augmenting the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
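To make the sequence-pair formulation concrete, the following is a minimal sketch of how two records might be flattened into a pair of text sequences for a classifier. The `COL`/`VAL` markers, the function names, the literal `[CLS]`/`[SEP]` strings, and the sample records are illustrative assumptions, not details stated in this abstract; the fine-tuned language model that would score the pair is not shown.

```python
from typing import Dict, Tuple

Record = Dict[str, str]


def serialize(record: Record) -> str:
    """Flatten an attribute-value record into a single string.

    The COL/VAL markers are an assumed, illustrative scheme for keeping
    attribute names and values distinguishable in the flattened text.
    """
    return " ".join(f"COL {attr} VAL {value}" for attr, value in record.items())


def to_sequence_pair(left: Record, right: Record) -> Tuple[str, str]:
    """Turn a candidate pair of records into the two sequences to classify."""
    return serialize(left), serialize(right)


def to_model_input(left: Record, right: Record,
                   cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Join both sequences with BERT-style special tokens (assumed layout)."""
    first, second = to_sequence_pair(left, right)
    return f"{cls} {first} {sep} {second} {sep}"


# Hypothetical records, made up for illustration only.
left = {"title": "instant immersion spanish deluxe 2.0", "price": "49.99"}
right = {"title": "instant immers spanish dlux 2", "price": "36.11"}

example = to_model_input(left, right)
print(example)
```

In practice, a tokenizer for the chosen language model would typically insert its own special tokens, so only the two serialized strings would be passed to it; the explicit `[CLS]`/`[SEP]` string above just shows the shape of the input that a binary match/no-match classifier would see.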