Abstract
Recent advances in language models have opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, but they have also uncovered important limitations: small language models (SLMs) require training data, which can be both expensive and challenging to obtain, and large language models (LLMs) often incur high computational costs and must deal with constraints imposed by context windows. We present Magneto, a cost-effective and accurate solution for schema matching that combines the advantages of SLMs and LLMs to address their limitations. By structuring the schema matching pipeline in two phases, retrieval and reranking, Magneto can use computationally efficient SLM-based strategies to derive candidate matches, which can then be reranked by LLMs, thus making it possible to reduce runtime without compromising matching accuracy. We propose a self-supervised approach to fine-tune SLMs, which uses LLMs to generate syntactically diverse training data, and prompting strategies that are effective for reranking. We also introduce a new benchmark, developed in collaboration with domain experts, which includes real biomedical datasets and presents new challenges to schema matching methods. Through a detailed experimental evaluation, using both our new and existing benchmarks, we show that Magneto is scalable and attains high accuracy for datasets from different domains.
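
To make the two-phase structure concrete, the following is a minimal sketch of a generic retrieve-then-rerank loop; it is not Magneto's implementation. It assumes a plain string-similarity function as a stand-in for the SLM-based retriever and a small hand-written lookup table as a stand-in for the LLM reranker. All function names, column names, and the value of k are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def cheap_similarity(a: str, b: str) -> float:
    """Stand-in for an SLM-based score: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def retrieve(source_cols: List[str], target_cols: List[str], k: int) -> Dict[str, List[str]]:
    """Phase 1: for each source column, keep the k most similar target columns."""
    candidates = {}
    for s in source_cols:
        ranked = sorted(target_cols, key=lambda t: cheap_similarity(s, t), reverse=True)
        candidates[s] = ranked[:k]
    return candidates


def rerank(candidates: Dict[str, List[str]],
           scorer: Callable[[str, str], float]) -> Dict[str, List[str]]:
    """Phase 2: reorder each short candidate list with a costlier scorer.

    The scorer is called only on the k retrieved candidates per source column,
    not on every source/target pair, which is where the cost saving comes from.
    """
    return {
        s: sorted(ts, key=lambda t: scorer(s, t), reverse=True)
        for s, ts in candidates.items()
    }


# Hypothetical placeholder for an LLM judgment: a tiny table of known matches.
KNOWN_MATCHES = {("patient_age", "age"), ("sex", "gender")}


def toy_scorer(source: str, target: str) -> float:
    """Return 1.0 for pairs in the lookup table and 0.0 otherwise."""
    return 1.0 if (source, target) in KNOWN_MATCHES else 0.0


source_columns = ["patient_age", "sex"]
target_columns = ["age", "gender", "tumor_stage", "patient_id"]

candidates = retrieve(source_columns, target_columns, k=3)
reranked = rerank(candidates, toy_scorer)

for column, matches in reranked.items():
    print(column, "->", matches)
```

In this toy run the string-based retriever ranks `patient_id` above `age` for `patient_age`, and the reranker then promotes `age`; in the same spirit, a cheap first phase only needs to place the correct match somewhere in the top k for the second phase to recover it.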