Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

Abstract

Data programming is a programmatic weak supervision approach to efficiently curate large-scale labeled training data. Writing data programs (labeling functions) requires, however, both programming literacy and domain expertise. Many subject matter experts have neither programming proficiency nor time to effectively write data programs. Furthermore, regardless of one's expertise in coding or machine learning, transferring domain expertise into labeling functions by enumerating rules and thresholds is not only time consuming but also inherently difficult. Here we propose a new framework, data programming by demonstration (DPBD), to generate labeling rules using interactive demonstrations of users. DPBD aims to relieve the burden of writing labeling functions from users, enabling them to focus on higher-level semantics such as identifying relevant signals for labeling tasks. We operationalize our framework with Ruler, an interactive system that synthesizes labeling rules for document classification by using span-level annotations of users on document examples. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists creating labeling functions for sentiment and spam classification tasks. We find that Ruler is easier to use and learn and offers higher overall satisfaction, while providing discriminative model performances comparable to ones achieved by conventional data programming.
