Simplifying Models with Unlabeled Output Data

Abstract

We focus on prediction problems with high-dimensional outputs that are subject to output validity constraints, e.g. a pseudocode-to-code translation task where the code must compile. For these problems, labeled input-output pairs are expensive to obtain, but "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available and provide information about output validity (e.g. code on GitHub). In this paper, we present predict-and-denoise, a framework that can leverage unlabeled outputs. Specifically, we first train a denoiser to map possibly invalid outputs to valid outputs using synthetic perturbations of the unlabeled outputs. Second, we train a predictor composed with this fixed denoiser. We show theoretically that for a family of functions with a discrete valid output space, composing with a denoiser reduces the complexity of a 2-layer ReLU network needed to represent the function and that this complexity gap can be arbitrarily large. We evaluate the framework empirically on several datasets, including image generation from attributes and pseudocode-to-code translation. On the SPoC pseudocode-to-code dataset, our framework improves the proportion of code outputs that pass all test cases by 3-4% over a baseline Transformer.
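To make the two-step recipe concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a one-dimensional problem whose valid outputs are the integers 0 to 5, replaces the learned denoiser with a nearest-valid-output projection built from the unlabeled outputs, and fits the base predictor directly by least squares instead of training it through the fixed denoiser. All names (`make_denoiser`, `fit_linear`, `composed`) are hypothetical.

```python
import math
import random

rng = random.Random(0)

# Unlabeled outputs: valid outputs with no paired inputs (assumed: integers 0..5).
unlabeled_outputs = [float(k) for k in range(6)] * 5

def make_denoiser(outputs):
    """Toy stand-in for a trained denoiser: project onto the nearest valid output."""
    valid = sorted(set(outputs))
    return lambda y: min(valid, key=lambda v: abs(v - y))

denoise = make_denoiser(unlabeled_outputs)

# Step 1 check: synthetic perturbations of unlabeled outputs map back to the originals.
perturbed = [(y + rng.uniform(-0.4, 0.4), y) for y in unlabeled_outputs]
denoiser_ok = all(denoise(noisy) == clean for noisy, clean in perturbed)

# A small labeled set whose target is a staircase: f(x) = nearest integer to x.
xs = [rng.uniform(0.0, 5.0) for _ in range(200)]
ys = [float(math.floor(x + 0.5)) for x in xs]

def fit_linear(xs, ys):
    """Ordinary least squares for y ~ a*x + b (a deliberately simple base predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

base = fit_linear(xs, ys)

def composed(x):
    """Step 2 (simplified): base predictor followed by the fixed denoiser."""
    return denoise(base(x))

def mean_abs_error(predict):
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)

base_mae = mean_abs_error(base)
composed_mae = mean_abs_error(composed)
print(f"base MAE: {base_mae:.3f}, composed MAE: {composed_mae:.3f}")
```

In this toy setting a straight line cannot represent the staircase on its own, but the same line followed by the denoiser gets close to it, which loosely mirrors the claim that composing with a denoiser lets a simpler predictor suffice when the valid output space is discrete.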
