Abstract
Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt OOD error; but (ii) using auxiliary information as outputs of auxiliary pre-training tasks improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.
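
As a rough illustration of the pipeline described above, the following is a minimal sketch in a linear least-squares setting. It is not the implementation used in this work. The function names (`fit_linear`, `in_n_out_linear`) and all argument names are hypothetical. The sketch further assumes that pre-training uses the unlabeled in-distribution and OOD examples together, that the pre-trained features are kept frozen during fine-tuning, and that the fine-tuning set combines the true labels with the pseudolabels.

```python
import numpy as np


def fit_linear(features, targets):
    """Return least-squares weights W minimizing ||features @ W - targets||."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def in_n_out_linear(x_lab, z_lab, y_lab, x_unl_id, z_unl_id, x_unl_ood, z_unl_ood):
    """Sketch of the In-N-Out steps with linear models (hypothetical helper).

    x_*: inputs, z_*: auxiliary information, y_lab: 1-D array of labels.
    Returns a function mapping inputs x to predicted targets.
    """
    # Step 1: auxiliary-inputs model, fit on labeled in-distribution data
    # with the auxiliary information appended to the input features.
    w_in = fit_linear(np.hstack([x_lab, z_lab]), y_lab)

    # Step 2: pseudolabel the unlabeled in-distribution inputs.
    y_pseudo = np.hstack([x_unl_id, z_unl_id]) @ w_in

    # Step 3a: pre-train by predicting the auxiliary outputs z from x.
    # Assumption: in-distribution and OOD unlabeled data are pooled here.
    x_unl = np.vstack([x_unl_id, x_unl_ood])
    z_unl = np.vstack([z_unl_id, z_unl_ood])
    w_feat = fit_linear(x_unl, z_unl)

    # Step 3b: fine-tune a head on top of the (frozen) pre-trained features.
    # Assumption: true labels and pseudolabels are used together.
    x_ft = np.vstack([x_lab, x_unl_id])
    y_ft = np.concatenate([y_lab, y_pseudo])
    w_head = fit_linear(x_ft @ w_feat, y_ft)

    def predict(x):
        # The final model never takes z as an input, only x.
        return x @ w_feat @ w_head

    return predict
```

Note that the final predictor takes only the input x: the auxiliary information enters through the pseudolabels and through the pre-training targets, never as a test-time feature.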