Abstract
Imitation learning holds the promise of equipping robots with versatile skills by learning from expert demonstrations. However, policies trained on finite datasets often struggle to generalize beyond the training distribution. In this work, we present a unified perspective on the generalization capability of imitation learning, grounded in both information theory and the properties of the data distribution. We first show that the generalization gap can be upper bounded by (i) the conditional information bottleneck on intermediate representations and (ii) the mutual information between the model parameters and the training dataset. This characterization provides theoretical guidance for designing effective training strategies in imitation learning, particularly in determining whether to freeze or fine-tune large pretrained encoders (e.g., vision-language models or vision foundation models), or to train encoders from scratch, to achieve better generalization. Furthermore, we demonstrate that high conditional entropy from input to output induces a flatter likelihood landscape, thereby reducing the upper bound on the generalization gap. In addition, high conditional entropy shortens the stochastic gradient descent (SGD) escape time from sharp local minima, which may increase the likelihood of reaching global optima under fixed optimization budgets. These insights explain why imitation learning often exhibits limited generalization and underscore the importance of not only scaling the diversity of input data but also enriching the variability of output labels conditioned on the same input.