Abstract
Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have "solved" PBE. We experiment on classic domains such as lists and strings, and an uncommon graphics programming domain not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.