Abstract
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed, such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to ``compile'' (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.