Abstract
Large Language Models (LLMs) often produce answers with a single chain-of-thought, which restricts their ability to explore reasoning paths or self-correct flawed outputs in complex tasks. In this paper, we introduce MALT (Multi-Agent LLM Training), a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps using a sequential pipeline of heterogeneous agents. During data generation, each agent is repeatedly sampled to form a multi-agent search tree, where final outputs are graded against ground-truth data. We then apply value iteration to propagate reward signals back to each role-conditioned model, automatically producing multi-agent post-training data without human or teacher-model supervision. Our off-policy approach allows each agent to specialize by learning from correct and incorrect trajectories, ultimately improving the end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40%, respectively, making it an important advance towards multi-agent cooperative training.
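
To make the reward-propagation idea concrete, the following is a minimal sketch of how leaf-level correctness could be backed up through a generator, verifier, and refiner search tree to label every intermediate output. It is an illustration under stated assumptions, not the implementation used in this work: the `Node` class, the mean-of-children backup rule, the 0.5 labeling threshold, and the toy tree are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One sampled output in the search tree (hypothetical structure)."""
    role: str                       # "generator", "verifier", or "refiner"
    text: str                       # the sampled output
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # set only on leaves: 1.0 if correct, else 0.0


def value(node: Node) -> float:
    """Back up leaf rewards: a leaf returns its graded reward, an inner node
    returns the mean value of its children (an assumed backup rule)."""
    if not node.children:
        return float(node.reward)
    return mean(value(child) for child in node.children)


def label_tree(node: Node, threshold: float = 0.5) -> List[Tuple[str, str, bool]]:
    """Label every node positive when its backed-up value exceeds the
    threshold (assumed), yielding per-role training examples."""
    labels = [(node.role, node.text, value(node) > threshold)]
    for child in node.children:
        labels.extend(label_tree(child, threshold))
    return labels


# Toy tree: one generator output, two verifier samples, two refiner samples each.
root = Node("generator", "g1", [
    Node("verifier", "v1", [
        Node("refiner", "r1", reward=1.0),
        Node("refiner", "r2", reward=1.0),
    ]),
    Node("verifier", "v2", [
        Node("refiner", "r3", reward=0.0),
        Node("refiner", "r4", reward=1.0),
    ]),
])

labels = label_tree(root)
for role, text, is_positive in labels:
    print(role, text, is_positive)
```

In this toy tree the first verifier sample leads only to correct refinements and is labeled positive, while the second leads to one correct and one incorrect refinement and falls at, not above, the assumed threshold. Both positive and negative labels are kept, which mirrors the idea of learning from correct and incorrect trajectories.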