Abstract
Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm
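
To make the contrast between next-token and multi-token supervision concrete, the following is a minimal sketch of how the training targets differ. It is an illustration only and does not reproduce CAFT's actual implementation: the function name `future_targets`, the parameter `n_future`, and the placeholder tokens are all assumptions introduced here.

```python
from typing import List, Tuple


def future_targets(tokens: List[str], n_future: int) -> List[Tuple[int, List[str]]]:
    """Pair each position i with the (up to) n_future tokens that follow it.

    With n_future=1 this is ordinary next-token supervision; with
    n_future>1 each position is also supervised on tokens further ahead.
    Hypothetical helper for illustration, not part of any library.
    """
    if n_future < 1:
        raise ValueError("n_future must be at least 1")
    pairs = []
    # The last position has no following token, so it gets no target.
    for i in range(len(tokens) - 1):
        pairs.append((i, tokens[i + 1 : i + 1 + n_future]))
    return pairs


# Placeholder fragments standing in for the sub-word tokens of one phrase.
tokens = ["t0", "t1", "t2", "t3"]

next_token = future_targets(tokens, n_future=1)
multi_token = future_targets(tokens, n_future=3)

print(next_token)
print(multi_token)
```

In the next-token case each position sees a single fragment as its target, whereas in the multi-token case the first position is supervised on the whole remaining span, which is the intuition behind learning a phrase as one unit.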