TreeDiff: AST-Guided Code Generation with Diffusion LLMs

Abstract

Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model's ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
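
To make the contrast with random token masking concrete, the following is a minimal sketch of AST-subtree span corruption. It is not the paper's implementation: the abstract does not specify the tokenizer, mask token, or sampling scheme, so the `<mask>` string, the helper names `subtree_spans` and `mask_subtree`, the uniform choice over subtrees, and the use of character offsets in place of model tokens are all illustrative assumptions. It relies only on Python's standard `ast` module (Python 3.8+ for end positions) and assumes ASCII source, since `col_offset` is reported in UTF-8 bytes.

```python
import ast
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def subtree_spans(source):
    """Return sorted (start, end) character offsets of AST nodes that carry positions.

    Assumes ASCII source so that the byte-based col_offset equals a character index.
    """
    # Flat offset at which each line begins, used to map (line, column) to an index.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    spans = set()
    for node in ast.walk(ast.parse(source)):
        # Nodes such as Module or Load have no source position; skip them.
        if getattr(node, "lineno", None) is None or getattr(node, "end_lineno", None) is None:
            continue
        start = line_starts[node.lineno - 1] + node.col_offset
        end = line_starts[node.end_lineno - 1] + node.end_col_offset
        spans.add((start, end))
    return sorted(spans)


def mask_subtree(source, rng):
    """Replace one uniformly chosen AST subtree span with MASK.

    Returns the corrupted source and the original text of the masked span.
    """
    start, end = rng.choice(subtree_spans(source))
    return source[:start] + MASK + source[end:], source[start:end]


if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n"
    corrupted, target = mask_subtree(src, random.Random(0))
    print(corrupted)
    print(repr(target))
```

Because every masked span coincides with a complete subtree, the corrupted region always begins and ends on a grammatical boundary, whereas independent per-token masking can cut through the middle of an expression. A training setup would presumably apply this at the token level and vary how much is masked with the diffusion timestep; both are left out of this sketch.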
