Abstract
In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant that directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of the standard DiT is comparable to that of these specialized models, while demonstrating superior parameter efficiency, especially when scaled up. Leveraging a layer-wise parameter sharing strategy, we achieve a further 66% reduction in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.