No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Abstract

Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches either introduce an additional, complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple and straightforward method that obtains representation guidance in a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in an earlier layer with higher noise to that in a later layer with lower noise, progressively enhancing the overall representation learning using only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that depend heavily on powerful external representation priors.
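
As a rough illustration of the alignment idea described above, the sketch below computes a loss between per-token features taken from an earlier layer on a noisier input (the "student") and per-token features taken from a later layer on a less noisy input (the "teacher"). The abstract does not specify the distance function or how the target is produced, so the use of cosine distance, the function names `cosine_similarity` and `alignment_loss`, and the toy feature values are all assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)

def alignment_loss(student_tokens, teacher_tokens):
    """Mean cosine distance (1 - similarity) over tokens.

    Assumption: student_tokens come from an earlier layer on a
    higher-noise input, teacher_tokens from a later layer on a
    lower-noise input, and the teacher is treated as a fixed target.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same number of tokens")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    )
    return total / len(student_tokens)

# Toy example with two tokens and two feature dimensions (hypothetical values).
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
loss = alignment_loss(student, teacher)
```

In a training loop, a term like this would presumably be added to the usual generative objective with some weighting; that weighting, like the rest of this sketch, is not taken from the abstract.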
