DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition

Abstract

Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods -- whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) -- struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
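
The abstract mentions difficulty-aware importance sampling but does not specify how it is computed. As a rough illustration only, one common way to realize such a scheme is to draw training samples with probability proportional to a softmax over their current reconstruction losses, so that harder samples are revisited more often. The sketch below assumes exactly that; the function names `difficulty_weights` and `sample_batch`, the `temperature` parameter, and the softmax weighting itself are hypothetical and are not taken from the paper.

```python
import math
import random


def difficulty_weights(losses, temperature=1.0):
    """Turn per-sample losses into sampling probabilities (assumed scheme).

    A softmax over the losses: a higher loss (a "harder" sample) receives
    a higher probability. This weighting is an assumption, not DRC's rule.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def sample_batch(losses, batch_size, temperature=1.0, rng=None):
    """Draw sample indices (with replacement) according to difficulty."""
    rng = rng or random.Random()
    probs = difficulty_weights(losses, temperature)
    return rng.choices(range(len(losses)), weights=probs, k=batch_size)


if __name__ == "__main__":
    # Hypothetical reconstruction losses for four training samples.
    reconstruction_losses = [0.1, 0.5, 2.0, 0.3]
    print(difficulty_weights(reconstruction_losses))
    print(sample_batch(reconstruction_losses, batch_size=8, rng=random.Random(0)))
```

In this sketch a larger `temperature` flattens the distribution toward uniform sampling, while a smaller one concentrates sampling on the highest-loss samples.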
