Abstract
This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture), helping the object blend favorably with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. The project page is https://damo-vilab.github.io/AnyDoor-Page/.