Abstract
We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. The core idea of our work is to leverage the rich priors in well-trained video models to extract the common motion patterns and then embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions represented by a structured and compact motion representation, i.e., neural key point trajectories. The canonical 3D Gaussians are then driven by these key points and fused to model the geometry and appearance. At inference time, with the learned latent space, we can instantly sample diverse 3D motions in a single forward pass and support several interesting applications, including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
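To make the idea of 3D motion interpolation in a shared latent space concrete, the following is a minimal sketch, not the DIMO implementation. It assumes that motions are encoded as fixed-size latent vectors and that a decoder maps each vector to key point trajectories of shape (frames, key points, 3). The names `interpolate_latents` and `toy_decoder`, the linear decoder, and all dimensions are hypothetical choices made for illustration only.

```python
import numpy as np


def interpolate_latents(z_a, z_b, num_steps):
    """Linearly blend two latent motion codes, endpoints included."""
    weights = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - weights) * z_a + weights * z_b


def toy_decoder(z, weight, num_frames, num_keypoints):
    """Hypothetical stand-in for a learned motion decoder.

    A single linear map from a latent code to key point trajectories
    of shape (num_frames, num_keypoints, 3). A real decoder would be
    a trained neural network.
    """
    return (z @ weight).reshape(num_frames, num_keypoints, 3)


# Assumed toy dimensions, chosen only to keep the example small.
rng = np.random.default_rng(0)
latent_dim, num_frames, num_keypoints = 8, 4, 5
weight = rng.normal(size=(latent_dim, num_frames * num_keypoints * 3))

# Two latent codes standing in for two embedded motions.
z_a = rng.normal(size=latent_dim)
z_b = rng.normal(size=latent_dim)

# Five codes along the straight line from z_a to z_b, each decoded
# into its own set of key point trajectories.
codes = interpolate_latents(z_a, z_b, num_steps=5)
trajectories = np.stack(
    [toy_decoder(z, weight, num_frames, num_keypoints) for z in codes]
)
print(trajectories.shape)  # (5, 4, 5, 3)
```

Because this toy decoder is linear, the decoded trajectories also blend linearly between the two endpoint motions. With a learned nonlinear decoder, intermediate codes would instead decode to whatever motions the latent space has learned to place between them.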