Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advances in text-to-video synthesis have unveiled the potential to achieve this with prompts alone. While text is convenient for conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation by utilizing text as the context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then extended to video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves performance by transferring the rich concepts available solely in image datasets into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage.