Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, the application of diffusion models to motion generation has improved the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model that learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation.
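
As an illustration only, the two-stage recipe described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the model itself: the names `FrameLayer`, `TemporalLayer`, and `Block` are hypothetical, the arithmetic stands in for learned network layers, and initialising the new temporal layer as an identity map is an assumption borrowed from common text-to-video practice rather than something the abstract states. The sketch treats a static pose as a one-frame motion, so a per-frame layer trained in the first stage can be reused unchanged when a temporal layer is added in the second stage.

```python
from typing import List, Optional

Pose = List[float]    # one frame: a flat vector of pose parameters (toy stand-in)
Motion = List[Pose]   # a sequence of frames; a static pose is a 1-frame motion


class FrameLayer:
    """Hypothetical per-frame layer: processes every frame independently."""

    def __init__(self, scale: float, bias: float) -> None:
        self.scale = scale
        self.bias = bias

    def __call__(self, motion: Motion) -> Motion:
        return [[self.scale * x + self.bias for x in frame] for frame in motion]


class TemporalLayer:
    """Hypothetical residual layer that mixes each frame with its neighbours.

    With weight == 0.0 (assumed initialisation) it is an identity map.
    """

    def __init__(self, weight: float = 0.0) -> None:
        self.weight = weight

    def __call__(self, motion: Motion) -> Motion:
        last = len(motion) - 1
        out: Motion = []
        for t, frame in enumerate(motion):
            prev_f = motion[max(t - 1, 0)]      # clamp at the first frame
            next_f = motion[min(t + 1, last)]   # clamp at the last frame
            out.append([
                x + self.weight * (0.5 * (p + q) - x)
                for x, p, q in zip(frame, prev_f, next_f)
            ])
        return out


class Block:
    """Per-frame layer, optionally followed by a temporal layer."""

    def __init__(self, frame_layer: FrameLayer,
                 temporal_layer: Optional[TemporalLayer] = None) -> None:
        self.frame_layer = frame_layer
        self.temporal_layer = temporal_layer

    def __call__(self, motion: Motion) -> Motion:
        hidden = self.frame_layer(motion)
        if self.temporal_layer is not None:
            hidden = self.temporal_layer(hidden)
        return hidden


# Stage 1 (static poses): only the per-frame layer exists.
block = Block(FrameLayer(scale=2.0, bias=1.0))
static_pose: Motion = [[0.0, 1.0, 2.0]]
stage1_out = block(static_pose)

# Stage 2 (motion data): add a temporal layer, initialised as identity,
# so the stage-1 behaviour is preserved before fine-tuning begins.
block.temporal_layer = TemporalLayer(weight=0.0)
stage2_out = block(static_pose)

# After (hypothetical) fine-tuning the temporal weight is non-zero and
# neighbouring frames influence each other.
mixer = Block(FrameLayer(scale=1.0, bias=0.0), TemporalLayer(weight=1.0))
mixed = mixer([[0.0], [2.0], [0.0]])
```

The point of the identity initialisation in this sketch is that the second stage starts from exactly the function learned in the first stage; whether Make-An-Animation does this is not stated in the abstract.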