Abstract
Recent advancements in Multimodal Large Language Models (MM-LLMs) have demonstrated promising potential in terms of generalization and robustness when applied to different modalities. While previous works have already achieved 3D human motion generation using various approaches, including language modeling, they mostly use specialized architectures and are restricted to single-human motion generation. Inspired by the success of MM-LLMs, we propose MotionLLM, a simple and general framework that achieves single-human motion generation, multi-human motion generation, and motion captioning by fine-tuning pre-trained LLMs. Specifically, we encode and quantize motions into discrete, LLM-understandable tokens, which results in a unified vocabulary consisting of both motion and text tokens. With only 1--3% of the LLM's parameters trained using adapters, our single-human motion generation achieves results comparable to those of diffusion models and other transformer-based models trained from scratch. Additionally, we show that our approach is scalable and flexible, allowing easy extension to multi-human motion generation through autoregressive generation of single-human motions. Project page: https://knoxzhao.github.io/MotionLLM