Abstract
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which are expensive to collect and therefore pose practical challenges. To mitigate this challenge, we are the first to propose a \textbf{T}raining-free \textbf{S}cene-aware \textbf{T}ext-to-\textbf{Motion} framework, dubbed \textbf{TSTMotion}, that efficiently empowers pre-trained blank-background motion generators with scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we use foundation models in combination to reason about, predict, and validate scene-aware motion guidance. The motion guidance is then incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the \href{https://tstmotion.github.io/}{Project Page}.