Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

Abstract

Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey See, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training. Our webpage is at https://monkeyseedocg.github.io.
