FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

Abstract

Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at https://shiyi-zh0408.github.io/projectpages/FlexiAct/
