Abstract
This paper aims to model 3D human motion across domains, where a single model is expected to handle multiple modalities, tasks, and datasets. Existing cross-domain models often rely on domain-specific components and multi-stage training, which limits their practicality and scalability. To overcome these challenges, we propose a new setting to train a unified cross-domain model through a single process, eliminating the need for domain-specific components and multi-stage training. We first introduce Pose-in-Context (PiC), which leverages in-context learning to create a pose-centric cross-domain model. While PiC generalizes across multiple pose-based tasks and datasets, it encounters difficulties with modality diversity, prompting strategy, and contextual dependency handling. We thus propose Human-in-Context (HiC), an extension of PiC that broadens generalization across modalities, tasks, and datasets. HiC combines pose and mesh representations within a unified framework, expands task coverage, and incorporates larger-scale datasets. Additionally, HiC introduces a max-min similarity prompt sampling strategy to enhance generalization across diverse domains and a network architecture with dual-branch context injection for improved handling of contextual dependencies. Extensive experimental results show that HiC outperforms PiC in generalization, data scale, and performance across a wide range of domains. These results demonstrate the potential of HiC for building a unified cross-domain 3D human motion model with improved flexibility and scalability. The source code and models are available at https://github.com/BradleyWang0416/Human-in-Context.