V-VIPE: Variational View Invariant Pose Embedding

Abstract

Learning to represent three-dimensional (3D) human pose given a two-dimensional (2D) image of a person is a challenging problem. To make the problem less ambiguous, it has become common practice to estimate 3D pose in the camera coordinate space. However, this makes the task of comparing two 3D poses difficult. In this paper, we address this challenge by separating the problem of estimating 3D pose from 2D images into two steps. We use a variational autoencoder (VAE) to find an embedding that represents 3D poses in canonical coordinate space. We refer to this embedding as the variational view-invariant pose embedding (V-VIPE). Using V-VIPE, we can encode 2D and 3D poses and use the embedding for downstream tasks such as retrieval and classification. We can estimate 3D poses from these embeddings using the decoder, as well as generate unseen 3D poses. The variability of our encoding allows it to generalize well to unseen camera views when mapping from 2D space. To the best of our knowledge, V-VIPE is the only representation to offer this diversity of applications. Code and more information can be found at https://v-vipe.github.io/.
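
To illustrate why poses in camera coordinate space are hard to compare, the following minimal sketch views one pose from two cameras and measures the mean per-joint distance before and after moving both into a shared canonical frame. The canonicalization used here (center on the mid-hip, then rotate about the vertical axis so the hip vector points along +x) is an assumption made for illustration and is not necessarily the procedure used in the paper; the function names, joint indices, and toy pose are likewise hypothetical.

```python
import math

def rotate_y(pose, angle):
    """Rotate every joint (x, y, z) about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pose]

def translate(pose, offset):
    """Shift every joint by a fixed offset (a stand-in for camera placement)."""
    ox, oy, oz = offset
    return [(x + ox, y + oy, z + oz) for x, y, z in pose]

def canonicalize(pose, left_hip=0, right_hip=1):
    """Hypothetical canonical frame: mid-hip at the origin, hips along +x."""
    lx, ly, lz = pose[left_hip]
    rx, ry, rz = pose[right_hip]
    mid = ((lx + rx) / 2, (ly + ry) / 2, (lz + rz) / 2)
    centered = translate(pose, (-mid[0], -mid[1], -mid[2]))
    # Angle that zeroes the z component of the left-to-right hip vector.
    angle = math.atan2(rz - lz, rx - lx)
    return rotate_y(centered, angle)

def mean_joint_distance(a, b):
    """Average Euclidean distance between corresponding joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

# Toy 4-joint pose: left hip, right hip, and two arbitrary joints.
pose = [(-0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.5, 0.05), (0.2, 0.3, 0.1)]
# The same pose as seen by a second camera (rotated and shifted).
other_view = translate(rotate_y(pose, 1.0), (0.0, 0.0, 3.0))

camera_space_gap = mean_joint_distance(pose, other_view)
canonical_gap = mean_joint_distance(canonicalize(pose), canonicalize(other_view))
```

In this sketch `camera_space_gap` is large even though both inputs are the same pose, while `canonical_gap` is zero up to floating-point error, which is the property that makes a canonical-space representation convenient for comparison and retrieval.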
