While deep learning has reshaped the classical motion capture pipeline, generative, analysis-by-synthesis elements are still in use to recover fine details if a high-quality 3D model of the user is available. Unfortunately, obtaining such a model for every user a priori is challenging, time-consuming, and limits the application scenarios. We propose a novel test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner. To this end, our approach combines the advantages of neural radiance fields with an articulated skeleton representation. Our proposed skeleton embedding serves as a common reference that links constraints across time, thereby reducing the number of required camera views from the traditional dozens of calibrated cameras down to a single uncalibrated one. As a starting point, we employ the output of an off-the-shelf model that predicts the 3D skeleton pose. The volumetric body shape and appearance are then learned from scratch, while jointly refining the initial pose estimate. Our approach is self-supervised and does not require any additional ground truth labels for appearance, pose, or 3D shape. We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches and generalizes well to multiple views.