Deep learning has greatly improved the realism of animatable human models by learning geometry and appearance from collections of 3D scans, template meshes, and multi-view imagery. High-resolution models enable photo-realistic avatars, but at the cost of requiring studio settings not available to end users. Our goal is to create avatars directly from raw images without relying on expensive studio setups and surface tracking. While a few such approaches exist, they have limited generalization capabilities and are prone to learning spurious (chance) correlations between irrelevant body parts, resulting in implausible deformations and missing body parts on unseen poses. We introduce a three-stage method that induces two inductive biases to better disentangle pose-dependent deformation. First, we model correlations of body parts explicitly with a graph neural network. Second, to further reduce the effect of chance correlations, we introduce localized per-bone features that use a factorized volumetric representation and a new aggregation function. We demonstrate that our model produces realistic body shapes under challenging unseen poses and shows high-quality image synthesis. Our proposed representation strikes a better trade-off between model capacity, expressiveness, and robustness than competing methods. Project website: https://lemonatsu.github.io/danbo.