Today's Mixed Reality head-mounted displays track the user's head pose in world space as well as the user's hands for interaction in both Augmented Reality and Virtual Reality scenarios. While this is adequate to support user input, it unfortunately limits users' virtual representations to just their upper bodies. Current systems thus resort to floating avatars, whose limitation is particularly evident in collaborative settings. To estimate full-body poses from the sparse input sources, prior work has incorporated additional trackers and sensors at the pelvis or lower body, which increases setup complexity and limits practical application in mobile settings. In this paper, we present AvatarPoser, the first learning-based method that predicts full-body poses in world coordinates using only motion input from the user's head and hands. Our method builds on a Transformer encoder to extract deep features from the input signals and decouples global motion from the learned local joint orientations to guide pose estimation. To obtain accurate full-body motions that resemble motion capture animations, we refine the arm joints' positions using an optimization routine with inverse kinematics to match the original tracking input. In our evaluation on large motion capture datasets (AMASS), AvatarPoser achieved new state-of-the-art results. At the same time, our method's inference speed supports real-time operation, providing a practical interface to support holistic avatar control and representation for Metaverse applications.
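The abstract does not spell out how global motion is decoupled from the local joint orientations, so the following is only an illustrative sketch under stated assumptions, not the method's actual implementation. It assumes that a network has already predicted the root orientation and the local rotations along a root-to-head joint chain, and that the rest-pose bone offsets are known. Under those assumptions, the root's world position need not be predicted at all: it can be recovered by running forward kinematics along the chain and subtracting the result from the tracked head position. The function name `root_position_from_head` and all parameters are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def mat_vec(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def root_position_from_head(head_pos, root_rot, local_rots, offsets):
    """Recover the root's world position from the tracked head position.

    Hypothetical helper (an assumption, not taken from the paper).
    head_pos:   tracked head position in world space, [x, y, z]
    root_rot:   world orientation of the root joint (3x3)
    local_rots: local rotation (3x3) of each joint after the root,
                ordered along the root-to-head chain
    offsets:    rest-pose offset of each of those joints, expressed in
                its parent's frame
    """
    world_rot = root_rot            # world orientation of the current parent
    head_rel = [0.0, 0.0, 0.0]      # head position relative to the root
    for local_rot, offset in zip(local_rots, offsets):
        # The child sits at the parent's position plus the rotated offset.
        step = mat_vec(world_rot, offset)
        head_rel = [h + s for h, s in zip(head_rel, step)]
        # Accumulate the child's world orientation for the next bone.
        world_rot = mat_mul(world_rot, local_rot)
    # Place the root so that the end of the chain lands on the tracked head.
    return [p - r for p, r in zip(head_pos, head_rel)]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Upright two-bone chain (root -> spine -> head) with made-up bone lengths.
upright_root = root_position_from_head(
    head_pos=[1.0, 1.6, 2.0],
    root_rot=IDENTITY,
    local_rots=[IDENTITY, IDENTITY],
    offsets=[[0.0, 0.5, 0.0], [0.0, 0.3, 0.0]],
)

# Single bone with the root rotated a quarter turn about z.
tilted_root = root_position_from_head(
    head_pos=[0.0, 0.0, 0.0],
    root_rot=rot_z(math.pi / 2),
    local_rots=[IDENTITY],
    offsets=[[0.0, 1.0, 0.0]],
)
```

In the upright case the root ends up directly below the head by the summed bone lengths; in the tilted case the same bone points sideways, so the root is displaced horizontally instead. The bone lengths and poses here are invented purely for illustration.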