It is common practice to represent spoken languages at their phonetic level. However, for sign languages, this implies breaking motion into its constituent motion primitives. Avatar-based Sign Language Production (SLP) has traditionally done just this, building up animation from sequences of hand motions, shapes and facial expressions. However, more recent deep-learning-based solutions to SLP have tackled the problem using a single network that estimates the full skeletal structure. We propose splitting the SLP task into two distinct, jointly trained sub-tasks. The first, the translation sub-task, translates from spoken language to a latent sign language representation, with gloss supervision. Subsequently, the animation sub-task aims to produce expressive sign language sequences that closely resemble the learnt spatio-temporal representation. Using a progressive transformer for the translation sub-task, we propose a novel Mixture of Motion Primitives (MoMP) architecture for sign language animation. A set of distinct motion primitives is learnt during training and can be temporally combined at inference to animate continuous sign language sequences. We evaluate on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, presenting extensive ablation studies and showing that MoMP outperforms baselines in user evaluations. We achieve state-of-the-art back translation performance with an 11% improvement over competing results. Importantly, and for the first time, we showcase stronger performance for a full translation pipeline going from spoken language to sign than from gloss to sign.
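
As a rough illustration of what "temporally combining" a set of motion primitives could look like, the sketch below blends the poses proposed by several primitives for each frame, using softmax weights over per-frame gating scores. This is a generic mixture-of-experts style combination, not the MoMP architecture itself: the abstract does not say how the primitives are gated or combined, and the names `softmax`, `combine_primitives` and `animate`, as well as the flat list-of-floats pose format, are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gating scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_primitives(
    primitive_poses: Sequence[Sequence[float]],
    gating_scores: Sequence[float],
) -> List[float]:
    """Blend the poses proposed by each primitive for a single frame.

    primitive_poses[k] is the pose vector proposed by primitive k
    (hypothetical format: a flat list of joint coordinates).
    gating_scores[k] is an unnormalised score for primitive k.
    """
    weights = softmax(gating_scores)
    pose_dim = len(primitive_poses[0])
    return [
        sum(w * pose[d] for w, pose in zip(weights, primitive_poses))
        for d in range(pose_dim)
    ]


def animate(
    poses_per_frame: Sequence[Sequence[Sequence[float]]],
    scores_per_frame: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Apply the per-frame blend along the time axis."""
    return [
        combine_primitives(poses, scores)
        for poses, scores in zip(poses_per_frame, scores_per_frame)
    ]
```

With equal gating scores the blend reduces to a plain average of the primitives' poses; as one score dominates, the output approaches that primitive's pose alone. In a learnt system the scores would come from a network conditioned on the latent representation, which this sketch leaves out.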