In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches learn the data distribution directly from photographs, and the state of the art can now yield highly photo-realistic images. While many works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable according to semantic parameters regarding pose, identity, expression, and lighting. The generative network uses a neural scene representation to model portraits in 3D, whose generation is guided by a parametric face model that supports explicit control. While the latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency remains in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented by the jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expressions under natural lighting when viewed from free viewpoints. The proposed method also demonstrates generalization ability to real images as well as out-of-domain cartoon faces, showing great promise in real applications. Additional video results and code will be available on the project webpage.
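
To make the volume blending idea concrete, the following is a minimal sketch along a single ray, not the paper's implementation. It assumes that each field returns a (density, RGB) pair per sample, that the semantic field yields a soft face weight in [0, 1] per sample, and that the two fields are mixed linearly before standard emission-absorption volume rendering. The abstract does not state the blending formula, so the linear mix and the names `blend_samples`, `render_ray`, and `face_mask` are hypothetical.

```python
import math


def blend_samples(dynamic, static, face_mask):
    """Blend per-sample (density, rgb) pairs from two fields along one ray.

    face_mask[i] in [0, 1] is an assumed semantic weight: 1 takes the sample
    from the dynamic field, 0 takes it from the static field.
    """
    blended = []
    for (sigma_d, rgb_d), (sigma_s, rgb_s), m in zip(dynamic, static, face_mask):
        # Assumed linear mix of densities and colors by the semantic weight.
        sigma = m * sigma_d + (1.0 - m) * sigma_s
        rgb = tuple(m * a + (1.0 - m) * b for a, b in zip(rgb_d, rgb_s))
        blended.append((sigma, rgb))
    return blended


def render_ray(samples, step):
    """Emission-absorption quadrature over samples with a uniform step size."""
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb in samples:
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this segment
        weight = transmittance * alpha         # contribution to the pixel
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha           # light remaining after segment
    return tuple(pixel)


# Two samples on one ray: a red dynamic field and a blue static field.
dynamic = [(2.0, (1.0, 0.0, 0.0)), (2.0, (1.0, 0.0, 0.0))]
static = [(1.0, (0.0, 0.0, 1.0)), (3.0, (0.0, 0.0, 1.0))]

# First sample is labeled face (dynamic), second is labeled non-face (static).
pixel = render_ray(blend_samples(dynamic, static, [1.0, 0.0]), step=0.5)
```

Under these assumptions, samples labeled as non-face always read from the static field, so changing the expression (which would only alter the dynamic field) leaves their contribution to the composite unchanged.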