In this paper, we propose a multi-stage and high-resolution model for imagesynthesis that uses fine-grained attributes and masks as input. With afine-grained attribute, the proposed model can detailedly constrain thefeatures of the generated image through rich and fine-grained semanticinformation in the attribute. With mask as prior, the model in this paper isconstrained so that the generated images conform to visual senses, which willreduce the unexpected diversity of samples generated from the generativeadversarial network. This paper also proposes a scheme to improve thediscriminator of the generative adversarial network by simultaneouslydiscriminating the total image and sub-regions of the image. In addition, wepropose a method for optimizing the labeled attribute in datasets, whichreduces the manual labeling noise. Extensive quantitative results show that ourimage synthesis model generates more realistic images.