Pose-guided person image generation aims to generate a photo-realistic person image conditioned on an input person image and a desired pose. This task requires spatial manipulation of the source image according to the target pose. However, the generative adversarial networks (GANs) widely used for image generation and translation rely on spatially local and translation-equivariant operators, i.e., convolution, pooling, and unpooling, which cannot handle large image deformation. This paper introduces a novel two-stream appearance transfer network (2s-ATN) to address this challenge. It is a multi-stage architecture consisting of a source stream and a target stream. Each stage features an appearance transfer module and several two-stream feature fusion modules. The former finds the dense correspondence between the two-stream feature maps and then transfers the appearance information from the source stream to the target stream. The latter exchange local information between the two streams and supplement the non-local appearance transfer. Both quantitative and qualitative results indicate that the proposed 2s-ATN can effectively handle large spatial deformation and occlusion while retaining appearance details. It outperforms prior state-of-the-art methods on two widely used benchmarks.
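To make the idea of non-local appearance transfer concrete, the following is a minimal sketch and not the 2s-ATN module itself. The abstract does not specify how the dense correspondence is computed, so the sketch assumes a scaled dot-product similarity followed by a softmax, as in standard attention. The function name `appearance_transfer`, the `(C, H, W)` feature layout, and the scaling factor are all illustrative assumptions.

```python
import numpy as np

def appearance_transfer(source_feat, target_feat):
    """Hypothetical non-local transfer between two (C, H, W) feature maps.

    Assumption: correspondence is a softmax over dot-product similarity.
    The actual 2s-ATN module may differ.
    """
    c, h, w = source_feat.shape
    src = source_feat.reshape(c, h * w)  # (C, N) source locations
    tgt = target_feat.reshape(c, h * w)  # (C, N) target locations

    # Similarity between every target location and every source location.
    scores = tgt.T @ src / np.sqrt(c)    # (N_target, N_source)

    # Numerically stable softmax over source locations.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each target location receives a weighted mix of source features,
    # regardless of how far apart the two locations are spatially.
    transferred = src @ weights.T        # (C, N_target)
    return transferred.reshape(c, h, w), weights
```

Because every target location can draw from any source location, such an operator is not limited by the receptive field of a convolution, which is the property the abstract relies on for handling large deformation. In the described architecture, the local two-stream fusion modules would complement this step.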