High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations by using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN.
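
To make the idea of filling all frames simultaneously by self-attention more concrete, the following is a minimal sketch and not the STTN implementation. It assumes each frame has already been split into patches with one feature vector per patch, uses the raw features as queries, keys, and values (no learned projections, a single head, a single patch scale), and restricts keys to visible patches. The function names `softmax` and `joint_attention` and the `visible` mask layout are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(frames, visible):
    """Scaled dot-product attention over patches pooled from ALL frames.

    frames:  T frames, each a list of N patch feature vectors (lists of floats).
    visible: T x N booleans; False marks a patch inside the missing region.
    Returns a T x N list of output vectors. Every patch (missing or not)
    is rebuilt as a weighted sum of the visible patches from every frame.
    Assumption: at least one patch is visible.
    """
    # Flatten the time and space axes into one token sequence, so a patch
    # can attend to patches in its own frame and in all other frames.
    tokens = [patch for frame in frames for patch in frame]
    mask = [flag for row in visible for flag in row]

    dim = len(tokens[0])
    scale = math.sqrt(dim)

    # Only visible patches may serve as keys/values (illustrative choice).
    keys = [tok for tok, ok in zip(tokens, mask) if ok]

    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
        )

    # Restore the T x N layout.
    n = len(frames[0])
    return [outputs[t * n:(t + 1) * n] for t in range(len(frames))]
```

Calling `joint_attention(frames, visible)` with two or more frames returns completed features for every frame in one pass, rather than completing one target frame at a time from a set of reference frames. In this sketch a missing patch with an all-zero feature simply receives the average of the visible patches; a trained model would instead produce informative queries through its encoder and learned projections.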