Beyond the existing single-person and multiple-person human parsing tasks in static images, this paper makes the first attempt to investigate a more realistic task, video instance-level human parsing, which simultaneously segments out each person instance and parses each instance into more fine-grained parts (e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding Network (ATEN) that alternately performs temporal encoding among key frames and flow-guided feature propagation to the consecutive frames lying between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, which integrates both global human parsing and instance-level human segmentation into a unified model. To balance accuracy and efficiency, flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. In addition, ATEN leverages convolutional gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate frame-level instance-level parsing. By alternating between direct feature propagation across consistent frames and temporal encoding among key frames, ATEN achieves a good balance between frame-level accuracy and time efficiency, a common and crucial trade-off in video object segmentation research. To demonstrate the superiority of ATEN, extensive experiments are conducted on the most popular video segmentation benchmark (DAVIS) and a newly collected Video Instance-level Parsing (VIP) dataset, which is the first video instance-level human parsing dataset, comprising 404 sequences and over 20k frames with instance-level and pixel-wise annotations.
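
The alternation between key-frame encoding and propagation described above can be illustrated with a minimal sketch. It is not the ATEN implementation: the fixed `key_interval`, the function names `plan_frames` and `run_alternating`, and the three callables `parse_key`, `encode`, and `propagate` are hypothetical placeholders standing in for the Parsing-RCNN, the convGRU update, and the flow-guided warping respectively, and propagation here uses only the preceding key frame as a simplification.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def plan_frames(num_frames: int, key_interval: int) -> List[Dict[str, Optional[int]]]:
    """Label every frame as a key frame or as a frame to be filled by propagation.

    A fixed key_interval is an assumption made for illustration; the abstract
    does not say how key frames are selected.
    """
    if num_frames < 0 or key_interval < 1:
        raise ValueError("num_frames must be >= 0 and key_interval >= 1")
    plan = []
    for t in range(num_frames):
        prev_key = (t // key_interval) * key_interval
        next_key = prev_key + key_interval
        plan.append({
            "frame": t,
            "is_key": t == prev_key,
            "prev_key": prev_key,
            # The last segment may have no closing key frame.
            "next_key": next_key if next_key < num_frames else None,
        })
    return plan


def run_alternating(
    frames: Sequence[Any],
    key_interval: int,
    parse_key: Callable[[Any], Any],
    encode: Callable[[Any, Any], Any],
    propagate: Callable[[Any, Any, Any], Any],
) -> List[Any]:
    """Alternate between temporal encoding on key frames and propagation elsewhere.

    parse_key(frame)              -> per-frame features (placeholder for the parser)
    encode(state, features)       -> new recurrent state (placeholder for convGRU)
    propagate(feat, key, frame)   -> features for a non-key frame (placeholder for
                                     flow-guided warping from the key frame)
    """
    state = None                      # recurrent state carried across key frames
    key_features: Dict[int, Any] = {}
    outputs = []
    for step in plan_frames(len(frames), key_interval):
        t = step["frame"]
        if step["is_key"]:
            # Expensive path: parse the key frame and update the temporal state.
            state = encode(state, parse_key(frames[t]))
            key_features[t] = state
            outputs.append(state)
        else:
            # Cheap path: reuse the features of the preceding key frame.
            k = step["prev_key"]
            outputs.append(propagate(key_features[k], frames[k], frames[t]))
    return outputs
```

In this sketch the expensive parser runs only once every `key_interval` frames, which is where the accuracy and efficiency trade-off mentioned in the abstract would be controlled; a larger interval means fewer full parses and more propagated frames.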