STRIVE: Scene Text Replacement In Videos

Abstract

We propose replacing scene text in videos using deep style transfer and learned photometric transformations. Building on recent progress in still-image text replacement, we present extensions that alter text while preserving the appearance and motion characteristics of the original video. Compared to the problem of still-image text replacement, our method addresses additional challenges introduced by video, namely effects induced by changing lighting, motion blur, diverse variations in camera-object pose over time, and preservation of temporal consistency. We parse the problem into three steps. First, the text in all frames is normalized to a frontal pose using a spatio-temporal transformer network. Second, the text is replaced in a single reference frame using a state-of-the-art still-image text replacement method. Finally, the new text is transferred from the reference to the remaining frames using a novel learned image transformation network that captures lighting and blur effects in a temporally consistent manner. Results on synthetic and challenging real videos show realistic text transfer, competitive quantitative and qualitative performance, and superior inference speed relative to alternatives. We introduce new synthetic and real-world datasets with paired text objects. To the best of our knowledge, this is the first attempt at deep video text replacement.
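
The three steps described in the abstract can be read as a simple control flow. The sketch below is only an illustration of that flow, not the authors' implementation: the function names, the argument order of `transfer`, the representation of a frame as a flat list of pixel intensities, and the toy stand-ins at the bottom are all assumptions made for this example. The real method uses learned networks for each step, and the sketch stops at the frontalized text crops.

```python
from typing import Callable, List, Sequence

# Assumption: a "frame" here is a flat list of pixel intensities in [0, 1]
# for the cropped text region. Real frames would be image tensors.
Frame = List[float]


def replace_text_in_video(
    frames: Sequence[Frame],
    frontalize: Callable[[Sequence[Frame]], List[Frame]],
    replace_in_still: Callable[[Frame, str], Frame],
    transfer: Callable[[Frame, Frame, Frame], Frame],
    new_text: str,
    ref_index: int = 0,
) -> List[Frame]:
    """Hypothetical driver for the three-step decomposition."""
    # Step 1: normalize the text region of every frame to a frontal pose.
    frontal = frontalize(frames)

    # Step 2: replace the text in a single reference frame.
    ref_src = frontal[ref_index]
    ref_new = replace_in_still(ref_src, new_text)

    # Step 3: carry the new text from the reference to every other frame,
    # adapting it to that frame's photometric conditions.
    output = []
    for i, target_src in enumerate(frontal):
        if i == ref_index:
            output.append(ref_new)
        else:
            output.append(transfer(ref_src, target_src, ref_new))
    return output


# Toy stand-ins (NOT the learned networks from the paper).
def identity_frontalize(frames: Sequence[Frame]) -> List[Frame]:
    # Pretend the text is already frontal.
    return [list(f) for f in frames]


def toy_replace(frame: Frame, text: str) -> Frame:
    # Pretend the edited crop is a flat patch of intensity 0.8.
    return [0.8 for _ in frame]


def gain_transfer(ref_src: Frame, target_src: Frame, ref_new: Frame) -> Frame:
    # Crude lighting model: scale by the ratio of mean brightness
    # between the target frame and the reference frame.
    mean = lambda f: sum(f) / len(f)
    gain = mean(target_src) / mean(ref_src)
    return [min(1.0, p * gain) for p in ref_new]


video = [[1.0, 1.0], [0.5, 0.5]]  # second frame is half as bright
result = replace_text_in_video(
    video, identity_frontalize, toy_replace, gain_transfer, new_text="STOP"
)
```

With these stand-ins, the second output frame is a darker copy of the edited reference, which mirrors the idea of transferring the replaced text while following per-frame lighting. A single global gain is a deliberate simplification; the abstract describes a learned transformation that also models blur and enforces temporal consistency.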
