Text-Driven Stylization of Video Objects

Abstract

We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task, as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to a global target text prompt that describes the global semantics and a local target text prompt that describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to get a similarity score between (1) the local target text and a set of local stylized views, and (2) the global target text and a set of stylized global views. We use a pretrained atlas decomposition network to propagate the edits in a temporally consistent manner. We demonstrate that our method can generate temporally consistent style changes for a variety of objects and videos that adhere to the specification of the target texts. We also show how varying the specificity of the target texts and augmenting the texts with a set of prefixes result in stylizations with different levels of detail. Full results are given on our project webpage: https://sloeschcke.github.io/Text-Driven-Stylization-of-Video-Objects/
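The two similarity scores described above can be illustrated with a minimal sketch. It assumes the text prompts and the stylized views have already been mapped to embedding vectors by CLIP's text and image encoders, so plain NumPy arrays stand in for those embeddings here. The function names, the "1 minus mean cosine similarity" form of the loss, and the weights `w_local` and `w_global` are assumptions for illustration only, not necessarily the objective used in the paper.

```python
import numpy as np

def cosine_similarity(text_emb, view_embs):
    """Cosine similarity between one text embedding (d,) and each of n view embeddings (n, d)."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    view_embs = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    return view_embs @ text_emb

def text_view_loss(text_emb, view_embs):
    """Assumed loss form: 1 minus the mean cosine similarity over all views."""
    return 1.0 - float(np.mean(cosine_similarity(text_emb, view_embs)))

def stylization_loss(local_text, local_views, global_text, global_views,
                     w_local=1.0, w_global=1.0):
    """Weighted sum of the local and global text-to-view terms (weights are hypothetical)."""
    return (w_local * text_view_loss(local_text, local_views)
            + w_global * text_view_loss(global_text, global_views))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 512  # placeholder embedding width; stand-in for a CLIP embedding
    local_text, global_text = rng.normal(size=dim), rng.normal(size=dim)
    local_views = rng.normal(size=(4, dim))   # e.g. 4 zoomed-in stylized crops
    global_views = rng.normal(size=(4, dim))  # e.g. 4 whole-object stylized crops
    print(stylization_loss(local_text, local_views, global_text, global_views))
```

In this sketch, lowering the loss pulls the local views toward the local target text and the global views toward the global target text. The temporal propagation through the atlas decomposition network is not modeled here.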
