Abstract
Despite promising results, style transfer requires style images to be prepared in advance, which can limit creativity and accessibility. Following human instructions, on the other hand, is the most natural way to perform artistic style transfer and can significantly improve controllability for visual effect applications. We introduce a new task, language-driven image style transfer (\texttt{LDIST}), which manipulates the style of a content image guided by a text instruction. We propose the contrastive language visual artist (CLVA), which learns to extract visual semantics from style instructions and accomplishes \texttt{LDIST} with a patch-wise style discriminator. The discriminator considers the correlation between language and patches of style images or transferred results to jointly embed style instructions. CLVA further compares contrastive pairs of content images and style instructions to improve the mutual relativeness between transferred results. Transferred results from the same content image should preserve consistent content structures, and results from style instructions with similar visual semantics should present analogous style patterns. Experiments show that CLVA is effective and achieves superb transferred results on \texttt{LDIST}.