Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift and make the system more robust to complex dynamics and appearance variations. Leveraging recent advances in language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our method, we augment the popular video object segmentation benchmarks DAVIS'16 and DAVIS'17 with language descriptions of target objects. We show that our language-supervised approach performs on par with methods that have access to a pixel-level mask of the target object on DAVIS'16 and is competitive with methods using scribbles on the challenging DAVIS'17 dataset.