Video editing tools are widely used in digital design. Although demand for these tools is high, the prior knowledge they require makes it difficult for novices to get started. Systems that can follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, in which a model edits a source video into a target video under the guidance of a text instruction. LBVE has two defining features: 1) the scenario of the source video is preserved rather than replaced by a completely different video; 2) the semantics are presented differently in the target video, and all changes are controlled by the given instruction. We propose a Multi-Modal Multi-Level Transformer (M$^3$L-Transformer) to carry out LBVE. The M$^3$L-Transformer dynamically learns the correspondence between video perception and language semantics at different levels, which benefits both video understanding and video frame synthesis. We build three new datasets for evaluation: two diagnostic datasets and one drawn from natural videos with human-labeled text. Extensive experimental results show that the M$^3$L-Transformer is effective for video editing and that LBVE can lead to a new field in vision-and-language research.