In this paper, we study text editing in natural images, a task that aims to replace or modify a word in the source image with another one while maintaining its realistic look. This task is challenging, as the styles of both background and text need to be preserved so that the edited image is visually indistinguishable from the source image. Specifically, we propose an end-to-end trainable style retention network (SRNet) that consists of three modules: a text conversion module, a background inpainting module and a fusion module. The text conversion module changes the text content of the source image into the target text while keeping the original text style. The background inpainting module erases the original text and fills the text region with appropriate texture. The fusion module combines the information from the two former modules and generates the edited text images. To our knowledge, this work is the first attempt to edit text in natural images at the word level. Both visual effects and quantitative results on synthetic and real-world datasets (ICDAR 2013) confirm the importance and necessity of modular decomposition. We also conduct extensive experiments to validate the usefulness of our method in various real-world applications such as text image synthesis, augmented reality (AR) translation, information hiding, etc.
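The data flow between the three modules can be illustrated with a minimal sketch. This is not the SRNet implementation: the class name `StyleRetentionPipeline`, the `toy_*` functions, and the choice to pass the target text as a rendered mask are all assumptions made for illustration, and the toy stand-ins simply assume dark text on a uniform light background. Only the composition (text conversion and background inpainting feeding a fusion step) follows the description above.

```python
from dataclasses import dataclass
from typing import Callable, List

# Toy grayscale image: rows of intensities in [0, 1] (1.0 = white).
Image = List[List[float]]


@dataclass
class StyleRetentionPipeline:
    """Hypothetical wrapper; each callable stands in for a trained sub-network."""
    text_conversion: Callable[[Image, Image], Image]  # (source, target mask) -> styled text
    background_inpainting: Callable[[Image], Image]   # source -> text-free background
    fusion: Callable[[Image, Image], Image]           # (styled text, background) -> edited

    def edit(self, source: Image, target_mask: Image) -> Image:
        styled_text = self.text_conversion(source, target_mask)
        background = self.background_inpainting(source)
        return self.fusion(styled_text, background)


def toy_text_conversion(source: Image, target_mask: Image) -> Image:
    # Toy "style": treat the darkest source intensity as the text colour
    # and paint it wherever the target mask marks text (value > 0.5).
    ink = min(min(row) for row in source)
    return [[ink if m > 0.5 else 1.0 for m in row] for row in target_mask]


def toy_background_inpainting(source: Image) -> Image:
    # Toy "inpainting": fill everything with the brightest source intensity.
    paper = max(max(row) for row in source)
    return [[paper for _ in row] for row in source]


def toy_fusion(styled_text: Image, background: Image) -> Image:
    # Toy "fusion": the darker pixel wins, valid only for dark text on light paper.
    return [[min(t, b) for t, b in zip(t_row, b_row)]
            for t_row, b_row in zip(styled_text, background)]


pipeline = StyleRetentionPipeline(
    toy_text_conversion, toy_background_inpainting, toy_fusion
)

source = [[1.0, 0.2, 1.0],
          [1.0, 0.2, 1.0]]       # a vertical stroke on white
target_mask = [[1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0]]  # where the new text should appear

edited = pipeline.edit(source, target_mask)
print(edited)
```

In this sketch the two branches never see each other's output before the fusion step, which mirrors the modular decomposition described above; in a trained network each callable would be a learned model rather than a hand-written rule.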