Scene text editing (STE), which converts the text in a scene image into a desired text while preserving the original style, is a challenging task due to the complex interplay between text and style. To address this challenge, we propose a novel representation learning-based STE model, referred to as RewriteNet, that employs textual information as well as visual information. We assume that a scene text image can be decomposed into content and style features, where the former represents the text information and the latter represents scene text characteristics such as font, alignment, and background. Under this assumption, we propose a method to separately encode the content and style features of the input image by introducing a scene text recognizer that is trained with text information. A text-edited image is then generated by combining the style feature from the original image with the content feature from the target text. Unlike previous works, which can only use synthetic images in the training phase, we also exploit real-world images by proposing a self-supervised training scheme, which bridges the domain gap between synthetic and real data. Our experiments demonstrate that RewriteNet achieves better quantitative and qualitative performance than the compared methods. Moreover, we validate that the use of text information and the self-supervised training scheme improves text switching performance. The implementation and dataset will be made publicly available.
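
The decompose-and-recombine idea described above can be illustrated with a toy sketch. It shows only the data flow (split features into content and style, then pair the source style with the target content); it is not the RewriteNet implementation. The names `Features`, `encode`, `generate`, and `rewrite` are hypothetical, and the fixed vector split stands in for the learned encoders and generator, which would be neural networks in practice.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Features:
    """Hypothetical container for the two disentangled feature groups."""
    content: List[float]  # stands in for the text information
    style: List[float]    # stands in for font, alignment, background, etc.


def encode(feature_vector: Sequence[float], content_dim: int) -> Features:
    """Toy stand-in for a learned encoder: the first `content_dim` entries
    are treated as content and the remaining entries as style."""
    return Features(
        content=list(feature_vector[:content_dim]),
        style=list(feature_vector[content_dim:]),
    )


def generate(style: Sequence[float], content: Sequence[float]) -> List[float]:
    """Toy stand-in for a learned generator: concatenate content and style."""
    return list(content) + list(style)


def rewrite(source: Sequence[float], target: Sequence[float],
            content_dim: int) -> List[float]:
    """Combine the style of `source` with the content of `target`."""
    source_features = encode(source, content_dim)
    target_features = encode(target, content_dim)
    return generate(style=source_features.style,
                    content=target_features.content)


if __name__ == "__main__":
    # Assumed toy inputs: two content entries followed by two style entries.
    source_image = [1.0, 2.0, 9.0, 9.0]
    target_text = [5.0, 6.0, 0.0, 0.0]
    print(rewrite(source_image, target_text, content_dim=2))
```

In this sketch the output keeps the last two entries of the source (its "style") and takes the first two entries from the target (its "content"), mirroring how the edited image is meant to keep the original appearance while showing the new text.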