Abstract
While state-of-the-art language models excel at the style transfer task, current work does not address explainability of style transfer systems. Explanations could be generated using large language models such as GPT-3.5 and GPT-4, but the use of such complex systems is inefficient when smaller, widely distributed, and transparent alternatives are available. We propose a framework to augment and improve a formality style transfer dataset with explanations via model distillation from ChatGPT. To further refine the generated explanations, we propose a novel way to incorporate scarce expert human feedback using in-context learning (ICLEF: In-Context Learning from Expert Feedback) by prompting ChatGPT to act as a critic to its own outputs. We use the resulting dataset of 9,960 explainable formality style transfer instances (e-GYAFC) to show that current openly distributed instruction-tuned models (and, in some settings, ChatGPT) perform poorly on the task, and that fine-tuning on our high-quality dataset leads to significant improvements as shown by automatic evaluation. In human evaluation, we show that models much smaller than ChatGPT fine-tuned on our data align better with expert preferences. Finally, we discuss two potential applications of models fine-tuned on the explainable style transfer task: interpretable authorship verification and interpretable adversarial attacks on AI-generated text detectors.