SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM also outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
