Watermarking Language Models through Language Models

Abstract

This paper presents a novel framework for watermarking language models through prompts generated by language models. The proposed approach utilizes a multi-model setup, incorporating a Prompting language model to generate watermarking instructions, a Marking language model to embed watermarks within generated content, and a Detecting language model to verify the presence of these watermarks. Experiments are conducted using ChatGPT and Mistral as the Prompting and Marking language models, with detection accuracy evaluated using a pretrained classifier model. Results demonstrate that the proposed framework achieves high classification accuracy across various configurations, with 95% accuracy for ChatGPT and 88.79% for Mistral. These findings validate the adaptability of the proposed watermarking strategy across different language model architectures. Hence, the proposed framework holds promise for applications in content attribution, copyright protection, and model authentication.
