A Watermark for Large Language Models

Abstract

Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
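
The mechanism summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give these details, so everything below is an assumption made for illustration: the whitelist is seeded from the previous token and a secret `key`, it covers a fraction `gamma` of the vocabulary, whitelist logits are raised by a constant `delta`, and detection uses a one-sided one-proportion z-test. The function names and the uniform-logit toy model are hypothetical and stand in for a real language model.

```python
import math
import random


def whitelist(prev_token, vocab_size, gamma=0.5, key=42):
    """Pick a pseudo-random fraction `gamma` of the vocabulary.

    Assumption: the random set is seeded by the previous token and a key,
    so a detector holding the key can recompute it without the model.
    """
    rng = random.Random(key * 1_000_003 + prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def sample_watermarked(logits, prev_token, rng, gamma=0.5, delta=2.0, key=42):
    """Softly promote whitelist tokens: add `delta` to their logits, then sample."""
    allowed = whitelist(prev_token, len(logits), gamma, key)
    biased = [x + delta if i in allowed else x for i, x in enumerate(logits)]
    top = max(biased)  # subtract the max for a numerically stable softmax
    weights = [math.exp(b - top) for b in biased]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def detect(tokens, vocab_size, gamma=0.5, key=42):
    """Return (z, p) for the null hypothesis 'text ignores the whitelist'."""
    n = len(tokens) - 1  # the first token has no predecessor to seed a whitelist
    hits = sum(
        tokens[i] in whitelist(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    z = (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail probability
    return z, p


def generate_toy(length, vocab_size, seed=0, watermark=True):
    """Toy generator with uniform logits (hypothetical stand-in for a model)."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    logits = [0.0] * vocab_size
    for _ in range(length):
        if watermark:
            tokens.append(sample_watermarked(logits, tokens[-1], rng))
        else:
            tokens.append(rng.randrange(vocab_size))
    return tokens


if __name__ == "__main__":
    for label, flag in (("watermarked", True), ("plain", False)):
        z, p = detect(generate_toy(200, 1000, watermark=flag), 1000)
        print(f"{label}: z={z:.2f}, p={p:.3g}")
```

In this sketch the detector needs only the token sequence, the vocabulary size, and the key, which mirrors the claim that detection does not require access to the model API or parameters. Because the bias is additive rather than a hard constraint, tokens outside the whitelist can still be sampled when the model strongly prefers them.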
