When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text

Abstract

Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
