Abstract
Developing algorithms to differentiate between machine-generated texts and human-written texts has garnered substantial attention in recent years. Existing methods in this direction typically consider an offline setting, where a dataset containing a mix of real and machine-generated texts is given upfront and the task is to determine whether each sample in the dataset comes from a large language model (LLM) or a human. However, in many practical scenarios, sources such as news websites, social media accounts, or other forums publish content in a streaming fashion. In this online scenario, determining quickly and accurately, with strong statistical guarantees, whether a source is an LLM is crucial for these media or platforms to function effectively and to prevent the spread of misinformation and other potential misuse of LLMs. To tackle the problem of online detection, we develop an algorithm based on the techniques of sequential hypothesis testing by betting. The algorithm not only builds upon and complements existing offline detection techniques but also enjoys statistical guarantees that cover a controlled false positive rate and the expected time to correctly identify a source as an LLM. Experiments demonstrate the effectiveness of our method.