Abstract
The introduction of ChatGPT has garnered widespread attention in both academic and industrial communities. ChatGPT is able to respond effectively to a wide range of human questions, providing fluent and comprehensive answers that significantly surpass previous public chatbots in terms of security and usefulness. On one hand, people are curious about how ChatGPT is able to achieve such strength and how far it is from human experts. On the other hand, people are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society, such as fake news, plagiarism, and social security issues. In this work, we collect tens of thousands of comparison responses from both human experts and ChatGPT, with questions spanning open-domain, financial, medical, legal, and psychological areas. We call the collected dataset the Human ChatGPT Comparison Corpus (HC3). Based on the HC3 dataset, we study the characteristics of ChatGPT's responses, the differences and gaps from human experts, and future directions for LLMs. We conduct comprehensive human evaluations and linguistic analyses of ChatGPT-generated content compared with that of humans, which reveal many interesting results. After that, we conduct extensive experiments on how to effectively detect whether a given text is generated by ChatGPT or by humans. We build three different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios. The dataset, code, and models are all publicly available at https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.