Investigating Training Data Detection in AI Coders

Abstract

Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code's structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight CodeLLMs. To support this evaluation, we introduce CodeSnitch, a function-level benchmark dataset comprising 9,000 code samples in three programming languages, each explicitly labeled as either included or excluded from CodeLLM training. Beyond evaluation on the original CodeSnitch, we design targeted mutation strategies to test the robustness of TDD methods under three distinct settings. These mutation strategies are grounded in the well-established Type-1 to Type-4 code clone detection taxonomy. Our study provides a systematic assessment of current TDD techniques for code and offers insights to guide the development of more effective and robust detection methods in the future.
