Abstract
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized dataset, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing their results with human performance yields insight into the potential of AI-driven cybersecurity solutions for real-world threat management. We make our dataset open source and publicly available at https://github.com/NYU-LLM-CTF/LLM_CTF_Database, along with our automated playground framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.