Abstract
Recent advancements in LLMs indicate potential for novel applications, e.g., through reasoning capabilities in the latest OpenAI and DeepSeek models. To apply these models in specific domains beyond text generation, LLM-based multi-agent approaches can be utilized that solve complex tasks by combining reasoning techniques, code generation, and software execution. Applications might utilize these capabilities and the knowledge of specialized LLM agents. However, while many evaluations are performed on LLMs, reasoning techniques, and applications individually, their joint specification and combined application are not well explored. Defined specifications for multi-agent LLM systems are required to explore their potential and their suitability for specific applications, allowing for systematic evaluations of LLMs, reasoning techniques, and related aspects. This paper reports the results of exploratory research to specify and evaluate these aspects through a multi-agent system. The system architecture and prototype are extended from previous research, and a specification is introduced for multi-agent systems. Test cases involving cybersecurity tasks indicate the feasibility of the architecture and evaluation approach. In particular, the results show the evaluation of question answering, server security, and network security tasks that were completed correctly by agents with LLMs from OpenAI and DeepSeek.