OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Abstract

Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential to maintaining the orderly and stable operation of existing information systems. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, are showing great potential in the field of AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs in Ops tasks is yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability levels. The benchmark includes 7184 multi-choice questions and 1736 questions in question-answering (QA) format, in English and Chinese. By conducting a comprehensive performance evaluation of the current leading large language models, we show how various LLM techniques can affect performance on Ops tasks, and discuss findings related to various topics, including model quantification, QA evaluation, and hallucination issues. To ensure the credibility of our evaluation, we invited dozens of domain experts to manually review our questions. At the same time, we have open-sourced 20% of the test QA to assist current researchers in preliminary evaluations of their OpsLLM models. The remaining 80% of the data, which is not disclosed, is used to eliminate the issue of test set leakage. Additionally, we have constructed an online leaderboard that is updated in real time and will continue to be updated, ensuring that any newly emerging LLMs will be evaluated promptly. Both our dataset and leaderboard have been made public.
