Process-Level Trajectory Evaluation for Environment Configuration in Software Engineering Agents

Abstract

Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond aggregate success rates. Evaluations across state-of-the-art LLMs and agent frameworks show that while agents can localize errors, they struggle to translate feedback into effective corrections, limiting end-to-end performance. To our knowledge, Enconda-bench is the first framework to provide process-level internal capability assessment for environment configuration, offering actionable insights for improving software engineering agents.
