Abstract
The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation (either focusing on a single user simulator or a specific system design), limiting the generalisability of insights across architectures and configurations. In this work, we propose clem todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem todd's flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.