Abstract
As the capabilities of chatbots and their underlying LLMs continue to improve dramatically, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is that the available benchmarking datasets are largely static, outdated, and lacking in multilingual coverage, which limits their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate multilingual user-chatbot dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new multilingual meta-evaluation benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.