Abstract
Understanding the non-literal meaning of an utterance is critical for large language models (LLMs) to become human-like social communicators. In this work, we introduce SwordsmanImp, the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature, sourced from dialogues in the Chinese sitcom $\textit{My Own Swordsman}$. It includes 200 carefully handcrafted questions, each annotated with the Gricean maxims that have been violated. We test eight closed-source and open-source LLMs on two tasks: a multiple-choice question task and an implicature explanation task. Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions. CausalLM follows GPT-4 with an accuracy of 78.5%. Other models, including GPT-3.5 and several open-source models, achieve lower accuracies ranging from 20% to 60% on multiple-choice questions. Human raters were asked to rate the explanations of the implicatures generated by LLMs on their reasonability, logic, and fluency. While all models generate largely fluent and self-consistent text, their explanations score low on reasonability, except for GPT-4's, suggesting that most LLMs cannot produce satisfactory explanations of the implicatures in the conversation. Moreover, we find that LLMs' performance does not vary significantly across Gricean maxims, suggesting that LLMs do not seem to process implicatures derived from different maxims differently. Our data and code are available at https://github.com/sjtu-compling/llm-pragmatics.