Abstract
This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs' performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available.