Abstract
While large language models have shown exciting progress on several NLP benchmarks, evaluating their ability for complex analogical reasoning remains under-explored. Here, we introduce a high-quality crowdsourced dataset of narratives for employing proverbs in context as a benchmark for abstract language understanding. The dataset provides fine-grained annotation of aligned spans between proverbs and narratives, and contains minimal lexical overlaps between narratives and proverbs, ensuring that models need to go beyond surface-level reasoning to succeed. We explore three tasks: (1) proverb recommendation and alignment prediction, (2) narrative generation for a given proverb and topic, and (3) identifying narratives with similar motifs. Our experiments show that neural language models struggle in our tasks compared to humans, and the tasks pose multiple learning challenges.