Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

Abstract

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
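To make the distinction between plain coverage and ordered coverage concrete, the following is a minimal Python sketch. It is not the benchmark's scoring code: the function names `coverage` and `in_specified_order` are hypothetical, and matching by lowercase substring is an assumption made for brevity (a real evaluation would likely need lemmatization or token-level matching, since substring search can produce false positives such as "cat" inside "catch").

```python
def coverage(concepts, sentence):
    """Fraction of concepts that appear anywhere in the sentence.

    Assumption: a concept counts as present if its lowercase form
    occurs as a substring of the lowercase sentence.
    """
    if not concepts:
        return 0.0
    text = sentence.lower()
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def in_specified_order(concepts, sentence):
    """True if every concept appears, left to right, in the given order.

    Each concept is searched for only after the end of the previous
    match, so a sentence that contains all concepts in a different
    order returns False.
    """
    text = sentence.lower()
    cursor = 0
    for concept in concepts:
        idx = text.find(concept.lower(), cursor)
        if idx == -1:
            return False
        cursor = idx + len(concept)
    return True


sentence = "The dog runs to catch the frisbee after the throw."
print(coverage(["dog", "catch", "frisbee", "throw"], sentence))
print(in_specified_order(["dog", "catch", "frisbee", "throw"], sentence))
print(in_specified_order(["throw", "dog", "catch", "frisbee"], sentence))
```

In this sketch the same sentence covers all four concepts under either ordering, but it satisfies only the first specified order. That gap is the kind of failure described above, where a model returns an identical result even when the concept order is altered.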
