Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

  • 2025-06-18 18:00:54
  • Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Abstract

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.

 
