Abstract
Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.