Multi-Domain Explainability of Preferences

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.
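
To make the idea of separating domain-general from domain-specific effects concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not specify the model's exact form, so everything here is an assumption made for illustration: a logistic regression whose per-domain weight vector is a shared vector plus a domain-specific deviation, an L2 penalty on the deviations only, plain gradient descent, and synthetic concept vectors. The names `fit_hierarchical`, `predict_proba`, and `l2_domain` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hierarchical(X, y, domains, n_domains, lr=0.5, l2_domain=0.01, steps=2000):
    """Fit p(y=1 | x, d) = sigmoid(x . (w_shared + w_domain[d])).

    Assumed form, for illustration only: w_shared holds domain-general
    concept weights, w_domain[d] holds the deviation for domain d.
    Only the deviations are L2-penalized, so an effect is attributed to
    a domain only when the shared weights cannot account for it.
    """
    n, k = X.shape
    w_shared = np.zeros(k)
    w_domain = np.zeros((n_domains, k))
    for _ in range(steps):
        w_eff = w_shared + w_domain[domains]      # (n, k) effective weights per example
        p = sigmoid(np.sum(X * w_eff, axis=1))
        err = (p - y)[:, None] * X                # per-example gradient of the log loss
        grad_shared = err.mean(axis=0)
        grad_domain = np.zeros_like(w_domain)
        np.add.at(grad_domain, domains, err)      # accumulate gradients by domain
        grad_domain = grad_domain / n + l2_domain * w_domain
        w_shared -= lr * grad_shared
        w_domain -= lr * grad_domain
    return w_shared, w_domain

def predict_proba(X, domains, w_shared, w_domain):
    return sigmoid(np.sum(X * (w_shared + w_domain[domains]), axis=1))

# Synthetic data (hypothetical): 3 concepts, 2 domains.
# Concepts 0 and 2 act the same way everywhere; concept 1 flips sign by domain.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
domains = rng.integers(0, 2, size=400)
true_shared = np.array([1.5, 0.0, -1.0])
true_domain = np.array([[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])
y = (np.sum(X * (true_shared + true_domain[domains]), axis=1) > 0).astype(float)

w_shared, w_domain = fit_hierarchical(X, y, domains, n_domains=2)
accuracy = np.mean((predict_proba(X, domains, w_shared, w_domain) > 0.5) == y)
print("shared weights:", np.round(w_shared, 2))
print("domain deviations:", np.round(w_domain, 2))
print("training accuracy:", accuracy)
```

Because the model is linear in the concept vector, the fitted weights can be read directly: `w_shared` suggests which concepts matter in every domain, while a large entry in `w_domain[d]` flags a concept whose influence is particular to domain `d`. This is the sense in which such a model is white-box under the assumptions above.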
