Abstract
Recent research on explainable recommendation generally frames the task as a standard text generation problem, and evaluates models simply based on the textual similarity between the predicted and ground-truth explanations. However, this approach fails to consider one crucial aspect of the systems: whether their outputs accurately reflect the users' (post-purchase) sentiments, i.e., whether and why they would like and/or dislike the recommended items. To shed light on this issue, we introduce new datasets and evaluation methods that focus on the users' sentiments. Specifically, we construct the datasets by explicitly extracting users' positive and negative opinions from their post-purchase reviews using an LLM, and propose to evaluate systems based on whether the generated explanations 1) align well with the users' sentiments, and 2) accurately identify both positive and negative opinions of users on the target items. We benchmark several recent models on our datasets and demonstrate that achieving strong performance on existing metrics does not ensure that the generated explanations align well with the users' sentiments. Lastly, we find that existing models can provide more sentiment-aware explanations when the users' (predicted) ratings for the target items are directly fed into the models as input. The datasets and benchmark implementation are available at: https://github.com/jchanxtarov/sent_xrec.