An Aligning and Training Framework for Multimodal Recommendations

Abstract

With the development of multimedia applications, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond user interactions. Existing methods mainly regard multimodal information as auxiliary, using it to help learn ID features; however, there are semantic gaps between multimodal content features and ID features, so directly using multimodal information as an auxiliary leads to misalignment in the representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations, and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments, namely alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To train AlignRec effectively, we propose first pre-training the first alignment to obtain unified multimodal features, and subsequently training the remaining two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec compared to nine baselines. We also find that the multimodal features generated by AlignRec are better than currently used ones, and these features are to be open-sourced.
