Abstract
Language representation learning has emerged as a promising approach for sequential recommendation, thanks to its ability to learn generalizable representations. However, despite its advantages, this approach still struggles with data sparsity and a limited understanding of common-sense user preferences. To address these limitations, we propose $\textbf{JEPA4Rec}$, a framework that combines $\textbf{J}$oint $\textbf{E}$mbedding $\textbf{P}$redictive $\textbf{A}$rchitecture with language modeling of item textual descriptions. JEPA4Rec captures semantically rich and transferable representations, improving recommendation performance and reducing reliance on large-scale pre-training data. Specifically, JEPA4Rec represents items as text sentences by flattening descriptive information such as $\textit{title}$, $\textit{category}$, and other attributes. To encode these sentences, we employ a bidirectional Transformer encoder with modified embedding layers tailored for capturing item information in recommendation datasets. We apply masking to text sentences and use them to predict the representations of the unmasked sentences, helping the model learn generalizable item embeddings. To further improve recommendation performance and language understanding, we employ a two-stage training strategy incorporating self-supervised learning losses. Experiments on six real-world datasets demonstrate that JEPA4Rec consistently outperforms state-of-the-art methods, particularly in cross-domain, cross-platform, and low-resource scenarios.
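
As an illustration of the item-flattening and masking steps described above, the following is a minimal sketch and not the JEPA4Rec implementation. The function names `flatten_item` and `mask_tokens`, the `[MASK]` placeholder, the whitespace tokenization, and the 30% masking ratio are all assumptions made for this example. The encoder and the representation predictor are omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the token actually used is not stated here


def flatten_item(item: dict) -> list[str]:
    """Flatten an item's attributes into one token sequence.

    Each attribute name is followed by the whitespace-split tokens of its value.
    """
    tokens = []
    for key, value in item.items():
        tokens.append(key)
        tokens.extend(str(value).split())
    return tokens


def mask_tokens(tokens: list[str], ratio: float, rng: random.Random) -> list[str]:
    """Return a copy of `tokens` with a random subset replaced by MASK."""
    n_mask = max(1, int(len(tokens) * ratio))  # always mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


# Toy item with illustrative attribute values (not taken from any dataset).
item = {"title": "Wireless Optical Mouse", "category": "Electronics Accessories"}

sentence = flatten_item(item)  # unmasked sentence: source of the target representation
masked = mask_tokens(sentence, ratio=0.3, rng=random.Random(0))  # masked context input
```

In a joint-embedding predictive setup, the masked sequence would be encoded as context and used to predict the representation of the unmasked sequence, rather than to reconstruct the missing tokens themselves.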