Abstract
Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
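To make the translation idea concrete, the following is a minimal sketch of TransE-style scoring, in which an item is ranked by how close its embedding lies to the sum of the user and query embeddings. It is an illustration under stated assumptions, not the JAM implementation: the Euclidean distance, the two-dimensional toy vectors, and the names `translation_score` and `rank_items` are all hypothetical, and the multimodal aggregation step is omitted.

```python
import math

def translation_score(user, query, item):
    """Negative Euclidean distance between (user + query) and item.

    Assumption: Euclidean distance is used purely for illustration;
    the abstract does not specify the scoring function.
    """
    return -math.sqrt(sum((u + q - i) ** 2 for u, q, i in zip(user, query, item)))

def rank_items(user, query, items):
    """Return item ids sorted from best to worst translation score."""
    return sorted(
        items,
        key=lambda name: translation_score(user, query, items[name]),
        reverse=True,
    )

# Toy embeddings in a shared 2-d latent space (made-up values).
user = [0.2, 0.1]
query = [0.5, -0.3]
items = {
    "track_a": [0.7, -0.2],
    "track_b": [-0.4, 0.9],
    "track_c": [0.6, 0.1],
}

ranking = rank_items(user, query, items)
print(ranking)
```

In this toy setup the translated point `user + query` lands essentially on `track_a`, so it is ranked first. In a retrieval setting of this kind, the item embeddings could be precomputed and indexed, so only the user and query vectors would need to be computed at request time.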