Abstract
Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference-labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
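As a rough illustration of the weighting idea described above, the following sketch computes stepwise weights for one preference-labeled segment by nearest-neighbor search over expert state-action pairs, then plugs them into a Bradley-Terry preference probability. The abstract does not specify the similarity measure, the normalization, or how the weights enter the preference model, so the negative squared Euclidean distance, the softmax with a `temperature` parameter, and the weighted sum of rewards are all assumptions made for this example, as are the function names `spw_weights` and `weighted_preference_prob`.

```python
import numpy as np


def spw_weights(segment, expert, temperature=1.0):
    """Hypothetical stepwise weights for one trajectory segment.

    segment: (T, d) array of concatenated state-action vectors.
    expert:  (N, d) array of expert state-action vectors.
    Returns a (T,) array of non-negative weights that sum to 1.
    """
    # Pairwise squared Euclidean distances, shape (T, N).
    diffs = segment[:, None, :] - expert[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Assumed similarity score: negative distance to the closest expert pair.
    sim = -dists.min(axis=1)
    # Assumed normalization: numerically stable softmax over the segment.
    z = (sim - sim.max()) / temperature
    w = np.exp(z)
    return w / w.sum()


def weighted_preference_prob(rewards_a, rewards_b, w_a, w_b):
    """Bradley-Terry probability that segment A is preferred over B,
    using weighted sums of per-step rewards (an assumed way to apply
    the weights)."""
    ret_a = np.sum(w_a * rewards_a)
    ret_b = np.sum(w_b * rewards_b)
    return 1.0 / (1.0 + np.exp(ret_b - ret_a))


# Toy data: two expert pairs and a three-step segment.
expert = np.array([[0.0, 0.0], [1.0, 1.0]])
segment = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.1]])
weights = spw_weights(segment, expert)
```

In this toy setup, the step that coincides with an expert pair receives the largest weight and the step far from every expert pair receives the smallest, which is the qualitative behavior the abstract describes.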