Abstract
Query-product relevance prediction is a foundational technology in e-commerce search engines and has become increasingly important in AI-driven e-commerce. The recent emergence of LLMs, particularly their CoT reasoning capabilities, offers promising opportunities for developing relevance systems that are both more interpretable and more robust. However, existing training paradigms have notable limitations: SFT and DPO suffer from poor generalization on long-tail queries and from a lack of fine-grained, stepwise supervision to enforce rule-aligned reasoning. In contrast, reinforcement learning with verification rewards (RLVR) suffers from sparse feedback, which provides insufficient signal to correct erroneous intermediate steps, thereby undermining logical consistency and limiting performance in complex inference scenarios. To address these challenges, we introduce the Stepwise Hybrid Examination Reinforcement Learning framework for search relevance (SHE). At its core is Stepwise Reward Policy Optimization (SRPO), a reinforcement learning algorithm that leverages step-level rewards generated by a hybrid of a high-quality generative stepwise reward model and a human-annotated offline verifier, prioritizing learning from critical correct and incorrect reasoning steps. To bolster robustness and generalization, SHE further integrates a dual-strategy optimization: diversified data filtering, which broadens the exploration of reasoning trajectories to preempt policy entropy collapse, and a multi-stage curriculum learning protocol that systematically orchestrates progressive capability acquisition. Extensive experiments on real-world search benchmarks show that SHE improves both reasoning quality and relevance-prediction accuracy in large-scale e-commerce settings, outperforming SFT, DPO, GRPO, and other baselines, while also enhancing interpretability and robustness.