Standard reinforcement learning (RL) aims to find an optimal policy that identifies the best action for each state. However, in healthcare settings, many actions may be near-equivalent with respect to the reward (e.g., survival). We consider an alternative objective: learning set-valued policies that capture near-equivalent actions leading to similar cumulative rewards. We propose a model-free algorithm based on temporal difference learning and a near-greedy heuristic for action selection. We analyze the theoretical properties of the proposed algorithm, provide optimality guarantees, and demonstrate our approach on simulated environments and a real clinical task. Empirically, the proposed algorithm exhibits good convergence properties and discovers meaningful near-equivalent actions. Our work provides theoretical, as well as practical, foundations for clinician/human-in-the-loop decision making, in which humans (e.g., clinicians, patients) can incorporate additional knowledge (e.g., side effects, patient preference) when selecting among near-equivalent actions.
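To make the idea of a set-valued, near-greedy policy concrete, the following is a minimal illustrative sketch rather than the algorithm as specified in the paper. It assumes a tabular value function `Q`, a hypothetical tolerance parameter `zeta`, non-negative action values, and a conservative bootstrap that takes the worst value within the near-greedy set of the next state. All function and parameter names are introduced here for illustration only.

```python
import numpy as np


def near_greedy_set(q_values, zeta):
    """Return the indices of actions whose value is within a fraction
    `zeta` of the best action's value (assumes non-negative values)."""
    q_values = np.asarray(q_values, dtype=float)
    best = q_values.max()
    # Keep every action whose value is at least (1 - zeta) times the best value.
    return np.flatnonzero(q_values >= (1.0 - zeta) * best)


def td_update(Q, s, a, r, s_next, done, zeta, alpha=0.1, gamma=0.99):
    """Apply one temporal-difference update to the table Q in place.

    Assumption: the bootstrap target uses the smallest value among the
    near-greedy actions in the next state, i.e. it is pessimistic about
    which near-equivalent action will later be chosen.
    """
    if done:
        target = r
    else:
        candidates = near_greedy_set(Q[s_next], zeta)
        target = r + gamma * Q[s_next, candidates].min()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Under these assumptions, setting `zeta = 0` makes the set contain only the maximizing action(s), so the update reduces to the usual Q-learning target; larger values of `zeta` admit more actions into the set that is presented to the human decision maker.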