Abstract
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.