PokerBench: Training Large Language Models to become Professional Poker Players

Abstract

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete-information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus serves both as a quick and reliable benchmark for evaluating the poker-playing ability of LLMs and as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.
