Abstract
Large Language Models (LLMs) have demonstrated strong performance on tasks with short time frames, but struggle with tasks requiring longer durations. While datasets covering extended-duration tasks, such as software engineering tasks or video games, do exist, there are currently few implementations of complex board games specifically designed for reinforcement learning and LLM evaluation. To address this gap, we propose the 4Hammer reinforcement learning environment, a digital twin simulation of a subset of Warhammer 40,000, a complex, zero-sum board game. Warhammer 40,000 features intricate rules, requiring human players to thoroughly read and understand over 50 pages of detailed natural language rules, grasp the interactions between their game pieces and those of their opponents, and independently track and communicate the evolving game state.