Abstract
Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions. To address these limitations, we propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline. CAMEL leverages dynamic action masking and an adaptive epsilon-masking mechanism to guide exploration during early training stages while gradually enabling agents to optimize policies independently. At the core of CAMEL lies the integration of Python-executable suboptimal policies generated by LLMs based on environment descriptions and task objectives. Although simplistic and hard-coded, these policies offer valuable initial guidance for RL agents. To effectively utilize these priors, CAMEL employs masking-aware optimization to dynamically constrain the action space based on LLM outputs. Additionally, epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling agents to transition from constrained exploration to autonomous policy refinement. Experimental validation on Gymnasium MuJoCo environments demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated policies significantly improve sample efficiency, achieving performance comparable to or surpassing expert masking baselines. For Walker2d-v4, where LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust RL performance without notable degradation, highlighting the framework's adaptability across diverse tasks. While CAMEL shows promise in enhancing sample efficiency and mitigating convergence challenges, these issues remain open for further research. Future work aims to generalize CAMEL to multimodal LLMs for broader observation-action spaces and automate policy evaluation, reducing human intervention and enhancing scalability in RL training pipelines.
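The interplay between action masking and epsilon-masking described above can be illustrated with a minimal Python sketch. It is not the CAMEL implementation: the linear decay schedule, the representation of a mask as per-dimension (low, high) bounds, the clipping rule, and every name (`epsilon_schedule`, `prior_bounds`, `apply_epsilon_mask`) are assumptions made here for illustration, and the toy prior merely stands in for an LLM-generated policy.

```python
import random


def epsilon_schedule(step, total_steps, eps_start=1.0, eps_end=0.0):
    """Masking probability at a given step (assumed linear decay)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def prior_bounds(observation, action_dim):
    """Hypothetical hard-coded prior returning (low, high) per action dimension.

    Toy rule, for illustration only: restrict the sign of the first action
    dimension to oppose the sign of the first observation entry.
    """
    bounds = [(-1.0, 1.0)] * action_dim
    if observation[0] > 0.0:
        bounds[0] = (-1.0, 0.0)
    else:
        bounds[0] = (0.0, 1.0)
    return bounds


def apply_epsilon_mask(action, bounds, epsilon, rng):
    """With probability epsilon, clip the agent's action into the prior's bounds.

    Returns the (possibly clipped) action and a flag telling whether the
    mask was applied, so a learner could account for it.
    """
    if rng.random() >= epsilon:
        return list(action), False
    masked = [min(max(a, lo), hi) for a, (lo, hi) in zip(action, bounds)]
    return masked, True


rng = random.Random(0)
obs = [0.3, -0.1]
raw_action = [0.8, -0.5]

# Early in training epsilon is 1.0, so the prior always constrains the action.
early = apply_epsilon_mask(raw_action, prior_bounds(obs, 2),
                           epsilon_schedule(0, 1000), rng)

# At the end of the schedule epsilon is 0.0, so the agent acts unconstrained.
late = apply_epsilon_mask(raw_action, prior_bounds(obs, 2),
                          epsilon_schedule(1000, 1000), rng)
```

In this sketch the early action is clipped into the prior's bounds while the late action passes through unchanged, mirroring the intended transition from constrained exploration to autonomous policy refinement. How the masked actions enter the learner's update (the masking-aware optimization) is outside the scope of this sketch.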