Abstract
We introduce MLE-Dojo, a Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.