Abstract
Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Recently, reinforcement learning (RL) has emerged as a powerful method for controlling AVs in complex marine environments. However, scaling these techniques to a fleet, which is essential for multi-target tracking or for targets with rapid, unpredictable motion, presents significant computational challenges. Multi-Agent Reinforcement Learning (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo's LRAUV provide 100x faster-than-real-time single-robot simulations, they offer no significant speedup for multi-vehicle scenarios, making MARL training impractical. To address these limitations, we propose an iterative distillation method that transfers high-fidelity simulations into a simplified, GPU-accelerated environment while preserving high-level dynamics. This approach achieves up to a 30,000x speedup over Gazebo through parallelization, enabling efficient training via end-to-end GPU acceleration. Additionally, we introduce a novel Transformer-based architecture (TransfMAPPO) that learns multi-agent policies invariant to the number of agents and targets, significantly improving sample efficiency. Following large-scale curriculum learning conducted entirely on GPU, we perform extensive evaluations in Gazebo, demonstrating that our method maintains tracking errors below 5 meters over extended durations, even in the presence of multiple fast-moving targets. This work bridges the gap between large-scale MARL training and high-fidelity deployment, providing a scalable framework for autonomous fleet control in real-world sea missions.