Abstract
Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL, a novel reinforcement learning (RL) training framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. To complement this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM on groups of intermediate representations (IRs): the LLM is guided to distinguish IRs that are equivalent to the source code from inequivalent ones, while also exploiting signals about the mutual equivalence between IRs within a group. This process allows LLMs to capture nuanced aspects of code functionality. By training with OORL on code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that training LLMs with OORL on code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.
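
To make the unit-test-based reward concrete, the following is a minimal sketch of one way such a signal could be computed. It is not the implementation used in this work. The binary pass/fail scheme, the name `unit_test_reward`, and the representation of a translated program as a Python callable are all assumptions made for illustration. In practice the translated code would be compiled and executed in the target language, typically inside a sandbox.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected result).
TestCase = Tuple[tuple, Any]


def unit_test_reward(candidate: Callable[..., Any],
                     tests: Sequence[TestCase]) -> float:
    """Return 1.0 if the candidate passes every unit test, else 0.0.

    Assumed binary scheme: any wrong output or raised exception
    yields a reward of 0.0, which is what makes the signal coarse-grained.
    """
    if not tests:
        # With no tests there is no evidence of correctness.
        return 0.0
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # Runtime errors in the candidate count as failures.
            return 0.0
    return 1.0


# Usage: two stand-in "translations" of an addition routine.
tests = [((2, 3), 5), ((0, 0), 0)]
correct_reward = unit_test_reward(lambda a, b: a + b, tests)
wrong_reward = unit_test_reward(lambda a, b: a * b, tests)
```

A reward of this kind only reports whether all tests pass. It carries no information about how close a failing translation is to the source program, which is the gap a finer-grained preference signal such as GEPO is meant to fill.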