From Reasoning to Code: GRPO Optimization for Underrepresented Languages

Abstract

Generating accurate and executable code using large language models (LLMs) is challenging for languages with limited public training data compared to popular languages such as Python. This paper introduces a generalizable approach that uses small-scale code versions of the Qwen 2.5 model combined with Group Relative Policy Optimization (GRPO) to enable effective code generation through explicit reasoning steps, which is particularly beneficial for languages with smaller source code databases. Using Prolog as a representative use case -- given its limited online presence -- the initial model faced challenges in generating executable code. After some training steps, the model successfully produces logically consistent and syntactically accurate code by directly integrating reasoning-driven feedback into the reinforcement learning loop. Experimental evaluations using mathematical logic problem benchmarks illustrate significant improvements in reasoning quality, code accuracy, and logical correctness, underscoring the potential of this approach to benefit a wide range of programming languages lacking extensive training resources.
