Abstract
Recent large language models (LLMs) have demonstrated remarkable generalization abilities in mathematics and logical reasoning tasks. Prior research indicates that LLMs pre-trained with programming language data exhibit high mathematical and reasoning abilities; however, this causal relationship has not been rigorously tested. Our research aims to verify which programming languages and features during pre-training affect logical inference performance. Specifically, we pre-trained decoder-based language models from scratch using datasets from ten programming languages (e.g., Python, C, Java) and three natural language datasets (Wikipedia, Fineweb, C4) under identical conditions. Thereafter, we evaluated the trained models in a few-shot in-context learning setting on two logical reasoning tasks, FLD and bAbi, which do not require commonsense or world knowledge. The results demonstrate that nearly all models trained with programming languages consistently outperform those trained with natural languages, indicating that programming languages contain factors that elicit logical inference performance. In addition, we found that models trained with programming languages follow instructions better than those trained with natural languages. Further analysis reveals that the depth of the Abstract Syntax Trees (ASTs) obtained by parsing the programs also affects logical reasoning performance. These findings offer insights into the essential elements of pre-training for acquiring the foundational abilities of LLMs.
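
To make the notion of AST depth concrete, the following is a minimal sketch that measures the depth of a parsed Python program with the standard-library `ast` module. It is an illustration only: the abstract does not specify how depth was computed, and the helper names `ast_depth` and `source_depth`, as well as the convention of counting nodes along the longest root-to-leaf path, are assumptions made here.

```python
import ast


def ast_depth(node: ast.AST) -> int:
    """Count the nodes on the longest root-to-leaf path under `node`.

    Assumption: depth is measured in nodes, so a childless node has depth 1.
    """
    children = list(ast.iter_child_nodes(node))
    if not children:
        return 1
    return 1 + max(ast_depth(child) for child in children)


def source_depth(source: str) -> int:
    """Parse Python source text and return the depth of its AST."""
    return ast_depth(ast.parse(source))


# Two hypothetical snippets: flat assignments versus nested control flow.
flat = "x = 1\ny = 2\n"
nested = "for i in range(3):\n    if i:\n        print(i)\n"

print(source_depth(flat), source_depth(nested))
```

Under this convention, the nested snippet yields a deeper tree than the flat one, which is the kind of structural difference the analysis refers to. Other languages would need their own parsers, and a different depth convention (for example, counting edges rather than nodes) would shift every value by a constant.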