Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail on simple problems, raising a critical question: do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks, as recent work does, we investigate this question using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8--99.8\% accuracy on numerical addition, performance collapses to $\leq$7.5\% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this conclusion. Explicitly providing addition rules degrades performance by 81.2\% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on pattern memorization rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
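To make the two probes concrete, the following is a minimal Python sketch of how an isomorphic symbolic mapping and a commutativity check could be set up. It is an illustration under stated assumptions, not the evaluation code used in this work: the choice of lowercase letters as symbols, the seeded random bijection, and the helper names `make_digit_mapping`, `encode`, `decode`, `symbolic_problem`, and `is_commutative` are all hypothetical.

```python
import random
import string


def make_digit_mapping(seed=0):
    """Build a bijection from the digits '0'-'9' to ten distinct letters.

    Assumption: the abstract only gives the example 7 -> y; the symbol
    alphabet and the way the mapping is drawn here are illustrative.
    """
    rng = random.Random(seed)
    symbols = rng.sample(string.ascii_lowercase, 10)  # ten distinct letters
    return dict(zip("0123456789", symbols))


def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit using the mapping."""
    return "".join(mapping[d] for d in str(n))


def decode(s, mapping):
    """Invert encode(): map each symbol back to its digit."""
    inverse = {sym: digit for digit, sym in mapping.items()}
    return int("".join(inverse[c] for c in s))


def symbolic_problem(a, b, mapping):
    """Return (prompt, expected answer) for a + b in symbolic form."""
    prompt = f"{encode(a, mapping)} + {encode(b, mapping)}"
    return prompt, encode(a + b, mapping)


def is_commutative(answer_fn, a, b):
    """Check whether a solver gives the same answer for (a, b) and (b, a).

    answer_fn is a hypothetical stand-in for querying a model.
    """
    return answer_fn(a, b) == answer_fn(b, a)
```

Because the mapping is a bijection on digits, the symbolic task has exactly the same structure as ordinary addition; a solver that has learned the carry rules rather than surface patterns should in principle be unaffected by the relabeling. In practice, `answer_fn` would wrap a model call, and a commutativity violation would be any pair for which `is_commutative` returns `False`.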