Abstract
We introduce DecompSR (decomposed spatial reasoning), a large benchmark dataset (over 5 million data points) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner which makes it correct by construction, and its correctness is independently verified using a symbolic solver. We benchmark a host of Large Language Models (LLMs) on DecompSR and show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks, whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.