Abstract
The suite of datasets commonly used to train and evaluate the mathematicalcapabilities of AI-based mathematical copilots (primarily large languagemodels) exhibit several shortcomings. These limitations include a restrictedscope of mathematical complexity, typically not exceeding lowerundergraduate-level mathematics, binary rating protocols and other issues,which makes comprehensive proof-based evaluation suites difficult. Wesystematically explore these limitations and contend that enhancing thecapabilities of large language models, or any forthcoming advancements inAI-based mathematical assistants (copilots or "thought partners"), necessitatesa paradigm shift in the design of mathematical datasets and the evaluationcriteria of mathematical ability: It is necessary to move away fromresult-based datasets (theorem statement to theorem proof) and convert the richfacets of mathematical research practice to data LLMs can train on. Examples ofthese are mathematical workflows (sequences of atomic, potentiallysubfield-dependent tasks that are often performed when creating newmathematics), which are an important part of the proof-discovery process.Additionally, we advocate for mathematical dataset developers to consider theconcept of "motivated proof", introduced by G. P\'olya in 1949, which can serveas a blueprint for datasets that offer a better proof learning signal,alleviating some of the mentioned limitations. Lastly, we introduce mathdatasheets for datasets, extending the general, dataset-agnostic variants ofdatasheets: We provide a questionnaire designed specifically for math datasetsthat we urge dataset creators to include with their datasets. This will makecreators aware of potential limitations of their datasets while at the sametime making it easy for readers to assess it from the point of view of trainingand evaluating mathematical copilots.