Abstract
This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1's architecture, known for its reliance on token-based reasoning, can achieve accurate solutions through a multi-step process. The study compares DeepSeek R1 with four other models (gemini-1.5-flash-8b, gpt-4o-mini-2024-07-18, llama3.1:8b, and mistral-8b-latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token-intensive approach. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task-specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.