Abstract
Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models' ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: https://github.com/VILA-Lab/Mobile-MMLU.