Abstract
Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. As a result, our approach provides a more faithful reflection of real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.