Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. First, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Second, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Third, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To reduce the high cost of manual labor, we design a scalable, semi-automated method to construct the benchmark. For task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide an empirical analysis of the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and it offers a path forward for future research in this area. The environment, benchmark, and evaluation code for AndroidArena are released at https://github.com/AndroidArenaAgent/AndroidArena.