OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs

  • 2025-05-25 15:30:15
  • Debdeep Sanyal, Umakanta Maharana, Yash Sinha, Hong Ming Tan, Shirish Karande, Mohan Kankanhalli, Murari Mandal

Abstract

Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical yet underexplored challenge emerges: can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions? Evaluating this capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce OrgAccess, a synthetic yet representative benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further create three difficulty levels: 40,000 easy (1 permission), 10,000 medium (3-permission tuples), and 20,000 hard (5-permission tuples) cases to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, and that their performance degrades further when navigating interactions involving two or more conflicting permissions. Specifically, even GPT-4.1 achieves an F1-Score of only 0.27 on our hardest benchmark. This demonstrates a critical limitation in LLMs' complex rule-following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.

 
