Trust-Oriented Adaptive Guardrails for Large Language Models

  • 2025-02-03 16:03:18
  • Jinwei Hu, Yi Dong, Xiaowei Huang

Abstract

Guardrails, an emerging mechanism designed to ensure that large language models (LLMs) align with human values by moderating harmful or toxic responses, require a sociotechnical approach in their design. This paper addresses a critical issue: existing guardrails lack a well-founded methodology to accommodate the diverse needs of different user groups, particularly concerning access rights. Supported by trust modeling (primarily the 'social' aspect) and enhanced with online in-context learning via retrieval-augmented generation (the 'technical' aspect), we introduce an adaptive guardrail mechanism that dynamically moderates access to sensitive content based on user trust metrics. User trust metrics, defined as a novel combination of direct interaction trust and authority-verified trust, enable the system to precisely tailor the strictness of content moderation by aligning with the user's credibility and the specific context of their inquiries. Our empirical evaluation demonstrates the effectiveness of the adaptive guardrail in meeting diverse user needs, outperforming existing guardrails while securing sensitive information and precisely managing potentially hazardous content through a context-aware knowledge base. To the best of our knowledge, this work is the first to introduce a trust-oriented concept into a guardrail system, offering a scalable solution that enriches the discourse on the ethical deployment of next-generation LLM services.

 
