Core Safety Values for Provably Corrigible Agents

  • 2025-11-19 14:51:55
  • Aran Nayebi
  • 0

Abstract

We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.

 

Quick Read (beta)

loading the full paper ...