An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning

Abstract

As the scale and complexity of cloud-based AI systems continue to increase, the detection and adaptive recovery of system faults have become core challenges in ensuring service reliability and continuity. In this paper, we propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates a Large Language Model (LLM) with Deep Reinforcement Learning (DRL) to realize a fault recovery framework with semantic understanding and policy optimization capabilities for cloud AI systems. Building on the traditional DRL-based control model, the proposed method constructs a two-stage hybrid architecture: (1) an LLM-driven fault semantic interpretation module, which dynamically extracts deep contextual semantics from multi-source logs and system indicators to accurately identify potential fault modes; and (2) a DRL recovery strategy optimizer, which learns the dynamic matching between fault types and response behaviors in the cloud environment. The innovation of this method lies in using the LLM for environment modeling and action space abstraction, which greatly improves the exploration efficiency and generalization ability of reinforcement learning. In addition, a memory-guided meta-controller, combined with reinforcement learning replay and an LLM prompt fine-tuning strategy, enables continuous adaptation to new failure modes while avoiding catastrophic forgetting. Experimental results on a cloud fault injection platform show that, compared with existing DRL and rule-based methods, the IFSHM framework shortens system recovery time by 37% in unknown fault scenarios.
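The two-stage flow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: a keyword lookup stands in for the LLM interpreter, a single-step, bandit-style tabular value update stands in for the DRL optimizer, and the fault types, actions, keyword table, and reward rule are all hypothetical.

```python
import random

# Hypothetical fault types, recovery actions, and ground-truth pairing.
FAULT_TYPES = ["memory_leak", "network_partition", "disk_full"]
ACTIONS = ["restart_service", "reroute_traffic", "purge_cache"]
CORRECT = {
    "memory_leak": "restart_service",
    "network_partition": "reroute_traffic",
    "disk_full": "purge_cache",
}

# Hypothetical keyword table; a real system would query an LLM here.
KEYWORDS = {"oom": "memory_leak", "timeout": "network_partition", "no space": "disk_full"}


def interpret_fault(log_line):
    """Stage 1 stand-in: map a raw log line to an abstract fault type."""
    text = log_line.lower()
    for keyword, fault in KEYWORDS.items():
        if keyword in text:
            return fault
    return "unknown"


def train_policy(episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Stage 2 stand-in: epsilon-greedy, single-step tabular value learning."""
    rng = random.Random(seed)
    q = {(f, a): 0.0 for f in FAULT_TYPES for a in ACTIONS}
    for _ in range(episodes):
        fault = rng.choice(FAULT_TYPES)  # simulated fault injection
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(fault, a)])  # exploit
        # Toy reward: +1 if the action resolves the fault, -1 otherwise.
        reward = 1.0 if CORRECT[fault] == action else -1.0
        q[(fault, action)] += alpha * (reward - q[(fault, action)])
    return q


def recover(log_line, q):
    """Run both stages: interpret the log, then pick the highest-valued action."""
    fault = interpret_fault(log_line)
    if fault == "unknown":
        return fault, None  # no learned response for unrecognized faults
    return fault, max(ACTIONS, key=lambda a: q[(fault, a)])


if __name__ == "__main__":
    q_table = train_policy()
    print(recover("kernel: OOM killer invoked", q_table))
```

The point of the sketch is the division of labor: the interpreter reduces free-form logs to a small set of abstract fault states, so the learner only has to explore a compact state-action table rather than raw log text.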
