Leveraging large language models for SQL behavior-based database intrusion detection

Abstract

Database systems are extensively used to store critical data across various domains. However, the frequency of abnormal database access behaviors, such as database intrusion by internal and external attackers, continues to rise. Internal masqueraders often have greater organizational knowledge, which makes it easier for them to mimic employee behavior. In contrast, external masqueraders may behave differently because they are unfamiliar with the organization. Current approaches lack the granularity needed to detect anomalies at the level of individual operations: they frequently classify an entire sequence of operations as anomalous even though most operations in it are likely to be normal. Conversely, some anomalous behaviors closely resemble normal activity, which makes them difficult for existing detection methods to identify. This paper introduces a two-tiered anomaly detection approach for Structured Query Language (SQL) queries based on the Bidirectional Encoder Representations from Transformers (BERT) model, specifically DistilBERT, a more efficient pre-trained variant. Our method combines unsupervised and supervised machine learning techniques to identify anomalous activities accurately while minimizing the need for data labeling. First, the unsupervised tier uses an ensemble of anomaly detectors to flag embedding vectors that lie far from the learned patterns of typical user behavior across the database (out-of-scope queries). Second, the supervised tier uses fine-tuned transformer-based models with role-labeled classification to detect internal attacks with high precision (in-scope queries), even when labeled SQL data is limited. Our findings contribute an effective solution for safeguarding critical database systems against sophisticated threats.
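The two-tiered flow described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the paper's implementation, and every component is an assumed stand-in. `ToyEmbedder` (normalised token counts) replaces DistilBERT embeddings, two distance-based detectors with a simple voting rule replace the ensemble anomaly detectors, and a nearest-centroid classifier replaces the fine-tuned role classifier. The table names, roles, thresholds, and class names are all hypothetical.

```python
import math
import re
from collections import defaultdict


def tokenize(query):
    # Lowercase, split into words / numbers / punctuation, and mask numeric literals.
    tokens = re.findall(r"[a-z_]+|\d+|[^\sa-z_\d]", query.lower())
    return ["<num>" if t.isdigit() else t for t in tokens]


class ToyEmbedder:
    """Hypothetical stand-in for DistilBERT: unit-length token counts plus one 'unknown' slot."""

    def __init__(self, queries):
        vocab = sorted({t for q in queries for t in tokenize(q)})
        self.index = {t: i for i, t in enumerate(vocab)}
        self.dim = len(vocab) + 1  # the last slot collects tokens never seen in training

    def __call__(self, query):
        vec = [0.0] * self.dim
        for t in tokenize(query):
            vec[self.index.get(t, self.dim - 1)] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


class TwoTierDetector:
    def __init__(self, labelled_queries):
        queries = [q for q, _ in labelled_queries]
        self.embed = ToyEmbedder(queries)
        self.normal = [self.embed(q) for q in queries]
        # Tier 1 (unsupervised): two detectors calibrated on normal traffic only.
        self.centroid = mean(self.normal)
        self.centroid_thr = max(dist(v, self.centroid) for v in self.normal)
        self.nn_thr = max(  # leave-one-out nearest-neighbour distance
            min(dist(v, w) for j, w in enumerate(self.normal) if j != i)
            for i, v in enumerate(self.normal)
        )
        # Tier 2 (supervised): one centroid per role label.
        by_role = defaultdict(list)
        for (_, role), v in zip(labelled_queries, self.normal):
            by_role[role].append(v)
        self.role_centroids = {r: mean(vs) for r, vs in by_role.items()}

    def is_out_of_scope(self, v):
        votes = [
            dist(v, self.centroid) > self.centroid_thr,
            min(dist(v, w) for w in self.normal) > self.nn_thr,
        ]
        # Assumed voting rule: flag when at least half of the detectors fire.
        return sum(votes) * 2 >= len(votes)

    def predict_role(self, v):
        return min(self.role_centroids, key=lambda r: dist(v, self.role_centroids[r]))

    def check(self, query, claimed_role):
        v = self.embed(query)
        if self.is_out_of_scope(v):
            return "out_of_scope"   # tier 1: far from all learned normal behavior
        if self.predict_role(v) != claimed_role:
            return "role_mismatch"  # tier 2: in-scope query, but typical of another role
        return "normal"


# Hypothetical role-labelled training queries.
TRAIN = [
    ("SELECT name FROM patients WHERE id = 1", "doctor"),
    ("SELECT diagnosis FROM patients WHERE id = 2", "doctor"),
    ("SELECT amount FROM invoices WHERE id = 3", "accountant"),
    ("UPDATE invoices SET amount = 10 WHERE id = 4", "accountant"),
]
detector = TwoTierDetector(TRAIN)
print(detector.check("SELECT name FROM patients WHERE id = 7", "doctor"))
print(detector.check("SELECT name FROM patients WHERE id = 7", "accountant"))
print(detector.check("DROP TABLE users;", "doctor"))
```

In this toy setup the patient lookup is accepted when issued under the doctor role, reported as a role mismatch when issued under the accountant role, and the `DROP TABLE` statement is rejected by the first tier as out of scope. Each query is judged on its own rather than as part of a sequence, which mirrors the operation-level granularity the abstract argues for.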
