Abstract
Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and Communications in Medicine (DICOM) files presents a significant barrier to the ethical and secure sharing of imaging datasets. This paper presents a hybrid de-identification framework developed by Impact Business Information Solutions (IBIS) that combines rule-based and AI-driven techniques with rigorous uncertainty quantification for comprehensive PHI/PII removal from both metadata and pixel data.

Our approach begins with a two-tiered rule-based system targeting explicit and inferred metadata elements, further augmented by a large language model (LLM) fine-tuned for Named Entity Recognition (NER) and trained on a suite of synthetic datasets simulating realistic clinical PHI/PII. For pixel data, we employ an uncertainty-aware Faster R-CNN model to localize embedded text, extract candidate PHI via Optical Character Recognition (OCR), and apply the NER pipeline for final redaction. Crucially, uncertainty quantification provides confidence measures for AI-based detections, which enhances the reliability of automation and enables informed human-in-the-loop verification to manage residual risks.

This uncertainty-aware de-identification framework achieves robust performance across benchmark datasets and regulatory standards, including DICOM, HIPAA, and TCIA compliance metrics. By combining scalable automation, uncertainty quantification, and rigorous quality assurance, our solution addresses critical challenges in medical data de-identification and supports the secure, ethical, and trustworthy release of imaging data for research.
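
To make the human-in-the-loop step concrete, the following is a minimal sketch of how confidence measures could route AI-based detections to automatic redaction or manual review. It is an illustration under stated assumptions, not the framework's implementation: the abstract does not specify the uncertainty measure, so the sketch assumes each detection carries several PHI probabilities from repeated stochastic forward passes and uses the binary entropy of their mean as the uncertainty score. The names `Detection`, `triage`, `phi_threshold`, and `max_entropy`, and both threshold values, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """A candidate PHI/PII string (hypothetical structure for illustration)."""
    text: str
    # Assumed: P(PHI) from several stochastic forward passes of the model.
    phi_probs: list[float]


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p): 0 when certain, 1 when p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def triage(det: Detection, phi_threshold: float = 0.5,
           max_entropy: float = 0.5) -> str:
    """Return 'redact', 'keep', or 'review' for one detection.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    if not det.phi_probs:
        raise ValueError("detection has no probability samples")
    mean_p = sum(det.phi_probs) / len(det.phi_probs)
    # High predictive entropy: the model is unsure, so defer to a human.
    if binary_entropy(mean_p) > max_entropy:
        return "review"
    # Low entropy: act automatically on the confident prediction.
    return "redact" if mean_p >= phi_threshold else "keep"


if __name__ == "__main__":
    samples = [
        Detection("JOHN DOE", [0.97, 0.99, 0.98]),
        Detection("L", [0.01, 0.02, 0.03]),
        Detection("SMITH LAT", [0.40, 0.60, 0.55]),
    ]
    for det in samples:
        print(det.text, "->", triage(det))
```

In this sketch only confident predictions are acted on automatically, and everything else is queued for a reviewer. Lowering `max_entropy` would send more detections to review, trading automation for lower residual risk.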