Abstract
Objective: Healthcare data fragmentation presents a major challenge for linking patient data, necessitating robust record linkage to integrate patient records from diverse sources. This study investigates the feasibility of leveraging language models for automated patient record linkage, focusing on two key tasks: blocking and matching. Materials and Methods: We used real-world healthcare data from the Missouri Cancer Registry and Research Center, linking patient records from two independent sources with probabilistic linkage as a baseline. A transformer-based model, RoBERTa, was fine-tuned for blocking using sentence embeddings. For matching, several language models were evaluated in fine-tuned and zero-shot settings, and their performance was assessed against ground truth labels. Results: The fine-tuned blocking model achieved a 92% reduction in the number of candidate pairs while maintaining near-perfect recall. In the matching task, fine-tuned Mistral-7B achieved the best performance, with only 6 incorrect predictions. Among zero-shot models, Mistral-Small-24B performed best, with a total of 55 incorrect predictions. Discussion: Fine-tuned language models achieved strong performance in patient record blocking and matching with minimal errors. However, they remain less accurate and less efficient than a hybrid rule-based and probabilistic approach for blocking. Additionally, reasoning models such as DeepSeek-R1 are impractical for large-scale record linkage due to their high computational costs. Conclusion: This study highlights the potential of language models for automating patient record linkage, improving efficiency by eliminating the manual effort that linkage otherwise requires. Overall, language models offer a scalable solution that can enhance data integration, reduce manual effort, and support disease surveillance and research.
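The blocking result above is reported as a reduction in candidate pairs together with recall. The sketch below illustrates how those two quantities are commonly computed; it is not the study's RoBERTa-based pipeline. The records, the field names, the `blocking_metrics` helper, and the birth-year blocking key are all hypothetical stand-ins chosen to keep the example small.

```python
from itertools import product


def blocking_metrics(source_a, source_b, true_matches, block_key):
    """Return (reduction_ratio, recall) for a simple key-based blocker.

    Hypothetical helper: a pair survives blocking when both records
    produce the same blocking key.
    """
    # Without blocking, every record in A is compared with every record in B.
    all_pairs = len(source_a) * len(source_b)

    # Candidate pairs kept by the blocker.
    candidates = {
        (a["id"], b["id"])
        for a, b in product(source_a, source_b)
        if block_key(a) == block_key(b)
    }

    # Fraction of comparisons removed by blocking.
    reduction_ratio = 1 - len(candidates) / all_pairs
    # Fraction of true matches that survive blocking.
    recall = len(candidates & true_matches) / len(true_matches)
    return reduction_ratio, recall


# Invented toy records, not real patient data.
source_a = [
    {"id": "a1", "last": "Smith", "dob": "1960-01-02"},
    {"id": "a2", "last": "Jones", "dob": "1975-05-30"},
    {"id": "a3", "last": "Lee", "dob": "1982-11-11"},
]
source_b = [
    {"id": "b1", "last": "Smyth", "dob": "1960-01-02"},
    {"id": "b2", "last": "Jones", "dob": "1975-05-30"},
    {"id": "b3", "last": "Garcia", "dob": "1990-07-04"},
]
true_matches = {("a1", "b1"), ("a2", "b2")}

# Assumed blocking key: birth year only.
rr, recall = blocking_metrics(
    source_a, source_b, true_matches, block_key=lambda r: r["dob"][:4]
)
print(f"reduction ratio = {rr:.2f}, recall = {recall:.2f}")
```

In an embedding-based blocker, the equality test on a key would be replaced by a nearest-neighbor or similarity-threshold search over record embeddings, but the two metrics would be computed in the same way.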