Abstract
Much effort has gone into using automated approaches, such as natural language processing (NLP), to mine or extract data from free-text medical records in order to build comprehensive patient profiles for delivering better health care. Reusing NLP models in new settings, however, remains cumbersome, requiring iterative validation and/or retraining on new data to achieve convergent results. In this paper, we formally define and analyse the NLP model adaptation problem, particularly in phenotype identification tasks, and identify two common types of unnecessary or wasted effort: duplicate waste and imbalance waste. A distributed representation approach is proposed to represent the language patterns familiar to an NLP model by learning phenotype embeddings from its training data. Computations on these language patterns are then introduced to help avoid or reduce unnecessary effort by combining geometric and semantic similarities. To evaluate the approach, we cross-validate NLP models developed for six physical morbidity studies (23 phenotypes; 17 million documents) on anonymised medical records of the South London Maudsley NHS Trust, United Kingdom. Two metrics are introduced to quantify the reductions in duplicate and imbalance waste. We conducted various experiments on reusing NLP models in four phenotype identification tasks. Our approach can choose the best model for a given new task, identifying up to 76% of mentions that need no validation or model retraining while maintaining very good performance (93-97% accuracy). It can also provide guidance for validating and retraining the model on novel language patterns in new tasks, which can help save around 80% of the effort required by blind model-adaptation approaches.