Relevant Word Order Vectorization for Improved Natural Language Processing in Electronic Healthcare Records

  • 2018-12-06 16:01:13
  • Jeffrey Thompson, Jinxiang Hu, Dinesh Pal Mudaranthakam, David Streeter, Lisa Neums, Michele Park, Devin C. Koestler, Byron Gajewski, Matthew S. Mayo

Abstract

Objective: Electronic health records (EHR) represent a rich resource for conducting observational studies, supporting clinical trials, and more. However, much of the relevant information is stored in an unstructured format that makes it difficult to use. Natural language processing approaches that attempt to automatically classify the data depend on vectorization algorithms that impose structure on the text, but these algorithms were not designed for the unique characteristics of EHR. Here, we propose a new algorithm for structuring so-called free-text that may help researchers make better use of EHR. We call this method Relevant Word Order Vectorization (RWOV).

Materials and Methods: As a proof-of-concept, we attempted to classify the hormone receptor status of breast cancer patients treated at the University of Kansas Medical Center during a recent year, from the unstructured text of pathology reports. Our approach attempts to account for the semi-structured way that healthcare providers often enter information. We compared this approach to the ngrams and word2vec methods.

Results: Our approach resulted in the most consistently high accuracy, as measured by F1 score and area under the receiver operating characteristic curve (AUC).

Discussion: Our results suggest that methods of structuring free text that take into account its context may show better performance, and that our approach is promising.

Conclusion: By using a method that accounts for the fact that healthcare providers tend to use certain key words repetitively and that the order of these key words is important, we showed improved performance over methods that do not.
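
For readers unfamiliar with the baseline and the metric named in the abstract, the following is a minimal sketch of word n-gram count vectorization and of the F1 score. It is not the RWOV algorithm, which the abstract does not specify, and it is not the authors' code. The function names (`word_ngrams`, `ngram_vectorize`, `f1_score`) and the example report snippets are hypothetical, invented only for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vectorize(docs, n=2):
    """Map each document to a count vector over the sorted n-gram vocabulary of docs."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]
    vocab = sorted({gram for c in counts for gram in c})
    vectors = [[c[gram] for gram in vocab] for c in counts]
    return vocab, vectors


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical snippets, not taken from any real pathology report.
docs = ["estrogen receptor positive", "estrogen receptor negative"]
vocab, vectors = ngram_vectorize(docs, n=2)

# Hypothetical labels and predictions for four reports.
score = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
```

Each row of `vectors` counts how often each bigram in `vocab` occurs in the corresponding document. Such vectors record only local word adjacency, which is the kind of baseline representation the abstract compares RWOV against.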

 
