AWE-CM Vectors: Augmenting Word Embeddings with a Clinical Metathesaurus

  • 2017-12-05 03:11:07
  • Willie Boag, Hassan Kané

Abstract

In recent years, word embeddings have been surprisingly effective at capturing intuitive characteristics of the words they represent. These vectors achieve the best results when training corpora are extremely large, sometimes billions of words. Clinical natural language processing datasets, however, tend to be much smaller. Even the largest publicly available dataset of medical notes is three orders of magnitude smaller than the dataset of the oft-used "Google News" word vectors. In order to make up for limited training data sizes, we encode expert domain knowledge into our embeddings. Building on a previous extension of word2vec, we show that generalizing the notion of a word's "context" to include arbitrary features creates an avenue for encoding domain knowledge into word embeddings. We show that the word vectors produced by this method outperform their text-only counterparts across the board in correlation with clinical experts.
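
The idea of a generalized "context" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `context_pairs`, the `w:` and `c:` prefixes, the window size, and the `concept_lookup` table (a hypothetical stand-in for features drawn from a clinical metathesaurus) are all assumptions made for illustration. The sketch only shows how (word, context) training pairs might mix ordinary neighboring words with knowledge-derived features.

```python
from typing import Dict, Iterator, List, Tuple


def context_pairs(
    tokens: List[str],
    concept_lookup: Dict[str, List[str]],
    window: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (word, context) pairs for one tokenized sentence.

    Contexts come from two sources:
      * neighboring words within `window` positions (prefixed "w:")
      * features from `concept_lookup`, a hypothetical word-to-concept
        table standing in for expert domain knowledge (prefixed "c:")
    """
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        # Ordinary text-based contexts: the surrounding words.
        for j in range(start, stop):
            if j != i:
                yield word, "w:" + tokens[j]
        # Knowledge-based contexts: arbitrary features attached to the word.
        for concept in concept_lookup.get(word, []):
            yield word, "c:" + concept


# Illustrative input; "CONCEPT_A" is a placeholder, not a real identifier.
sentence = ["patient", "denies", "dyspnea"]
lookup = {"dyspnea": ["CONCEPT_A"]}
pairs = list(context_pairs(sentence, lookup, window=1))
```

Under these assumptions, the resulting pairs could be passed to any embedding trainer that accepts arbitrary (word, context) input, so that words sharing a concept feature are pulled toward each other even when they rarely co-occur in the text.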

 
