Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora

Abstract

The number of scientific journal articles and reports being published about energetic materials every year is growing exponentially, and therefore extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this work we explore how techniques from natural language processing and machine learning can be used to automatically extract chemical insights from large collections of documents. We first describe how to download and process documents from a variety of sources: journal articles, conference proceedings (including NTREM), the US Patent & Trademark Office, and the Defense Technical Information Center archive on archive.org. We present a custom NLP pipeline which uses open source NLP tools to identify the names of chemical compounds and relates them to function words ("underwater", "rocket", "pyrotechnic") and property words ("elastomer", "non-toxic"). After explaining how word embeddings work, we compare the utility of two popular word embeddings, word2vec and GloVe. Chemical-chemical and chemical-application relationships are obtained by doing computations with word vectors. We show that word embeddings capture latent information about energetic materials, so that related materials appear close together in the word embedding space.
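To illustrate what "doing computations with word vectors" can look like, the following is a minimal sketch of a nearest-neighbor lookup by cosine similarity. It is not the pipeline used in this work: the three-dimensional vectors are made-up toy values chosen for illustration (real word2vec or GloVe vectors are learned from a corpus and typically have hundreds of dimensions), and the names `toy_vectors`, `cosine_similarity`, and `nearest` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, NOT learned from any corpus.
toy_vectors = {
    "RDX": [0.9, 0.1, 0.2],
    "HMX": [0.8, 0.2, 0.1],
    "rocket": [0.1, 0.9, 0.3],
    "elastomer": [0.1, 0.2, 0.9],
}


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest cosine similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    for neighbor, score in nearest("RDX", toy_vectors):
        print(f"{neighbor}: {score:.3f}")
```

With trained embeddings, the same lookup applied to a chemical name would rank other chemicals (chemical-chemical relationships) or function and property words (chemical-application relationships) by their closeness in the embedding space.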
