A Survey on Sentence Embedding Models Performance for Patent Analysis

Abstract

Patent data is an important source of knowledge for innovation research, and the technological similarity between pairs of patents is a key enabling indicator for patent analysis. Recently, researchers have been using patent vector space models based on different NLP embedding models to calculate the technological similarity between pairs of patents, in support of understanding innovations, patent landscaping, technology mapping, and patent quality evaluation. To the best of our knowledge, there is no comprehensive survey that builds a big picture of embedding models' performance for calculating patent similarity indicators. Therefore, in this study, we provide an overview of the accuracy of these algorithms based on patent classification performance. In a detailed discussion, we report the performance of the top 3 algorithms at the section, class, and subclass levels. The results based on the first claim of patents show that PatentSBERTa, Bert-for-patents, and TF-IDF Weighted Word Embeddings have the best accuracy for computing sentence embeddings at the subclass level. According to these first results, the performance of the models varies across classes, so researchers in patent analysis can use the results of this study to choose the most suitable model for the specific section of patent data they work with.
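
To make the notion of a patent similarity indicator concrete, the following is a minimal sketch of scoring pairs of first claims by the cosine similarity of their vectors. It is not the pipeline evaluated in the survey: plain TF-IDF bag-of-words vectors stand in for the embedding models named above, and the function names (`tfidf_vectors`, `cosine_similarity`) and the three toy claims are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive even for terms found everywhere.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy first claims (invented for illustration, not taken from real patents).
claims = [
    "a battery cell comprising a lithium anode and a solid electrolyte",
    "a battery pack comprising lithium cells and a cooling plate",
    "a method for encoding video frames using motion vectors",
]

vecs = tfidf_vectors(claims)
sim_battery = cosine_similarity(vecs[0], vecs[1])    # two battery claims
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # battery vs. video claim
print(f"battery pair: {sim_battery:.3f}, unrelated pair: {sim_unrelated:.3f}")
```

In this toy setup the two battery claims should score higher than the battery and video pair. Swapping `tfidf_vectors` for a dense sentence embedding model would leave the cosine step unchanged, apart from operating on arrays instead of dicts.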
