A Deep Patent Landscaping Model using Transformer and Graph Embedding

  • 2019-11-14 17:43:42
  • Seokkyu Choi, Hyeonju Lee, Eunjeong Lucy Park, Sungchul Choi

Abstract

Patent landscaping is a method that is employed for searching related patents during the process of a research and development (R&D) project. To avoid the risk of patent infringement and to follow the current trends of technology development, patent landscaping is a crucial task that needs to be conducted during the early stages of an R&D project. Because the process of patent landscaping requires several advanced resources and can be tedious, the demand for automated patent landscaping is gradually increasing. However, the shortage of well-defined benchmarking datasets and comparable models makes it difficult to find related research studies. In this paper, we propose an automated patent landscaping model based on deep learning. The proposed model comprises a modified transformer structure for analyzing textual data present in patent documents and a graph embedding method using a diffusion graph, called Diff2Vec, for analyzing patent metadata. Four patent landscaping benchmarking datasets, which were produced by querying Google BigQuery based on search formulas made by Korean patent attorneys, are proposed for comparing related research studies. The obtained results indicate that the proposed model with the datasets can attain state-of-the-art performance compared with current patent landscaping models.

 

Seokkyu Choi, Hyeonju Lee, Eunjeong Park, Sungchul Choi
TEAMLAB, Department of Industrial and Management Engineering, Gachon University, Seongnam-si, Gyeonggi-do, Republic of Korea
NAVER Corp., Seongnam-si, Gyeonggi-do, Republic of Korea

keywords:
Patent landscaping, Deep learning, Transformer, Graph embeddings, Patent classification

1 Introduction

A patent is a significant deliverable of research and development (R&D) projects. A patent protects an assignee’s legal rights and also reflects current technology development trends. To study technological trends and identify patents that pose a potential infringement risk, the majority of R&D projects conduct patent landscaping, which involves collecting and analyzing patent documents related to the project (bubela2013patent; Wittenburg2015-hv; Abood2018-fd).

Generally, patent landscaping is a human-centric, tedious, and expensive process (trippe2015guidelines; Abood2018-fd). Researchers and patent attorneys query large patent databases with candidate keywords, eliminate unrelated patent documents, and extract only the valid patent documents related to their project (Yang2010-mi; Wittenburg2015-hv). However, because the participants must also be familiar with the scientific and technical domain, their involvement is costly. Furthermore, the patent landscaping task has to be repeated regularly, every week or month, while a project is in progress in order to find newly published patents.

In this paper, we propose a supervised deep learning model for patent landscaping. The proposed model aims to eliminate this repetitive and inefficient process by employing deep learning-based classification. It incorporates a modified transformer structure (NIPS2017_7181) and a graph embedding method using a diffusion graph, called Diff2Vec (10.1007/978-3-319-73198-8_9). Since a patent document contains several textual features as well as bibliometric data, the modified transformer structure is applied to the textual data and Diff2Vec is applied to the bibliometric data fields, which are represented as graphs.

Additionally, because we also aim to contribute research resources to machine learning-based patent landscaping research, we propose benchmarking datasets for patent landscaping. Owing to issues such as high cost and data security, appropriate benchmarking datasets for the patent landscaping task have not been openly available. The proposed benchmarking datasets are based on the patent trend reports of the Korea Intellectual property STrategy Agency (KISTA, https://www.kista.re.kr/), which are written by human experts such as patent attorneys. We build the benchmarking datasets from Google BigQuery by using the keyword queries and the expert-filtered valid patents given in the KISTA patent trend reports. The experimental results indicate that the proposed model with the benchmarking datasets outperforms other existing classification models, and the average classification accuracy for each dataset can be improved by approximately 15%.

2 Patent landscaping

Figure 1: The general process of patent landscaping

The entire patent landscaping process is shown in Figure 1. First, technology keyword candidates for the target technology area are extracted to form a search formula, or query, for patent documents. Because many assignees do not want their patents to be discovered easily, in order to gain an advantage in infringement disputes that may arise, they tend to write patent titles and abstracts very generically or omit technical details (tseng2007text). Considering this, a complicated search formula should be created to extract as many relevant patent candidates as possible (magdy2011study). The form of a search formula depends on the patent search system that performs the search. For example, the search query for underwater vehicle device might be created as shown in the box below.

((( virtual* or augment* or mixed* ) or ( real* or environment* or space )) or ( augment* and real* )) and ( (( offshore* or off-shore* or ocean ) or ( plant* or platform* )) or ship* or dock* or carrier or vessel or marine or boat* or drillship or ( drill or ship ) or FPSO or ( float* or ( product* or storag* )) or FPU or LNG or FSRU or OSV or aero* or airplane or aircraft or construction or ( civil or engineer* ) or bridge or building or vehicle or vehicular or automotive or automobile )

As Figure 1 shows, most parts of the process are conducted manually by experts who have a technical understanding, and some parts of the process are repeated. The primary focus of this paper is the regularly repeated loop that goes from valid patent selection back to search query formulation. Once the search formula is created, it is necessary to track newly published patents regularly using a similar search formula. Because the first valid patent selection is similar to creating a training dataset for supervised learning, this repetitive task can be addressed with text classification. Because these repetitive tasks require a lot of unnecessary effort from experts, there is a high possibility of improving them with a machine learning approach.

This paper is not the first study of machine learning-based patent landscaping. A representative example is the Automated Patent Landscaping (APL) study by Abood2018-fd. They compose a dataset for patent landscaping using seed patents created by patent law experts and then apply a neural network architecture to classify the seed patents within the composed data. One point of interest is how they organized the datasets for learning. Their method for composing a dataset is the expansion of related patents: experts designate key patent documents for each technology area as seed patents, and the patent dataset is then expanded heuristically from the seed patents by using Cooperative Patent Classification (CPC) and family patent information.

Although APL was a meaningful study that opened the possibility of machine learning-based patent landscaping, it has problems regarding comparable benchmarking datasets. First, it does not suggest a comparable set of benchmarking data. Because the dataset it proposes is generated heuristically, a model trained on it may simply learn that heuristic. Such a dataset differs from one generated by human experts, which makes it difficult to build a model that can replace the intellectual work of human patent analysts. In addition, the dataset of the study collected patents in very broad and common technology fields such as “Machine Learning” and “IoT.” Typical patent landscaping is conducted on very specific technologies because it is driven by projects of companies or research laboratories. We believe these differences make it difficult to apply APL’s approach to actual patent landscaping tasks.

In addition to APL, research on machine learning-based patent classification has been going on steadily (sureka2009semantic; chen2011ipc; lupu2013patent). In recent studies, shalaby2018lstm suggested a model for International Patent Classification (IPC) code classification using long short-term memory (LSTM, doi:10.1162/neco.1997.9.8.1735), and li2018deeppatent proposed a model based on a text convolutional neural network (text-CNN). However, the biggest weakness of these studies is likewise the lack of a suitable benchmarking dataset. Unlike actual patent landscaping, IPC classification studies predict a suitable IPC code for each patent, but such codes are already assigned to all patents by assignees and patent examiners. This is therefore not a useful application for patent landscaping in the real world.

3 KISTA Datasets for patent landscaping

First, we build datasets using the KISTA patent map reports. The detailed flowchart is shown in Figure 2.

Figure 2: The general process of patent landscaping

3.1 Data sources

We provide a benchmarking dataset for patent landscaping based on the KISTA patent trend reports (http://biz.kista.re.kr/patentmap/). Every year, the Korean Intellectual Property Office (KIPO, https://www.kipo.go.kr/) publishes more than 100 patent landscaping reports through KISTA. In particular, most reports disclose the valid patent list together with the patent search query, which makes it possible to validate their results (most of the search queries were written for WIPS, https://www.wipson.com, a local Korean patent database service). Currently, more than 2,500 reports are disclosed. The technologies covered in the reports are specific, concrete, and sometimes combine several fields. We have constructed datasets for the four technologies listed in Table 1.

Dataset | Full name | Important keywords
MPUART | Marine Plant Using Augmented Reality Technology | hmd, photorealistic, georegistered
1MWDFS | Technology for 1MW Dual Frequency System | reverse conductive, mini dipole
MRRG | Technology for Micro Radar Rain Gauge | klystron, bistatic, frequencyagile
GOCS | Technology for Geostationary Orbit Complex Satellite | rover, pgps, pseudolites
Table 1: Patent landscaping benchmarking dataset

3.2 Data acquisition

To ensure the reproducibility of building the patent datasets, we have built the benchmarking datasets using the Google BigQuery public datasets. Most of the patent data in the KISTA reports were retrieved with search queries written for the local Korean patent database service WIPS. We first constructed a Python module that converts a WIPS query into a Google BigQuery query, then extracted the patent dataset from BigQuery, and finally marked the valid patents among the extracted patents. In a patent search, different datasets can be extracted depending on the publication date range and the database searched. Therefore, we excluded queried patents published after the original publication date stated in the report. The BigQuery search queries we used for patent retrieval are given in Appendix A.
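The conversion step can be illustrated with a minimal sketch. The function names (`term_to_clause`, `wips_to_bigquery`), the choice of `description.text` as the searched field, and the translation of a trailing `*` into the regular expression `\w*` are assumptions made for illustration; the authors' actual module is not shown in the paper, and the patterns it emits may differ from these.

```python
import re

FIELD = "description.text"  # assumed target column, not confirmed by the paper


def term_to_clause(term: str, field: str = FIELD) -> str:
    """Translate one WIPS keyword into a BigQuery REGEXP_CONTAINS clause.

    A trailing '*' is treated as a word-prefix wildcard (hypothetical mapping).
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return f"REGEXP_CONTAINS({field}, r'{pattern}')"


def wips_to_bigquery(query: str, field: str = FIELD) -> str:
    """Convert a boolean WIPS-style formula, keeping its parentheses."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    out = []
    for tok in tokens:
        if tok in ("(", ")"):
            out.append(tok)
        elif tok.lower() in ("and", "or"):
            out.append(tok.upper())
        else:
            out.append(term_to_clause(tok, field))
    return " ".join(out)


print(wips_to_bigquery("( augment* and real* )"))
```

The resulting string would be placed in the `WHERE` clause of a query against the patent table; the parenthesization of the original formula is preserved token by token.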

3.3 Dataset description

In general, the search keywords used in patent retrieval are broad and common words. This is because patent assignees purposely write their patents in plain language so that competitors cannot find them easily (su10103729). As a result, keyword-based patent retrieval returns a large number of patent documents, and experts exclude the unrelated ones during the patent landscaping process.

We searched for USPTO patents in four technology areas using the search queries mentioned above. As a result, more than a million patent documents were retrieved in three of the four technology domains. Among the retrieved patent documents, we designate as “valid patents” those marked as related to the technology area in the KISTA report. In terms of the classification problem, a “valid patent” corresponds to the positive (true) label. The number of valid patents is less than 1,000 in all domains. This is typical imbalanced data, in which the number of retrieved results is very large compared to the number of valid patents. We retrieve the patent information, including metadata, from BigQuery and indicate whether or not each patent is valid. The final composed datasets are summarized in Table 2.

Dataset name | # of patents | # of valid patents | Data URL
MPUART | 1,469,741 | 468 | https://bit.ly/343JSD8
1MWDFS | 1,774,132 | 927 | https://bit.ly/2Wk7kJI
MRRG | 2,068,566 | 225 | https://bit.ly/2BTdKGe
GOCS | 294,636 | 653 | https://bit.ly/31VBc07
Table 2: Summary of proposed datasets

3.4 CPC based heuristic approach for undersampling

Since the retrieved datasets are extremely imbalanced, a model trained on them shows deficient classification performance. To handle this problem, we organize new datasets with an undersampling approach. In general, patent experts use CPC or IPC codes to eliminate unrelated patents in the first step of patent landscaping. Because of this characteristic of patents, we use CPC information to build the undersampled datasets. First, we split the valid patents into training, validation, and test sets with a ratio of 6:2:2. Next, negative samples, i.e., patents that are not valid, are extracted from the entire retrieved search result.

We designate as negative samples the patents that do not contain any important CPC of the valid patents. An important CPC is a CPC that appears in 0.5% or more of the valid patents of a technology area and whose occurrence ratio in the valid patent set is more than 50 times its occurrence ratio in the entire USPTO patent set. This method is the reverse of Abood2018-fd’s method, which increases the number of patents involved. In our experiments, 0.5% was the minimum ratio at which no valid patents were excluded. The number of important CPCs used for the undersampled datasets is given in Table 3. The sampled datasets are shown in Table 4.
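The selection rule for important CPCs can be sketched as follows. This is an illustrative reading of the rule, assuming that the "ratio" of a code is the share of patents carrying it; the function names and the toy data are hypothetical, and `all_patents` stands in for the entire USPTO patent set.

```python
from collections import Counter


def important_cpcs(valid_patents, all_patents, min_share=0.005, min_lift=50.0):
    """Return CPC codes that are frequent in the valid set and rare overall.

    valid_patents / all_patents: lists of sets of CPC codes, one set per patent.
    min_share: minimum share of valid patents carrying the code (0.5%).
    min_lift: required ratio of the valid-set share to the corpus-wide share.
    """
    n_valid, n_all = len(valid_patents), len(all_patents)
    valid_counts = Counter(c for codes in valid_patents for c in set(codes))
    all_counts = Counter(c for codes in all_patents for c in set(codes))
    selected = set()
    for code, count in valid_counts.items():
        share_valid = count / n_valid
        share_all = all_counts.get(code, 0) / n_all
        if share_valid >= min_share and share_all > 0 and share_valid / share_all > min_lift:
            selected.add(code)
    return selected


def is_negative_candidate(codes, important):
    """A patent qualifies as a negative sample if it has no important CPC."""
    return not (set(codes) & important)


# Toy data: code "A" is concentrated in the valid patents, "B" is common everywhere.
valid = [{"A"}, {"A", "B"}]
corpus = valid + [{"B"}] * 198
important = important_cpcs(valid, corpus)
negatives = [codes for codes in corpus if is_negative_candidate(codes, important)]
print(important, len(negatives))
```

In the toy data, "A" appears in all valid patents but in only 1% of the corpus, so it passes both thresholds, whereas "B" is too common in the corpus to be selected.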

Dataset name | # of CPCs in valid patent set | # of important CPCs
MPUART | 1081 | 147
1MWDFS | 2543 | 145
MRRG | 611 | 217
GOCS | 1269 | 179
Table 3: The number of important CPCs in valid patents
Dataset name | # of train | # of validation | # of test | # of positive (train:validation:test)
MPUART | 50,280 | 10,094 | 10,094 | 280:94:94
1MWDFS | 50,556 | 10,185 | 10,186 | 556:185:186
MRRG | 50,135 | 10,045 | 10,045 | 135:45:45
GOCS | 50,391 | 10,131 | 10,131 | 391:131:131
Table 4: Summary of sampled datasets
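As a quick arithmetic check, the positive counts in Table 4 can be reproduced from the valid patent counts in Table 2 with a simple 6:2:2 rule. The rounding rule below (floor for the training share, then halving the remainder) is an assumption that happens to match the table; the paper does not state how rounding was done. The set totals in Table 4 are also consistent with 50,000 negatives in each training set and 10,000 in each validation and test set.

```python
def split_counts(n_valid: int):
    """Split n_valid positives 6:2:2 using integer arithmetic (assumed rounding)."""
    n_train = n_valid * 6 // 10
    rest = n_valid - n_train
    n_val = rest // 2
    n_test = rest - n_val
    return n_train, n_val, n_test


# Valid patent counts taken from Table 2.
for name, n in [("MPUART", 468), ("1MWDFS", 927), ("MRRG", 225), ("GOCS", 653)]:
    print(name, split_counts(n))
```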

4 Deep Patent Landscaping Model

4.1 Model overview

Our proposed deep patent landscaping model is composed of two parts, as shown in Figure 3: a transformer encoder (NIPS2017_7181) and a graph embedding using a diffusion graph, called Diff2Vec (10.1007/978-3-319-73198-8_9). The model contains a layer that concatenates the embedding vectors, followed by stacked neural network layers that classify valid patents. A patent, which is a scientific document, contains textual data and metadata fields called bibliometric information. We convert these base features into embedding spaces that suit the characteristics of each feature and then train the neural network on them.

Figure 3: The architecture of the deep patent landscaping model

4.2 Base features

To build the proposed valid patent classifier, appropriate features must be selected from a patent document. Patents have a variety of features. Text data and metadata are the two representative kinds of features that can be used in a classification model.

Text data includes the title, abstract, description of the invention, and claims. The description is a lengthy explanation of the invention, and the claims define the legal rights of the invention. Both are rather complicated and contain too much detail. Thus, the title and abstract, which describe the invention in general terms, are commonly used in patent classification models (ZHANG20161108; CHEN201763; li2018deeppatent; shalaby2018lstm).

The metadata contains technology classification codes, assignees, inventors, citations, and so on. The information on inventors and assignees is extensive, and names may be misspelled or shared by different entities, so these fields are not appropriate as features for the classification model. A further problem is that the number of distinct values of these features grows as new patents continue to be published. Therefore, studies on patent classification have consistently used technology classification codes. IPC and CPC are the typical technology classification codes used by patent offices all over the world (CHEN2011309; Benson2015; doi:10.1002/asi.23664; WU2016305; PARK2017170; SUOMINEN2017131). Some countries also have their own classification codes, such as USPC in the US and F-term in Japan. Because this research targets the USPTO dataset, we use IPC, CPC, and USPC as the base metadata features.

In summary, we use the abstract as the text feature and IPC, CPC, and USPC as the metadata features. To train on these features, we apply an encoding process suited to the characteristics of each feature.

4.3 Diff2Vec for metadata embeddings

We build embeddings of the technology codes, which are metadata, to use them as an input source for the proposed model. The metadata, IPC, CPC, and USPC, are represented as technology code information as shown in Table 5. Each technology classification system contains more than about 70,000 classification codes. Let $P=\{p_1,p_2,\ldots,p_n\}$ be the set of patent documents, where $n$ is the total number of patents in $P$. One document contains one or more technology codes, and we define three sets, IPC, CPC, and USPC, each with its own classification codes: $IPC=\{ipc_1,ipc_2,\ldots,ipc_{m_{ipc}}\}$, $CPC=\{cpc_1,cpc_2,\ldots,cpc_{m_{cpc}}\}$, and $USPC=\{uspc_1,uspc_2,\ldots,uspc_{m_{uspc}}\}$. We define $m_x$ as the total number of classification codes in system $x \in \{ipc, cpc, uspc\}$. One patent document can have multiple classification codes. For example, if $p_{32}$ has $ipc_5$, $ipc_{102}$, and $ipc_{764}$, then we write $p_{32}^{IPC}=\{ipc_5,ipc_{102},ipc_{764}\}$. We use the information that technology codes appear together in a single patent to create a co-occurrence matrix and express it as a graph. The transformation process for building the co-occurrence graph is shown in Figure 4.

Code | Full name | Examples
IPC | International Patent Classification | E21B33/129, E21B43/11, E21B34/06
USPC | United States Patent Classification | 362/225., 362/230., 315/294.
CPC | Cooperative Patent Classification | Y02E40/642, H01L39/2419, Y10T29/49014
Table 5: Full name of classification codes
Figure 4: The process of transforming a technology code to a co-occurrence graph
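The transformation from per-patent code sets to a co-occurrence graph can be sketched in a few lines. The function name `cooccurrence_edges` and the dictionary layout are hypothetical; the example reuses the $p_{32}$ case from the text and adds an invented second patent to show how edge weights accumulate.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(patent_codes):
    """Count how often two codes appear together in the same patent.

    patent_codes: dict mapping a patent id to an iterable of classification codes.
    Returns a Counter keyed by sorted code pairs; each count is an edge weight.
    """
    edges = Counter()
    for codes in patent_codes.values():
        for a, b in combinations(sorted(set(codes)), 2):
            edges[(a, b)] += 1
    return edges


patents = {
    "p32": ["ipc5", "ipc102", "ipc764"],  # example from the text
    "p33": ["ipc5", "ipc102"],            # invented second patent
}
edges = cooccurrence_edges(patents)
print(edges)
```

Each key is one undirected edge of the graph, so a separate graph can be built per classification system (IPC, CPC, USPC) by calling the function on the corresponding code sets.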

After transforming the metadata into a graph representation, we apply Diff2Vec (10.1007/978-3-319-73198-8_9) to the graph so that it can be fed into the proposed neural network model. Diff2Vec is a graph embedding method based on Word2Vec (NIPS2013_5021). It uses a diffusion process to extract a subgraph around each node, called a diffusion graph. The subgraph grows by repeatedly selecting a random node already in the subgraph and diffusing to one of its randomly chosen neighbors. An Euler tour is then applied to the diffusion graph to generate a node sequence, and the sequences generated by the Euler tours are used to train a Word2Vec layer. We set the length of diffusion to 40 and the number of diffusions per node to 10. According to the experiments reported for Diff2Vec, it scales better as the graph’s density increases, and the embedding preserves graph distances with high accuracy. In our model architecture, we use pretrained Diff2Vec embeddings for the embedding layers of the three classification codes. We average the embedding vectors of the codes of one patent to combine its graph information and then process the averaged vector with a dense layer. The dense layer for CPC has 256 units, twice the Diff2Vec embedding size, and the layers for the other codes have 128 units. Because CPC is the most granular classification code, we wanted to use more information from CPC than from the other codes. The detailed pretraining process for the metadata is shown in Figure 5.

Figure 5: The pretraining process of metadata graph embeddings
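The sequence-generation part of this pretraining can be sketched in plain Python. This is a simplified illustration of the diffusion-graph idea and not the reference Diff2Vec implementation: the function names are hypothetical, the graph is assumed to be an undirected adjacency dictionary, and the Euler tour is obtained by walking every edge of the diffusion tree once in each direction. Training Word2Vec on the resulting sequences is omitted.

```python
import random
from collections import deque


def component_size(adj, start):
    """Number of nodes reachable from start (bounds the diffusion size)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)


def diffusion_tree(adj, start, size, rng):
    """Grow a tree by diffusing from random tree nodes to random neighbors."""
    children = {start: []}
    nodes = [start]
    target = min(size, component_size(adj, start))
    while len(nodes) < target:
        u = rng.choice(nodes)
        w = rng.choice(adj[u])
        if w not in children:  # only new nodes extend the tree
            children[u].append(w)
            children[w] = []
            nodes.append(w)
    return children


def euler_tour(children, root):
    """Node sequence that traverses each tree edge once in each direction."""
    seq = [root]
    for child in children[root]:
        seq.extend(euler_tour(children, child))
        seq.append(root)
    return seq


def diffusion_sequences(adj, size=40, per_node=10, seed=0):
    """Defaults follow the text: diffusion length 40, 10 diffusions per node."""
    rng = random.Random(seed)
    return [euler_tour(diffusion_tree(adj, node, size, rng), node)
            for node in adj for _ in range(per_node)]


# Toy 5-node cycle; real input would be the code co-occurrence graph.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
seqs = diffusion_sequences(adj, size=3, per_node=2, seed=1)
print(seqs[0])
```

A diffusion tree with k nodes yields a sequence of 2k - 1 node visits, and these sequences play the role of sentences when the Word2Vec layer is trained.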

4.4 Transformer architecture for text data

Another core building block of our model is the transformer layer for text data. To handle text data, we first extract the abstract of each patent, split it into tokens, and build token embeddings using Word2Vec (NIPS2013_5021). When tokenizing the abstract text, we insert the tag [CLS] at the beginning and the tag [SEP] at the end. We then feed the embeddings to the transformer encoder (NIPS2017_7181) to learn a latent space for the patent abstract. We stack the encoder layer 6 times. We use multi-head self-attention and scaled dot-product attention as in the original transformer encoder, without modification. We set the number of attention heads to 8, the sequence length to 128, and the hidden size to 512.
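The input preparation described above can be sketched as follows. Whitespace tokenization, the unknown-token id, and the truncation rule that always keeps [SEP] are assumptions for illustration; the padding value 0 and the sequence length 128 follow the settings given in the paper.

```python
def encode_abstract(text, vocab, max_len=128, pad_id=0, unk_id=1):
    """Turn an abstract into a fixed-length list of token ids.

    [CLS] is placed first and [SEP] last; shorter inputs are padded with pad_id.
    vocab: dict mapping tokens to ids (hypothetical, e.g. from Word2Vec training).
    """
    tokens = ["[CLS]"] + text.lower().split()
    tokens = tokens[: max_len - 1] + ["[SEP]"]  # truncate but keep [SEP]
    ids = [vocab.get(tok, unk_id) for tok in tokens]
    return ids + [pad_id] * (max_len - len(ids))


vocab = {"[CLS]": 2, "[SEP]": 3, "a": 4, "patent": 5}
ids = encode_abstract("A patent claim", vocab)
print(ids[:6])
```

The word "claim" is not in the toy vocabulary, so it is mapped to the unknown id before padding is applied.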

4.5 Training and inference phase

Finally, we combine the embedding vectors from the metadata and the text data by concatenating them and feed the result into a simple Multi-Layer Perceptron (MLP). To concatenate the output of the transformer with the classification code embedding vectors, we adopted the squeeze technique from BERT (devlin-etal-2019-bert) and converted the matrix of shape (sequence length, embedding size) into a vector of shape (embedding size) by taking the [CLS] position. To classify whether a target patent is valid or not, we use binary cross-entropy as the loss on the last layer.
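A framework-free sketch of the pooling, concatenation, and loss steps is given below. The helper names are hypothetical and the MLP itself is omitted; the vector sizes (512 for text, 256 for CPC, 128 each for IPC and USPC) follow the settings stated in the paper, and the resulting width of 1024 is simply their sum.

```python
import math


def pool_cls(sequence_output):
    """Squeeze (sequence length, hidden) to (hidden) by taking position 0, i.e. [CLS]."""
    return sequence_output[0]


def concat_features(text_vec, code_vecs):
    """Concatenate the text vector with the per-code metadata vectors."""
    features = list(text_vec)
    for vec in code_vecs:
        features.extend(vec)
    return features


def binary_cross_entropy(y_true, p_pred, eps=1e-8):
    """Mean binary cross-entropy; probabilities are clipped for numerical safety."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Dummy transformer output: 128 positions, each a 512-dimensional vector.
sequence_output = [[0.1] * 512 for _ in range(128)]
cpc_vec, ipc_vec, uspc_vec = [0.0] * 256, [0.0] * 128, [0.0] * 128
features = concat_features(pool_cls(sequence_output), [cpc_vec, ipc_vec, uspc_vec])
print(len(features))
```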

5 Experiments

5.1 Dataset

We measured the performance of the proposed model on the classification of valid patents in the four KISTA datasets. More than half of the datasets contain over one million patents. Such large datasets contain patents that match the keywords of the search formula but also many noisy patents that are outside the domain. In addition, extracting embeddings from these datasets and using them for model training requires substantial computing resources. We therefore use high-frequency CPC codes for heuristic sampling to filter out noisy data.

5.2 Hyperparameter settings

In the default transformer (NIPS2017_7181) configuration, 6 encoder layers were stacked with 8 attention heads. An alternative configuration consists of 12 encoder layers and 4 attention heads. The number of training epochs, the batch size, the optimizer, the learning rate, and the epsilon were 20, 64, Adam (2014arXiv1412.6980K), 0.0001, and 1e-8, respectively. We set the sequence length, which is the maximum length of the input sentence, to 128 and padded shorter inputs with 0. As a result, a 512-dimensional embedding vector was extracted for each word.
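For reference, the settings above can be collected in a small configuration object. The class and field names are hypothetical; only the values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text (field names are illustrative)."""
    num_layers: int = 6          # stacked encoder layers
    num_heads: int = 8           # attention heads per layer
    hidden_size: int = 512       # embedding size per token
    max_seq_len: int = 128       # inputs are padded with 0 up to this length
    epochs: int = 20
    batch_size: int = 64
    learning_rate: float = 1e-4  # Adam
    adam_epsilon: float = 1e-8


BASE = TrainConfig()                          # 6 layers, 8 heads
DEEP = TrainConfig(num_layers=12, num_heads=4)  # 12 layers, 4 heads
print(BASE.hidden_size // BASE.num_heads, DEEP.hidden_size // DEEP.num_heads)
```

With a hidden size of 512, the two configurations correspond to per-head dimensions of 64 and 128, respectively.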

5.3 Evaluation metric

We used the average precision and the F1 score, which are commonly used in binary classification problems with imbalanced datasets, as evaluation metrics. We use APL (Abood2018-fd) and Word2Vec- and Diff2Vec-based classifiers as comparison models (we modified the APL code so that it works on our datasets).
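Both metrics can be written out in a few lines. The sketch below uses one common definition of average precision (the mean of the precision values at the rank of each positive example) and the standard F1 formula; it is not the authors' evaluation code, and library implementations may treat tied scores differently.

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 0]
print(average_precision(y_true, [0.9, 0.8, 0.7, 0.1]))
print(f1_score(y_true, [1, 1, 0, 0]))
```

Average precision is computed from the ranking of scores and needs no decision threshold, whereas F1 is computed from thresholded predictions.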

6 Results of experiments

6.1 Overall results

Our model considers two features of a patent document: metadata and text data. We conduct experiments with the proposed model to check how each of these features affects classification performance. For the metadata, we examine how CPC, IPC, and USPC each affect performance. IPC is an internationally unified patent classification system with five hierarchical levels and consists of approximately 70,000 codes. USPC is a US patent classification system that classifies patents based on their claims and consists of about 150,000 codes. CPC is the latest patent classification system; it reflects new classifications that follow technology development, is more detailed than IPC, was developed based on ECLA and USPC, and consists of about 260,000 codes. We also examine how the transformer configuration for text data affects classification performance. We compare the classification performance of our model with APL, which is the latest deep learning model for patent landscaping. The experimental results show that our model outperforms all other models and that the model performs well even when only classification codes are used. The overall results are shown in Table 6.

Dataset | TRF+DIFF AP | TRF+DIFF F1 | TRF AP | TRF F1 | DIFF AP | DIFF F1 | APL AP | APL F1
MPUART | 0.6552 | 0.8025 | 0.4746 | 0.6684 | 0.6045 | 0.7711 | 0.3028 | 0.5340
1MWDFS | 0.566 | 0.7438 | 0.4527 | 0.6564 | 0.5429 | 0.7285 | 0.4155 | 0.6055
MRRG | 0.6871 | 0.823 | 0.4960 | 0.6988 | 0.6792 | 0.8208 | 0.2065 | 0.4086
GOCS | 0.4286 | 0.6467 | 0.3742 | 0.5966 | 0.3825 | 0.6019 | 0.3277 | 0.5424
Table 6: Average precision and F1 scores of baseline and proposed model

6.2 Effects of technology code metadata

As shown in Table 7, we conduct experiments for each code to analyze its effect. CPC, the most finely subdivided classification, generally shows the highest classification performance. However, the performance of USPC was slightly higher than that of CPC on the GOCS data, so we performed a quantitative analysis to find the reason. For a fair comparison, the dimension of the dense layer after the graph embedding layer is 128 for all classification codes.

Dataset | TRF+DIFF AP | TRF+DIFF F1 | text+cpc AP | text+cpc F1 | text+ipc AP | text+ipc F1 | text+uspc AP | text+uspc F1
MPUART | 0.6552 | 0.8025 | 0.6321 | 0.7835 | 0.586 | 0.7606 | 0.5372 | 0.7227
1MWDFS | 0.566 | 0.7438 | 0.5384 | 0.7069 | 0.4902 | 0.6883 | 0.4669 | 0.6776
MRRG | 0.6871 | 0.823 | 0.6634 | 0.8069 | 0.5067 | 0.7059 | 0.6195 | 0.7814
GOCS | 0.4286 | 0.6467 | 0.4071 | 0.6301 | 0.3922 | 0.6151 | 0.4140 | 0.6347
Table 7: Assessing influence by code

6.3 Effects of text data

We experimented with different transformer sizes and several text embedding methods. Our proposed model shows high performance on most datasets, but the MRRG dataset shows better performance with a different hyperparameter configuration of the transformer. On the MRRG dataset, the default configuration reached a noticeably lower classification performance than the deeper configuration. For this reason, we believe that a deeper transformer structure for text helps more in this case than relying on the classification codes. In other words, if the number of valid patents is small, we judge that the model relies more on text than on technology codes. We also found that the MRRG dataset has the shortest average sequence length, which we think is why it can achieve high performance with only four attention heads. In addition, the overall performance difference was not significant when other text embedding techniques were used. However, Doc2Vec performed better than the other embedding techniques.

Dataset | TRF(6,8)+DIFF AP | TRF(6,8)+DIFF F1 | TRF(12,4)+DIFF AP | TRF(12,4)+DIFF F1 | Word2Vec+DIFF AP | Word2Vec+DIFF F1 | Doc2Vec+DIFF AP | Doc2Vec+DIFF F1 | Fasttext+DIFF AP | Fasttext+DIFF F1
MPUART | 0.6552 | 0.8025 | 0.6208 | 0.7810 | 0.6183 | 0.7739 | 0.65 | 0.7975 | 0.6165 | 0.7748
1MWDFS | 0.566 | 0.7438 | 0.5667 | 0.7404 | 0.5279 | 0.7123 | 0.556 | 0.7312 | 0.5371 | 0.7083
MRRG | 0.6871 | 0.823 | 0.7384 | 0.8426 | 0.6414 | 0.7895 | 0.7020 | 0.8289 | 0.6835 | 0.8212
GOCS | 0.4286 | 0.6467 | 0.3845 | 0.6027 | 0.3603 | 0.5918 | 0.3915 | 0.6148 | 0.3367 | 0.5556
Table 8: Comparison with embedding models

6.4 Lessons learned from the experiments

We draw the following lessons from the experimental results of the patent classification model.

  1. In the classification of patent documents, which are scholarly big data containing both metadata and text data, using the two kinds of features together yields better classification performance than using either feature alone.

  2. Technology codes play an important role in patent document classification. This may be because experts use technology codes as the primary criterion when they classify patents.

  3. The most important technology classification code may vary depending on the characteristics of the dataset. In general, however, CPCs, which are the more detailed technology codes, result in better classification performance.

  4. Depending on the dataset, technology codes other than CPC may become important. The number of technology codes that the valid patents have in a dataset is an important feature for patent classification. For example, in the GOCS dataset, USPC has a slightly higher impact on classification performance because the number of USPCs in valid patents is proportionally much higher than the number of CPCs.

  5. For more extremely imbalanced datasets, a deeper transformer may help more than the technology codes. When the number of CPC codes of valid patents is reduced, the model learns the classification pattern from the text data.

  6. As in other text classification tasks, patent documents show high performance when a transformer architecture is used. However, considering the efficiency of the model, Doc2Vec can also be a good alternative for text data.

7 Conclusion

In this paper, we proposed a deep patent landscaping model that solves the classification problem in patent landscaping using a transformer and Diff2Vec. Our research makes three contributions to patent landscaping research. First, it introduces new benchmarking datasets for the automated patent landscaping task, which makes the study of automated patent landscaping more practical. Second, it shows high overall classification performance in patent landscaping compared to existing models. Finally, it analyzes experimentally how technology codes and text data affect the model in patent classification. We believe this research can reduce the repetitive patent analysis work of practitioners.

Further research is needed from the point of view of patent classification. Patent documents contain various other metadata, such as assignees, inventors, and citations. It is necessary to identify how much classification performance can be achieved by considering these features at the same time. Different datasets also require different types of classification models, so we need to develop models that fit the different datasets. We expect that this can be addressed through research on meta-learning and AutoML, which are current topics in the field of deep learning.

8 Acknowledgement

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (No. NRF-2015R1C1A1A01056185 and No. NRF-2018R1D1A1B07045825). We sincerely thank Dr. Min and Dr. Kim, who live in the southern area of Gyeonggi-do, Korea. They gave us a great deal of inspiration and courage to write this paper.

Appendix A BigQuery Search Query for Patent Datasets

Dataset Name Query
MPUART (((REGEXP_CONTAINS(description.text, " virtual%") or REGEXP_CONTAINS(description.text, " augment%") or REGEXP_CONTAINS(description.text, "mixed%")) or (REGEXP_CONTAINS(description.text, " real%") or REGEXP_CONTAINS(description.text, " environment%") or REGEXP_CONTAINS(description.text, " space "))) or (REGEXP_CONTAINS(description.text, " augment%") and REGEXP_CONTAINS(description.text, " real%"))) and (((REGEXP_CONTAINS(description.text, " offshore%") or REGEXP_CONTAINS(description.text, " off-shore%") or REGEXP_CONTAINS(description.text, " ocean ")) or (REGEXP_CONTAINS(description.text, " plant%") or REGEXP_CONTAINS(description.text, " platform%"))) or REGEXP_CONTAINS(description.text, " ship%") or REGEXP_CONTAINS(description.text, " dock%") or REGEXP_CONTAINS(description.text, " carrier ") or REGEXP_CONTAINS(description.text, " vessel ") or REGEXP_CONTAINS(description.text, " marine ") or REGEXP_CONTAINS(description.text, " boat%") or REGEXP_CONTAINS(description.text, " drillship ") or (REGEXP_CONTAINS(description.text, " drill ") or REGEXP_CONTAINS(description.text, " ship ")) or REGEXP_CONTAINS(description.text, " FPSO ") or (REGEXP_CONTAINS(description.text, " float%") or (REGEXP_CONTAINS(description.text, " product%") or REGEXP_CONTAINS(description.text, " storag%"))) or REGEXP_CONTAINS(description.text, " FPU ") or REGEXP_CONTAINS(description.text, " LNG ") or REGEXP_CONTAINS(description.text, " FSRU ") or REGEXP_CONTAINS(description.text, " OSV ") or REGEXP_CONTAINS(description.text, " aero%") or REGEXP_CONTAINS(description.text, " airplane ") or REGEXP_CONTAINS(description.text, " aircraft ") or REGEXP_CONTAINS(description.text, " construction ") or (REGEXP_CONTAINS(description.text, " civil ") or REGEXP_CONTAINS(description.text, " engineer%")) or REGEXP_CONTAINS(description.text, " bridge ") or REGEXP_CONTAINS(description.text, " building ") or REGEXP_CONTAINS(description.text, " vehicle ") or REGEXP_CONTAINS(description.text, " vehicular ") or REGEXP_CONTAINS(description.text, " automotive ") or REGEXP_CONTAINS(description.text, " automobile "))
1MWDFS (((REGEXP_CONTAINS(description.text, " inducti%") or REGEXP_CONTAINS(description.text, " heating ")) or (REGEXP_CONTAINS(description.text, " induction ") or REGEXP_CONTAINS(description.text, " hardening ")) or (REGEXP_CONTAINS(description.text, " contour ") or REGEXP_CONTAINS(description.text, " hardening ")) or (REGEXP_CONTAINS(description.text, " surface ") or REGEXP_CONTAINS(description.text, " hardening "))) and (REGEXP_CONTAINS(description.text, " dual-frequency ") or REGEXP_CONTAINS(description.text, " multi-frequency ") or ((REGEXP_CONTAINS(description.text, " dual ") or REGEXP_CONTAINS(description.text, " multi ")) or REGEXP_CONTAINS(description.text, " frequency ")) or (REGEXP_CONTAINS(description.text, " frequency ") or (REGEXP_CONTAINS(description.text, " selectable ") or REGEXP_CONTAINS(description.text, " variable "))))) or ((REGEXP_CONTAINS(description.text, " Inducti%") or REGEXP_CONTAINS(description.text, " heating ")) and ((REGEXP_CONTAINS(description.text, " contour ") or REGEXP_CONTAINS(description.text, " hardening ")) or (REGEXP_CONTAINS(description.text, " surface ") or REGEXP_CONTAINS(description.text, " hardening "))))
MRRG ((REGEXP_CONTAINS(description.text, " precipitat ") or REGEXP_CONTAINS(description.text, " rain ") or REGEXP_CONTAINS(description.text, " snow ") or REGEXP_CONTAINS(description.text, " weather ") or REGEXP_CONTAINS(description.text, " climate ") or REGEXP_CONTAINS(description.text, " meteor ") or REGEXP_CONTAINS(description.text, " downpour ") or REGEXP_CONTAINS(description.text, " cloudburst ") or REGEXP_CONTAINS(description.text, " deluge ") or REGEXP_CONTAINS(description.text, " flood ") or REGEXP_CONTAINS(description.text, " disaster ") or (REGEXP_CONTAINS(description.text, " wind ") or (REGEXP_CONTAINS(description.text, " field ") or REGEXP_CONTAINS(description.text, " speed ") or REGEXP_CONTAINS(description.text, " velocit ") or REGEXP_CONTAINS(description.text, " direction "))) or REGEXP_CONTAINS(description.text, " storm ") or REGEXP_CONTAINS(description.text, " hurricane ")) and ((REGEXP_CONTAINS(description.text, " radio ") or (REGEXP_CONTAINS(description.text, " wave ") or REGEXP_CONTAINS(description.text, " signal ") or REGEXP_CONTAINS(description.text, " frequency "))) or ((REGEXP_CONTAINS(description.text, " electr ") or REGEXP_CONTAINS(description.text, " micro ")) or REGEXP_CONTAINS(description.text, " wave ")) or REGEXP_CONTAINS(description.text, " beam ")) and (REGEXP_CONTAINS(description.text, " verif ") or REGEXP_CONTAINS(description.text, " check ") or REGEXP_CONTAINS(description.text, " invest ") or REGEXP_CONTAINS(description.text, " experiment ") or REGEXP_CONTAINS(description.text, " test ") or REGEXP_CONTAINS(description.text, " simulat ")))
GOCS ((REGEXP_CONTAINS(description.text, " satellite ")) and (REGEXP_CONTAINS(description.text, " band ") or REGEXP_CONTAINS(description.text, " illumination ") or REGEXP_CONTAINS(description.text, " illuminance ")) and (REGEXP_CONTAINS(description.text, " merge ") or REGEXP_CONTAINS(description.text, " merging ") or REGEXP_CONTAINS(description.text, " fusion ") or REGEXP_CONTAINS(description.text, " mosaic ")))
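
Conditions of this shape are tedious to write by hand, and they can be assembled from nested keyword groups. The following is a minimal sketch under stated assumptions, not the tooling used to build the datasets: the `render` helper and the tuple-based group representation are hypothetical, and the example reproduces only the structure of the GOCS row (one keyword group per concept, joined with `and`).

```python
def render(node, column="description.text"):
    """Render a keyword (str) or an (operator, children) group as a boolean expression."""
    if isinstance(node, str):
        # Escape backslashes and double quotes so the term stays inside the literal.
        term = node.replace("\\", "\\\\").replace('"', '\\"')
        return f'REGEXP_CONTAINS({column}, "{term}")'
    operator, children = node
    if operator not in ("and", "or"):
        raise ValueError(f"unsupported operator: {operator!r}")
    joined = f" {operator} ".join(render(child, column) for child in children)
    return f"({joined})"


# Keyword groups taken from the GOCS row of the table.
gocs = ("and", [
    ("or", [" satellite "]),
    ("or", [" band ", " illumination ", " illuminance "]),
    ("or", [" merge ", " merging ", " fusion ", " mosaic "]),
])

gocs_condition = render(gocs)
print(gocs_condition)
```

The helper passes each term through unchanged. Note that a trailing `%`, as used in some rows, is a wildcard in SQL `LIKE` patterns, whereas a regular expression treats it as a literal character, so such terms may need adapting before a condition is run.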

References