E.T.-RNN: Applying Deep Learning to Credit Loan Applications

  • 2019-11-06 17:13:03
  • Dmitrii Babaev, Maxim Savchenko, Alexander Tuzhilin, Dmitrii Umerenkov

Abstract

In this paper we present a novel approach to credit scoring of retail customers in the banking industry based on deep learning methods. We used RNNs on fine-grained transactional data to compute credit scores for the loan applicants. We demonstrate that our approach significantly outperforms the baselines based on the customer data of a large European bank. We also conducted a pilot study on loan applicants of the bank, and the study produced significant financial gains for the organization. In addition, our method has several other advantages described in the paper that are very significant for the bank.

 


Dmitrii Babaev (Sberbank AI Lab), Maxim Savchenko (Sberbank AI Lab), Alexander Tuzhilin (New York University) and Dmitrii Umerenkov (Sberbank AI Lab)

Keywords: credit scoring, recurrent neural networks, card transactions, multivariate time-series
Published in: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. DOI: 10.1145/3292500.3330693. ISBN: 978-1-4503-6201-6/19/08.
CCS concepts: Computing methodologies → Neural networks; Applied computing.

1. Introduction

Credit scoring is a very important problem for the banking industry because of its huge financial implications for banks. The banking industry has developed credit scoring models since the middle of the XX century and has refined them ever since, investing millions of dollars in the process. Traditional credit scoring models rely on a loan application questionnaire, the applicant's credit history and various other aggregate financial information relevant to the customer’s application. These models use traditional machine learning methods, such as logistic regression, to compute a credit score that indicates whether the customer would repay the loan. Although widespread and useful, these models have certain limitations. First, credit scoring requires extensive feature engineering and deep domain knowledge in order to design good features. Second, if the customer does not have a significant credit history, it is hard to make reliable scoring decisions regarding that person. Third, the currently existing models do not take full advantage of all the data available about the customer in modern settings.

In this paper we propose a novel approach, called Embedding-Transactional Recurrent Neural Network (E.T.-RNN), to compute credit scores of the bank customers by examining the history of their credit and debit card transactions. We do it using a deep learning (DL) approach, as opposed to more traditional machine learning methods. Note that this approach is applicable only to those customers that have credit or debit cards with the bank. Since a significant percentage of the applicants indeed have credit or debit cards, our method works for a large segment of the applicants. Furthermore, our proposed method has the following advantages in comparison to the current credit scoring methods. First, as is shown in the paper, the proposed DL-based method outperforms the baselines, including models currently in use in the bank, resulting in significant financial gains. Second, the proposed DL-based model works directly on the customer transactions and does not need extensive feature engineering requiring deep domain expertise (generating hundreds or thousands of hand-crafted aggregate features). Third, our model works exclusively on the transactional data and, therefore, does not require any additional input from the client. This means that we can make credit loan decisions very fast, ideally in near real-time, because the whole credit scoring process is fully automated. Fourth, information in the transactional data is exceptionally hard to forge. Hence there is no need to check the correctness of the data, unlike the questionnaire and some other data sources used for scoring. Fifth, even the customers without any credit history can be assessed for creditworthiness, their transactional history constituting a source for estimating credit risks. Finally, the proposed method constitutes a fair approach to credit scoring, as it does not use information about the individual and therefore cannot be used to discriminate against credit applicants based on various demographic factors. These constitute very significant advantages vis-a-vis current loan practices and have the potential to disrupt the retail banking loan industry.

One issue with the proposed approach is the interpretability of black-box models, such as neural networks. Different organizations across the world have different philosophies regarding applying black-box models to credit scoring problems. While in some countries lack of interpretability is considered a clear "no-go", in other parts of the world it is considered more appropriate to use such models for credit scoring tasks. Also, there has been significant progress in solving the black-box interpretation problem in the last few years (Choi et al., 2016), (Gupta and Schütze, 2018), (McCoy et al., 2019) and we expect this progress to accelerate moving forward. Therefore, we believe that the issue of using black-box models for credit scoring will be less relevant in the future.

This paper makes the following contributions. First, we propose to use neural networks on the customers' fine-grained transactional data for credit scoring applications in the banking industry. Second, we tested our method against the benchmarks on the historical data and achieved superior performance. Third, we conducted a pilot study on loan applicants and produced significant financial gains for the bank.

The rest of the paper is organised as follows. In Section 2 we discuss the related work. In Section 3 we describe the proposed method. In Section 4 we present our experiments and in Section 5 the results of these experiments. Sections 6 and 7 are dedicated to the discussion of our results and conclusions.

2. Related work

There is a large amount of research on credit scoring problems for the banking industry going back to the first half of the XX century (Durand, 1941). A wide range of methods has been used for this task, including logistic regression (Wiginton, 1980), decision trees (Makowski, 1985), boosting (Bastos, 2008), support vector machines (SVM) (Huang et al., 2007) and neural networks (NN) (West, 2000). Credit scoring methods historically relied on questionnaire data and the applicant’s credit history. However, new data sources have been utilized more recently to increase scoring quality, such as telecom data (Björkegren and Grissen, 2017) and transactional data (Khandani et al., 2010), (Bellotti and Crook, 2013), (Kvamme et al., 2018), (Chi and Hsu, 2012), (Tobback and Martens, 2017).

Most of the previous approaches to credit scoring used transactional data aggregated either globally (Chi and Hsu, 2012) or over some time window, such as a month (Khandani et al., 2010), (Bellotti and Crook, 2013) or a day (Kvamme et al., 2018), and most of them relied on classical ML methods. For example, in (Khandani et al., 2010) the authors used generalized classification and regression trees on monthly transactional statistics. In (Bellotti and Crook, 2013) the authors used discrete survival models on monthly transactional statistics. Furthermore, some authors used NN-based approaches to credit scoring on aggregated transactional data. For example, in (Kvamme et al., 2018) the authors applied shallow convolutional neural networks to daily transactional statistics.

Furthermore, (Tobback and Martens, 2017) have developed some credit scoring models on unaggregated transactional data. However, they used classical ML methods, such as SVMs and weighted-vote relational neighbour classifiers, in their models. Moreover, they focused on the connectivity problem in their work to estimate credit risk and used only information about who transacted with whom, without deploying the full power of the transactional data.

Also, NNs have been applied to the analysis of transactional data, but in other types of applications. In (Wiese and Omlin, 2009) the authors used a Long Short Term Memory (LSTM) Recurrent Neural Network (RNN) (Gers et al., 1999) on individual transaction features for detection of fraudulent transactions. For a review of NN methods in credit card fraud detection see (Abdallah et al., 2016). In (Zhang et al., 2017) the authors applied an LSTM RNN to predict credit scores for a peer-to-peer lending platform.

The main contribution of this work is that we use Neural Network methods for traditional banking credit scoring problems on the unaggregated transactional data. In this paper, we use an RNN based method in the credit scoring problem. We describe our approach to this problem in the next section.

3. The method

3.1. Transactional data

Our method computes credit scores using transactional data. Each client has multiple credit card transactions, and each transaction has several attributes, both categorical and numerical, and occurs at a certain time. Our data can be described as multivariate time-series data, the schema of which is presented in Table 1. The merchant type field represents the kind of merchant, such as airline, hotel, restaurant, etc. (note that it is impossible to restore the real merchant organization identifier from this field).

Table 1. Data structure for a single client
Attribute              Transaction 1  Transaction 2   Transaction 3
Amount                 230            5               40
Currency               EUR            USD             USD
Country                France         US              US
Time                   16:40          20:15           09:30
Date                   Jun 21         Jun 21          Jun 22
Merchant Type          Restaurant     Transportation  Household Appliance
Card type              Visa Classic   Visa Classic    Visa Gold
Issuing Branch         90/10735       90/01735        90/01779
N opened debit cards   1              1               1
N opened credit cards  1              1               1
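
As an illustration of the record layout in Table 1, the sketch below holds one client's history as a list of typed records. The class name, the field names and the year in the timestamps are assumptions made for this example; they are not the authors' schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    """One card transaction mirroring the rows of Table 1 (names are hypothetical)."""
    amount: float
    currency: str
    country: str
    timestamp: datetime          # combines the Date and Time rows
    merchant_type: str
    card_type: str
    issuing_branch: str
    n_opened_debit_cards: int
    n_opened_credit_cards: int


# The three example transactions of Table 1; the year 2018 is an arbitrary placeholder.
client_history = [
    Transaction(230.0, "EUR", "France", datetime(2018, 6, 21, 16, 40),
                "Restaurant", "Visa Classic", "90/10735", 1, 1),
    Transaction(5.0, "USD", "US", datetime(2018, 6, 21, 20, 15),
                "Transportation", "Visa Classic", "90/01735", 1, 1),
    Transaction(40.0, "USD", "US", datetime(2018, 6, 22, 9, 30),
                "Household Appliance", "Visa Gold", "90/01779", 1, 1),
]

# A client is a chronologically ordered sequence of such records.
client_history.sort(key=lambda t: t.timestamp)
```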

In the next section we describe an architecture of the neural network that computes credit scores using that transactional data.

3.2. Architecture overview

RNNs are used for processing sequential information. In a way, RNNs have "memory" over previous computations and use information from the previous time-steps in addition to the current input in order to produce the next output. This approach is naturally suited for many NLP tasks including text classification, machine translation and language modelling (Mikolov et al., 2010).

Our Embedding-Transactional RNN (E.T.-RNN) architecture is presented in Figure 1 and is inspired by the NLP methods in the context of deep learning (Mikolov et al., 2010). We treated the credit scoring task as a text classification task, using clients as texts and transactions as individual words.

As Figure 1 shows, the E.T.-RNN model consists of three parts: embedding layers, recurrent encoder and classifier. We will explain each of them in the rest of this section. Note that all the parts are trained simultaneously in the end-to-end manner.

3.2.1. Embeddings

Credit card transactions are mapped into a latent space before being passed to the encoder RNN. In particular, each categorical variable in each transaction is encoded to a low-dimensional vector via a corresponding embedding layer. The embedding layers are randomly initialized and trained simultaneously with the encoder. We have treated the timestamp as a collection of categorical variables each representing a date part (hour, weekday, month). Each transaction is represented as a concatenation of scalar variables and embeddings of categorical variables.
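
The mapping described above can be sketched in plain Python as follows. This is only an illustration: the embedding sizes, the choice of fields and the helper names are assumptions, and the lookup tables here are merely initialised at random, whereas in the model they are trained together with the encoder.

```python
import random
from datetime import datetime

random.seed(0)

# Assumed embedding sizes per categorical field (not taken from the paper).
EMBEDDING_DIMS = {"merchant_type": 4, "currency": 2, "hour": 2, "weekday": 2, "month": 2}


class EmbeddingTable:
    """Lookup table that gives each category a randomly initialised vector."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}

    def lookup(self, category):
        if category not in self.vectors:
            self.vectors[category] = [random.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self.vectors[category]


tables = {name: EmbeddingTable(dim) for name, dim in EMBEDDING_DIMS.items()}


def encode_transaction(amount, currency, merchant_type, timestamp):
    """Concatenate scalar values with the embeddings of the categorical values."""
    categorical = {
        "merchant_type": merchant_type,
        "currency": currency,
        # The timestamp is split into categorical date parts.
        "hour": timestamp.hour,
        "weekday": timestamp.weekday(),
        "month": timestamp.month,
    }
    vector = [float(amount)]
    for name, value in categorical.items():
        vector.extend(tables[name].lookup(value))
    return vector


vec = encode_transaction(230.0, "EUR", "Restaurant", datetime(2018, 6, 21, 16, 40))
```

With the assumed sizes, each transaction becomes a vector of 1 scalar plus 12 embedding components.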

3.2.2. Encoder

We used a single layer RNN based on Gated Recurrent Unit (GRU) (Cho et al., 2014) as an encoder. The hidden vector from the last time step was used as the representation of the client. Note that this approach is also commonly used for text analysis (Sutskever et al., 2014).

3.2.3. Classifier

The hidden vector from the last time step is finally passed to the fully connected classifier sub-network. It turned out that a simple linear classifier outperformed several alternative approaches in our experiments and therefore we used it in our architecture.

More generally, we experimented extensively with different types of deep learning architectures, as explained in Section 4.3.1, and the architecture presented in Figure 1 turned out to be the best for our experiments.
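
The encoder and classifier of Sections 3.2.2 and 3.2.3 can be sketched as a forward pass in plain Python. The sketch follows the GRU formulation of Cho et al. (2014) with biases omitted; the layer sizes, the weight initialisation and all function names are assumptions, and the weights are untrained, so the resulting score is meaningless beyond showing the data flow.

```python
import math
import random

random.seed(0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Single GRU step (Cho et al., 2014), without bias terms."""

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        self.W_z = random_matrix(hidden_size, input_size)
        self.U_z = random_matrix(hidden_size, hidden_size)
        self.W_r = random_matrix(hidden_size, input_size)
        self.U_r = random_matrix(hidden_size, hidden_size)
        self.W_h = random_matrix(hidden_size, input_size)
        self.U_h = random_matrix(hidden_size, hidden_size)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W_z, x), matvec(self.U_z, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W_r, x), matvec(self.U_r, h))]
        # Candidate state uses the reset-gated previous state.
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        candidate = [math.tanh(a + b)
                     for a, b in zip(matvec(self.W_h, x), matvec(self.U_h, gated_h))]
        # New state interpolates between the previous state and the candidate.
        return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, candidate)]


def score_client(transactions, cell, classifier_weights):
    """Encode the sequence and apply a linear classifier to the last hidden state."""
    h = [0.0] * cell.hidden_size
    for x in transactions:
        h = cell.step(x, h)
    return sum(w * hi for w, hi in zip(classifier_weights, h))


cell = GRUCell(input_size=3, hidden_size=4)
score = score_client([[1.0, 0.0, 0.5], [0.2, 0.1, 0.0]], cell, [0.1, -0.2, 0.3, 0.05])
```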

Figure 1. Final architecture

3.3. Loss function

In this work we use the standard area under the ROC curve (ROC AUC) performance measure.
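
For reference, ROC AUC equals the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. A minimal sketch of this pairwise definition is given below; the function name is an assumption, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of correctly ordered (positive, negative) pairs."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Three of the four (positive, negative) pairs are ordered correctly.
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```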

Several loss functions can be used as a proxy for the task of maximising ROC AUC, including the classic binary cross-entropy loss, $L_{CE}(p, y) = -\sum_i y_i \log(p_i)$, and the margin ranking loss, $L_R(p_1, p_2, y) = \max(0, -y \cdot (p_1 - p_2) + \mathrm{margin})$, which directly optimizes ROC AUC.

In the final version of our model we decided to use margin ranking loss with margin 0.01, which showed the best results on our data, as presented in Section 4.3.2.
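
The two losses can be written out for a single example and a single pair as in the sketch below. The function names are assumptions. The cross-entropy is given in its usual two-term binary form, which is how the sum over classes is commonly expanded for 0/1 labels, and the label convention for the ranking loss (y = +1 when the first score should be higher) is also an assumption.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p in (0, 1) and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def margin_ranking_loss(p1, p2, y, margin=0.01):
    """max(0, -y * (p1 - p2) + margin); y = +1 means p1 should rank above p2."""
    return max(0.0, -y * (p1 - p2) + margin)


# A correctly ordered pair with a gap larger than the margin costs nothing.
loss_ok = margin_ranking_loss(0.8, 0.3, y=1)
# A wrongly ordered pair is penalised by the gap plus the margin.
loss_bad = margin_ranking_loss(0.3, 0.8, y=1)
```

Summing the ranking loss over (positive, negative) pairs penalises exactly the pairs that lower the pairwise ROC AUC, which is why it is a natural proxy for that measure.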

3.4. Ensembling

Ensembling (Breiman, 1996) is a way to increase both quality of the model and its stability at the expense of time and computational power. In our case, we have a relative abundance of the negative class samples, as described in Section 4.1. Hence it is possible to use different subsamples of the negative class samples for training each model in the ensemble. The specific parameters will be described in Section 4.3.4.

In the final version of our model, we settled on using the mean prediction of an ensemble of six separately trained models, as a practical balance between prediction quality and execution time. The ensemble quality gain and other possible ensembling strategies are further explored in Section 4.3.4.
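
Averaging the predictions of the ensemble members can be sketched as below. The stand-in models are hypothetical toy functions used only to make the example runnable; in the described setting each member would be a separately trained E.T.-RNN.

```python
from statistics import mean


def ensemble_score(models, client_features):
    """Average the scores that separately trained models assign to one client."""
    return mean(model(client_features) for model in models)


# Six stand-in "models" that differ only by a small bias (purely illustrative).
models = [
    (lambda x, bias=b: 0.1 * sum(x) + bias)
    for b in (0.00, 0.01, -0.01, 0.02, -0.02, 0.00)
]

score = ensemble_score(models, [1.0, 2.0])
```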

4. Experiments

4.1. Data

The data used for the experiments was provided by a large European bank. For our experiments, we took transactional data for the clients who applied for retail credits. Since strict adherence to the requirements of personal data protection laws was one of the key priorities for the Bank when carrying out the project, we cannot describe our data in detail or share it with the readers. Furthermore, we only considered applicants who already used a debit or a credit card product in the bank. If a client has several cards, then transactions from every card were taken into account.

The available transactional data falls into two subcategories: transaction-level (such as timestamp, country, amount, merchant type) and card-level (such as issuing branch, card type). Card-level data is duplicated verbatim for each transaction related to the corresponding card. An example of three typical card transactions is presented in Table 1. We also used two derived features calculated from transactional data:

  • difference in days between the time of current transaction and the time of previous transaction by this customer

  • time in days elapsed from the card issue date until the transaction date
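
The two derived features listed above can be computed as in the following sketch. It assumes a single card, uses calendar dates only, and assigns 0 to the first transaction, which has no predecessor; the function name and these conventions are assumptions.

```python
from datetime import date


def derived_time_features(transaction_dates, card_issue_date):
    """Return (days_since_previous, days_since_issue) for each transaction in date order."""
    features = []
    previous = None
    for current in sorted(transaction_dates):
        # The first transaction has no predecessor; 0 is an assumed placeholder.
        days_since_previous = (current - previous).days if previous is not None else 0
        days_since_issue = (current - card_issue_date).days
        features.append((days_since_previous, days_since_issue))
        previous = current
    return features


# Dates follow Table 1; the year and the card issue date are made up for the example.
features = derived_time_features(
    [date(2018, 6, 21), date(2018, 6, 21), date(2018, 6, 22)],
    card_issue_date=date(2018, 1, 1),
)
```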

Only the transactions performed before the application date are taken for training and validation.

Our training dataset represented more than 740 thousand clients with approximately 200 million transactions in total. As a target variable, we used the event of default for consumer loan during a year after its disbursement. The period of one year was selected using the performance window attribute, as described in (Siddiqi, 2005).

Due to the risk of data non-stationarity, we opted for an out-of-time validation strategy, as in (Kvamme et al., 2018), instead of the out-of-sample validation used in (Khandani et al., 2010) and (Bellotti and Crook, 2013). Note that our results for the out-of-fold validation were consistently higher than those for the out-of-time validation for a range of architectures and hyperparameters, which is a common situation, as discussed in (Glennon et al., 2008).

We used a subset of credit applications from a 16-month period for training and from a four-month period for the out-of-time validation. Training and validation sets were the same for each considered model and baseline.
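
An out-of-time split separates training and validation data by application date rather than at random. A minimal sketch follows; the record layout, the field name `application_date` and the dates are hypothetical.

```python
from datetime import date


def out_of_time_split(applications, cutoff):
    """Train on applications dated before the cutoff, validate on the later ones."""
    train = [a for a in applications if a["application_date"] < cutoff]
    valid = [a for a in applications if a["application_date"] >= cutoff]
    return train, valid


applications = [
    {"id": 1, "application_date": date(2017, 3, 1)},
    {"id": 2, "application_date": date(2018, 5, 1)},
    {"id": 3, "application_date": date(2018, 7, 15)},
]
train, valid = out_of_time_split(applications, cutoff=date(2018, 5, 1))
```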

Due to a large disparity between the number of positive and negative cases (because of the low default rate at the bank), we settled on the following undersampling strategy: before each experiment we selected all the positive cases and 10 times as many randomly selected negative cases. On each training epoch we used all positive cases and an equal number of negative cases, selected from the pool of negative cases.
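
The two-stage undersampling can be sketched as below. The function names are assumptions, and sampling without replacement within an epoch is one possible reading of "selected from the pool".

```python
import random


def build_negative_pool(positives, negatives, ratio=10, rng=None):
    """Before an experiment: draw `ratio` times as many random negatives as there are positives."""
    rng = rng or random.Random()
    pool_size = min(len(negatives), ratio * len(positives))
    return rng.sample(negatives, pool_size)


def epoch_sample(positives, negative_pool, rng=None):
    """For one epoch: all positives plus an equal number of negatives from the pool."""
    rng = rng or random.Random()
    k = min(len(positives), len(negative_pool))
    examples = list(positives) + rng.sample(negative_pool, k)
    rng.shuffle(examples)
    return examples


rng = random.Random(0)
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(1000)]
pool = build_negative_pool(positives, negatives, ratio=10, rng=rng)
epoch = epoch_sample(positives, pool, rng=rng)
```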

All models in this paper were trained on the last 800 transactions for each customer when available. Zero padding was applied when the actual transaction count for a client was lower.
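
Truncation to the most recent transactions and zero padding can be sketched as follows. Placing the padding in front of the sequence, so that the most recent transaction stays in the last position, is an assumption; the text does not state on which side the padding was applied.

```python
def pad_or_truncate(sequence, max_len=800, pad_value=0.0):
    """Keep the last `max_len` transaction vectors and pad shorter histories with zeros."""
    if not sequence:
        raise ValueError("sequence must contain at least one transaction")
    width = len(sequence[0])
    kept = sequence[-max_len:]
    padding = [[pad_value] * width for _ in range(max_len - len(kept))]
    # Padding goes in front so the latest transaction remains last (an assumption).
    return padding + kept


short = pad_or_truncate([[1.0, 2.0], [3.0, 4.0]], max_len=4)
long = pad_or_truncate([[float(i)] for i in range(6)], max_len=4)
```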

4.2. Baselines

To compare our model with other approaches, we have implemented a logistic regression based model. We have also implemented an additional model that is based on the Gradient Boosting Machine (GBM) method (Friedman, 2001).

Both logistic regression and GBM methods require a large number of hand-crafted aggregate features produced from the transactional data as an input to the classification model. An example of an aggregate feature would be the average spending amount in some category of merchants, such as hotels, over the entire transaction history.
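
As an illustration of such an aggregate feature, the sketch below computes the average amount per merchant type over a client's whole history. The function name and the sample amounts are made up for the example.

```python
from collections import defaultdict


def mean_amount_by_merchant_type(transactions):
    """Average spending per merchant type over (merchant_type, amount) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for merchant_type, amount in transactions:
        totals[merchant_type] += amount
        counts[merchant_type] += 1
    return {m: totals[m] / counts[m] for m in totals}


aggregates = mean_amount_by_merchant_type(
    [("Hotel", 100.0), ("Restaurant", 40.0), ("Hotel", 200.0)]
)
```

Each such statistic yields one column of the feature table, which is how the baselines arrive at hundreds or thousands of features.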

We used the LightGBM (Ke et al., 2017) implementation of the GBM algorithm and created nearly 7000 hand-crafted features for the application.

Similarly, for the logistic regression we manually designed about 400 aggregate features. Weight of evidence coding and binning of predictors (Lund, 2016) were used to transform categorical features.
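
Weight of evidence (WoE) coding replaces each bin or category with the logarithm of the ratio between its share of non-defaulting and its share of defaulting clients. The sketch below uses the convention ln(share of goods / share of bads); the opposite sign convention is also in use, and the function name and counts are assumptions for illustration.

```python
import math


def weight_of_evidence(goods_in_bin, bads_in_bin, goods_total, bads_total):
    """WoE of one bin: ln((goods_in_bin / goods_total) / (bads_in_bin / bads_total))."""
    if min(goods_in_bin, bads_in_bin, goods_total, bads_total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log((goods_in_bin / goods_total) / (bads_in_bin / bads_total))


# A bin holding 50% of the goods but only 10% of the bads gets a positive WoE.
woe = weight_of_evidence(goods_in_bin=50, bads_in_bin=10, goods_total=100, bads_total=100)
```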

4.3. Offline execution of our method

4.3.1. Encoder architecture selection

We experimented with different encoder architectures, using Long Short Term Memory (LSTM), Bidirectional Recurrent Cells (Schuster and Paliwal, 1997) and Gated Recurrent Units (GRU). The results of this comparison are presented in Table 2. Based on this comparison we decided to use a one-layer GRU because its difference from the best performing bidirectional model was not statistically significant, while the bidirectional model is more complex and incurs a noticeable computational cost.

Table 2. Encoder architecture comparison
Encoder Valid ROC-AUC (STD)
GRU 1-layer 0.8155 (0.0015)
GRU 1-layer Bidirectional 0.8160 (0.0004)
LSTM 1-layer 0.8055 (0.0022)
LSTM 1-layer Bidirectional 0.8058 (0.0027)

4.3.2. Loss function and learning rate

We used a batch size of 32 for training and a batch size of 768 for validation in all the experiments. Using the ranking loss introduces a new hyperparameter, the loss margin size. We found that a loss margin size of 0.1 gives the best results among all the loss hyperparameters that we tried, as shown in Table 3 (where the ranking loss rows are labelled Hinge followed by the margin).

Table 3. Loss comparison
Loss Valid ROC-AUC (STD)
BCE Loss 0.8124 (0.0016)
Hinge 0.5 0.8104 (0.0026)
Hinge 0.1 0.8168 (0.0017)
Hinge 0.01 0.8155 (0.0016)
Hinge 0.01 + BCE 0.8144 (0.0030)

The learning rate and its reduction schedule are among the most sensitive hyperparameters and can dramatically change the performance of the model. Note that the optimal learning rate schedule depends heavily on the loss function used, the batch size and the overall number of parameters in the model. We tried several learning rates and several learning rate reduction regimes and found that for both BCE loss and ranking loss the most effective strategy was an aggressive linear learning rate reduction with gamma=0.5, as shown in Table 4. We also tried varying the learning rate cyclically, as proposed in (Smith, 2017), instead of decreasing it monotonically, but this did not improve the results.

Table 4. Learning rate schedules
Schedule Valid ROC-AUC (STD)
gamma = 1 0.8042 (0.0026)
gamma = 0.8 0.8144 (0.0015)
gamma = 0.5 0.8155 (0.0016)
gamma = 0.5, 2 cycles 0.8145 (0.0006)
gamma = 0.5, 3 cycles 0.8111 (0.0027)
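
One common way to read a reduction factor gamma is as a multiplier applied to the learning rate after every epoch, so that gamma = 1 keeps the rate constant and gamma = 0.5 halves it each time. The sketch below implements this reading; both the interpretation and the initial learning rate are assumptions, since the text does not give the exact schedule or the starting value.

```python
def decayed_learning_rates(initial_lr, gamma, n_epochs):
    """Learning rate per epoch when the rate is multiplied by gamma after each epoch."""
    return [initial_lr * gamma ** epoch for epoch in range(n_epochs)]


# Hypothetical initial learning rate of 0.004.
halving = decayed_learning_rates(0.004, gamma=0.5, n_epochs=3)
constant = decayed_learning_rates(0.004, gamma=1.0, n_epochs=3)
```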

4.3.3. Regularization methods

Due to the low number of positive-class examples, all models exhibit a propensity for overfitting. Therefore we tried various regularisation methods, such as:

  • Transaction dropout that randomly drops some of the client transactions with defined probability

  • Transaction shuffle that randomly permutes the order of client transactions

  • Dropout after embedding that randomly zeroes some components after embedding layer

Note that none of the aforementioned regularization methods proved effective against overfitting, as shown in Figure 2.
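
The three regularization methods listed above can be sketched as simple sequence and vector augmentations. The function names are assumptions, as are two details: keeping at least one transaction after dropout, and rescaling the surviving embedding components (inverted dropout).

```python
import random


def transaction_dropout(transactions, drop_prob, rng):
    """Drop each transaction independently with probability drop_prob."""
    kept = [t for t in transactions if rng.random() >= drop_prob]
    # Keep at least one transaction so the sequence is never empty (an assumption).
    return kept if kept else [rng.choice(transactions)]


def transaction_shuffle(transactions, rng):
    """Return the client's transactions in a random order."""
    shuffled = list(transactions)
    rng.shuffle(shuffled)
    return shuffled


def embedding_dropout(vector, drop_prob, rng):
    """Zero each component with probability drop_prob and rescale the rest."""
    scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else v * scale for v in vector]


rng = random.Random(0)
history = list(range(10))
dropped = transaction_dropout(history, drop_prob=0.3, rng=rng)
shuffled = transaction_shuffle(history, rng=rng)
masked = embedding_dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, rng=rng)
```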

Figure 2. Regularization methods

4.3.4. Ensembling methods

We tried several different types of ensembling methods:

  • Simple averaging of model results. Averaging predictions of different models trained with distinct negative class examples leads to both increased accuracy and reduced variability of results, as shown in Figure 3.

  • Stochastic Weight Averaging (SWA) (Loshchilov and Hutter, 2016). Averaging the weights of ensemble models can significantly reduce inference time since only one model with averaged weights is used instead of the whole ensemble. But in our case averaging the weights of different models led to a noticeable reduction in quality.

  • Snapshot ensembling (Huang et al., 2017). Using snapshots of the same model in the final ensemble can significantly reduce training time since only one model has to be trained. Unfortunately, this approach does not benefit from using distinct negative class examples.

  • SWA + snapshot ensembling. We found that combining SWA with snapshot ensembling for single model training by taking snapshots after a set epoch and averaging the weights leads to some reduction of variability, but the results were inconclusive and we opted for not using these advanced ensembling methods in our production model.

Figure 3. Ensemble quality comparison

We opted to use an averaging ensemble of size six for our production model, providing a reasonable compromise between model quality and training/inference times. As mentioned in Section 3.4, each model of the ensemble is trained on a different subsample of the negative class samples. As described in Section 4.1, we use an undersampling procedure to reduce the number of negative samples. The negative samples are selected independently for each model of the ensemble, hence each model is trained on a slightly different subset of negative samples.

4.4. Moving to production

We performed a large-scale field test of our neural scoring model in the bank’s production pipeline. We used the model trained on the same dataset as discussed in Section 4.1 to pre-calculate scores for each client with a debit or credit card. Training of a full six-model ensemble took about 4 hours on a Tesla P100 GPU. It took about 17 minutes to score 1 million customers on a Tesla P100 GPU, and the inference time scales linearly with the number of clients.
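
Because inference time scales linearly, the reported figure of about 17 minutes per 1 million customers can be extrapolated with simple arithmetic, as in the sketch below. The function name is an assumption and the result is only as precise as the rounded figure in the text.

```python
SECONDS_PER_MILLION = 17 * 60  # about 17 minutes per 1 million customers (from the text)


def estimated_scoring_seconds(n_clients):
    """Linear extrapolation of scoring time on a single Tesla P100 GPU."""
    return n_clients * SECONDS_PER_MILLION / 1_000_000


# Roughly 980 customers per second at the reported rate.
clients_per_second = 1_000_000 / SECONDS_PER_MILLION
seconds_for_5m = estimated_scoring_seconds(5_000_000)
```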

These scores were used to make decisions about credit applications for tens of thousands of applicants during one month and the early results are very promising.

The potential financial gain was measured for the case in which our model is used instead of the current scoring model for the applicants with enough transactional data. The preliminary financial results are measured in the millions of dollars per year, which constitutes a very significant result for a bank of this type and size.

5. Results

Table 5 presents the main results of the experiments described in Section 4.

Table 5. Experiment comparison
Model ROC AUC N Features
Logistic regression 0.78 400
LGBM 0.81 7000
E.T.-RNN 0.83 12

As shown in Table 5, E.T.-RNN significantly outperformed the baselines on our data. Moreover, one of the crucial features of our approach is that we did not have to do feature engineering for our method, unlike the classical methods, which rely heavily on hand-crafted features (e.g. 400 features for logistic regression and 7000 features for LGBM).

5.1. Training dataset size

Note that the results presented in Table 5 were achieved on the full dataset described in Section 4.1. We also conducted a series of experiments to estimate model performance for different dataset sizes. As Figure 4 shows, LGBM outperforms our approach for small volumes of data, measured by the number of applications (on the X axis). However, given enough data, the E.T.-RNN method significantly outperforms the classical approaches. This observation is in line with the common understanding that neural networks tend to outperform classical methods on large datasets.

Also note that E.T.-RNN has a steeper learning curve than LGBM. Hence we expect the performance gap to increase even further with more available data.

Figure 4. E.T.-RNN has a steeper learning curve than LGBM.

5.2. Transaction count

The performance of our model depends heavily on the number of available transactions per client. As Figure 5 shows, scoring quality increases until we reach around 350 transactions. Beyond this level, the performance increase due to additional transactions is small enough to be overshadowed by statistical variations in the data. Furthermore, the share of clients having more than 350 transactions is about 50 percent for our dataset. This means that our model achieves a significant hit rate when scoring clients of the bank. On the other hand, our method is still effective even for applicants with a low number of transactions. For clients with more than 25 transactions (about 95 percent of the total number of clients), we reach a ROC AUC of 0.825.

Figure 5. Classification quality vs number of transactions
Figure 6. Classification quality for customers grouped by number of transactions

6. Discussion

Our method worked well for the following reasons:

  • Reasonably large number of customers in the training dataset. Neural networks have many learnable parameters compared to the classical approaches and, hence, require more data than classical methods. This is also true in our case, as presented in Figure 4.

  • Low-level, granular data. Our data can be described as a series of events, each of which consists of several variables. Note that if the data structure is relatively simple, our method may not work better than the traditional approaches. For example, for the data from the application questionnaire there is no need for sophisticated neural network models. Even classical ML approaches, like logistic regression, would work reasonably well on data with a simple table-like structure.

  • High-frequency data (as discussed before, more than 80 percent of customers have at least 100 transactions).

Our method worked because we applied a sophisticated neural network method (as discussed in Section 4.3, we tried numerous other DL-based approaches, and many of them did not work that well) to data exhibiting the aforementioned characteristics.

To summarize, our E.T.-RNN approach would possibly work better than classical methods in cases where the data is in a low-level, granular form and there is enough data to train a complex neural-network-based model.

7. Conclusions

In this paper we proposed a novel E.T.-RNN method, which allows fine-grained transactional data to be used for credit scoring. We tested our method against the benchmarks on the historical data and achieved superior performance. We also conducted a pilot study on banking customers and produced significant financial gains for the bank.

The significant advantage of our approach is that even complex multivariate time-series data can be directly used for training without any need for feature design. As was demonstrated in (Erhan et al., 2009), the neural network learns meaningful internal representations of the input data during training, and this drastically reduces the need to generate hundreds or even thousands of hand-crafted aggregate features, as is typically done in credit scoring applications. This means that our method does not require any significant domain-specific expertise for feature design. Also, our model works exclusively on the transactional data and therefore does not require any additional input from the client. This means that we can make credit loan decisions very fast, ideally in near real-time, because the whole credit scoring process is fully automated. Moreover, information in the transactional data is exceptionally hard to forge. Hence, there is no need for costly checks of the correctness of such data, unlike data provided by the client or obtained from some other sources. Still another advantage of our method is that even a person without any credit history can be reliably assessed for creditworthiness, his or her transactional history constituting a source for estimating credit risks. Finally, this method provides a fair approach to credit decision making because it does not rely on personal demographic information of an individual and, therefore, cannot discriminate against applicants based on various demographic factors. For all these reasons, we believe that the proposed credit scoring approach has the potential to disrupt current loan practices in the retail banking industry.

One issue with our method is its lack of interpretability. Neural networks constitute black-box models by their nature. The ability to produce rich models on top of a raw data representation is the main strength of neural networks. But this ability also leads to a significant interpretability problem, which is the main weakness of complex models. Also note that this issue is applicable not only to our method, but also to most other advanced machine learning methods, such as Gradient Boosting Machine, since they also suffer from the lack of interpretability.

Different organizations around the world have different philosophies regarding applying black-box models in credit scoring. In some countries, the use of models lacking interpretability is considered less appropriate, while in other parts of the world it is considered more appropriate. We believe that this issue will be less relevant moving forward because of the significant progress in solving the black-box interpretation problem, including that of neural networks, that has been achieved over the past few years; (Choi et al., 2016), (Gupta and Schütze, 2018) and (McCoy et al., 2019) are some examples of the recent work. Based on this progress, we expect the black-box interpretation problem to be addressed successfully moving forward.

As future work, we plan to study more effective methods of regularization, which would allow us to use the available data more effectively. Furthermore, we plan to focus on more effective ways to integrate time into our model. In particular, our model is not sensitive to shifting all of a customer's transactions in time (e.g. shifting them back by one month), and we plan to work on this problem. Finally, we also plan to work on other types of loans, such as mortgage loans, which differ from retail loans in several respects.

References

  • A. Abdallah, M. A. Maarof, and A. Zainal (2016) Fraud detection system: a survey. Journal of Network and Computer Applications 68, pp. 90–113. Cited by: §2.
  • J. Bastos (2008) Credit scoring with boosted decision trees. Cited by: §2.
  • T. Bellotti and J. Crook (2013) Forecasting and stress testing credit card default using dynamic models. International Journal of Forecasting 29 (4), pp. 563–574. Cited by: §2, §2, §4.1.
  • D. Björkegren and D. Grissen (2017) Behavior revealed in mobile phone usage predicts loan repayment. arXiv preprint arXiv:1712.05840. Cited by: §2.
  • L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §3.4.
  • B. Chi and C. Hsu (2012) A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing the performance of credit scoring model. Expert Systems with Applications 39 (3), pp. 2650–2661. Cited by: §2, §2.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. External Links: Link, 1406.1078 Cited by: §3.2.2.
  • E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun (2016) RETAIN: interpretable predictive model in healthcare using reverse time attention mechanism. CoRR abs/1608.05745. External Links: Link, 1608.05745 Cited by: §1, §7.
  • D. Durand (1941) Credit-rating formulae. See Risk elements in consumer instalment financing, technical edition, Durand, pp. 83–91. External Links: Link Cited by: §2.
  • D. Durand (1941) Risk elements in consumer instalment financing, technical edition. Book, NBER, National Bureau of Economic Research. External Links: Link Cited by: D. Durand (1941).
  • D. Erhan, Y. Bengio, A. Courville, and P. Vincent (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1. Cited by: §7.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: §4.2.
  • F. A. Gers, J. Schmidhuber, and F. Cummins (1999) Learning to forget: continual prediction with LSTM. Cited by: §2.
  • D. Glennon, N. M. Kiefer, C. E. Larson, and H. Choi (2008) Development and validation of credit scoring models. Cited by: §4.1.
  • P. Gupta and H. Schütze (2018) LISA: explaining recurrent neural network judgments via layer-wise semantic accumulation and example to pattern transformation. arXiv preprint arXiv:1808.01591. Cited by: §1, §7.
  • C. Huang, M. Chen, and C. Wang (2007) Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications 33 (4), pp. 847–856. External Links: ISSN 0957-4174, Document, Link Cited by: §2.
  • G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger (2017) Snapshot ensembles: train 1, get M for free. CoRR abs/1704.00109. External Links: Link, 1704.00109 Cited by: 3rd item.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In NIPS, Cited by: §4.2.
  • A. E. Khandani, A. J. Kim, and A. W. Lo (2010) Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance 34 (11), pp. 2767–2787. Cited by: §2, §2, §4.1.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with restarts. CoRR abs/1608.03983. External Links: Link, 1608.03983 Cited by: 2nd item.
  • B. Lund (2016) Weight of evidence coding and binning of predictors in logistic regression. MidWest SAS Users Group conference proceedings. Cited by: §4.2.
  • P. Makowski (1985) Credit scoring branches out. Credit World 75 (1), pp. 30–37. Cited by: §2.
  • R. T. McCoy, T. Linzen, E. Dunbar, and P. Smolensky (2019) RNNs implicitly implement tensor-product representations. In International Conference on Learning Representations, External Links: Link Cited by: §1, §7.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, Cited by: §3.2, §3.2.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §4.3.1.
  • N. Siddiqi (2005) Credit risk scorecards: developing and implementing intelligent credit scoring. Cited by: §4.1.
  • L. N. Smith (2017) Cyclical learning rates for training neural networks. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 464–472. Cited by: §4.3.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 3104–3112. External Links: Link Cited by: §3.2.2.
  • E. Tobback and D. Martens (2017) Retail credit scoring using fine-grained payment data. Working Papers University of Antwerp, Faculty of Business and Economics. External Links: Link Cited by: §2, §2.
  • H. Kvamme, N. Sellereite, K. Aas, and S. Sjursen (2018) Predicting mortgage default using convolutional neural networks. Expert Systems with Applications 102, pp. 207–217. External Links: ISSN 0957-4174, Document, Link Cited by: §2, §2, §4.1.
  • D. West (2000) Neural network credit scoring models. Computers & Operations Research 27 (11-12), pp. 1131–1152. Cited by: §2.
  • B. Wiese and C. Omlin (2009) Credit card transactions, fraud detection, and machine learning: modelling time with LSTM recurrent neural networks. In Innovations in Neural Information Paradigms and Applications, pp. 231–268. External Links: ISBN 978-3-642-04003-0, Document, Link Cited by: §2.
  • J. C. Wiginton (1980) A note on the comparison of logit and discriminant models of consumer credit behavior. Journal of Financial and Quantitative Analysis 15 (03), pp. 757–770. External Links: Link Cited by: §2.
  • Y. Zhang, D. Wang, Y. Chen, H. Shang, and Q. Tian (2017) Credit risk assessment based on long short-term memory model. In International Conference on Intelligent Computing, pp. 700–712. Cited by: §2.