MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning

  • 2019-08-12 07:11:36
  • Jeremy Charlier, Gaston Ormazabal, Radu State, Jean Hilger


Jeremy Charlier (1, 2), Gaston Ormazabal (2), Radu State (1), Jean Hilger (3)

1 University of Luxembourg, L-1855 Luxembourg, Luxembourg
2 Columbia University, New York NY 10027, USA
3 BCEE, L-1160 Luxembourg, Luxembourg

Abstract

Reinforcement learning has become one of the best approaches to train a computer game emulator capable of human-level performance. In a reinforcement learning approach, an optimal value function is learned across a set of actions, or decisions, that lead to a set of states giving different rewards, with the objective of maximizing the overall reward. A policy assigns to each state-action pair an expected return. We call an optimal policy a policy for which the value function is optimal. QLBS, Q-Learner in the Black-Scholes(-Merton) Worlds, applies the reinforcement learning concepts, and notably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. It is, however, specifically optimized for the geometric Brownian motion and the vanilla options. Its range of application is, therefore, limited to vanilla option pricing within the financial markets. We propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement learning approach that determines the optimal policy of money management based on the aggregated financial transactions of the clients. It unlocks new frontiers to establish personalized credit card limits or bank loan applications, targeting the retail banking industry. MQLV extends the simulation to mean reverting stochastic diffusion processes and it uses a digital function, a Heaviside step function expressed in its discrete form, to estimate the probability of a future event such as a payment default. In our experiments, we first show the similarities between a set of historical financial transactions and Vasicek generated transactions and, then, we underline the potential of MQLV on generated Monte Carlo simulations. Finally, MQLV is the first Q-learning Vasicek-based methodology addressing transparent decision making processes in retail banking.

Keywords: Q-Learning, Monte Carlo, Payment Transactions.

1 Introduction

A major goal of the reinforcement learning (RL) and machine learning (ML) communities is to build efficient representations of the current environment to solve complex tasks. In RL, an agent relies on multiple sensory inputs and past experience to derive a set of plausible actions to solve a new situation [mnih2013playing]. While the initial idea behind RL is not new [sutton1984temporal, watkins1989learning, williams1987class], significant progress has been achieved recently by combining neural networks and Deep Learning (DL) with RL. The progress of DL [krizhevsky2012imagenet, sermanet2013pedestrian] has allowed the development of a novel agent combining RL with a class of deep artificial neural networks [mnih2013playing, mnih2015human], resulting in the Deep Q Network (DQN). The Q refers to the Q-learning algorithm introduced in [watkins1992q]. It is an incremental method that successively improves its evaluations of the quality of the state-action pairs. The DQN approach achieves human-level performance on Atari video games using unprocessed pixels as inputs. In [van2016deep], deep RL with double Q-Learning was proposed to challenge the DQN approach while trying to reduce the overestimation of the action values, a well-known drawback of the Q-learning and DQN methodologies. The extension of the DQN approach from discrete to continuous action domains, directly from raw pixel inputs, was successfully achieved for various simulated tasks [lillicrap2015continuous].

Nonetheless, most of the proposed models focused on game theory and computer game simulation, and very few addressed the financial world. In QLBS [halperin2017qlbs], an RL approach is applied to the Black, Scholes and Merton (BSM) financial framework for derivatives [black1973pricing, merton1973theory], a cornerstone of modern quantitative finance. In the BSM model, the dynamic of a stock market is defined as following a Geometric Brownian Motion (GBM) to estimate the price of a vanilla option on a stock [wilmott2013paul]. A vanilla option is an option that gives the holder the right to buy or sell the underlying asset, a stock, at maturity for a certain price, the strike price. QLBS is one of the first approaches to propose a complete RL framework for finance. As mentioned by the author, a certain number of topics are, however, not covered in the approach. For instance, it is specifically designed for vanilla options and it does not address any other type of financial application. Additionally, the initial generated paths rely on the popular GBM, but there exists a significant number of other popular stochastic models depending on the market dynamics [hull2003options].

In this work, we describe an RL approach tailored for personal recommendation in retail banking regarding money management, to be used for loan applications or credit card limits. The method is part of a banking strategy trying to reduce the customer churn in the context of a competitive retail banking market. We rely on the Q-learning algorithm and on a mean reverting diffusion process to address this topic. It leads ultimately to a fitted Q-iteration update and a model-free and off-policy setting. The diffusion process reflects the time series observed in retail banking such as transaction payments or credit card transactions. Such data is, however, strictly confidential and protected by the regulators and, therefore, it cannot be released publicly. Furthermore, we introduce a new terminal digital function, $\Pi$, defined as a Heaviside step function in its discrete form for a discrete variable $n$. The digital function is at the core of our approach for retail banking since it can evaluate the future probability of an event including, for instance, the future default probability of a client based on their spending. Our method converges to an optimal policy, and to optimal sets of actions and states, respectively the spending and the available money. The retail banks can, consequently, determine the optimal policy of money management based on the aggregated financial transactions of the clients. The banks are able to compare the MQLV's optimal policy with the individual policy of each client. It contributes to an unbiased decision making process while offering transparency to the client. Our main contributions are summarized below:

  • A new RL framework called MQLV, Modified Q-Learning for Vasicek, extending the initial QLBS framework [halperin2017qlbs]. MQLV uses the theoretical foundations of RL and Q-Learning to build a financial RL framework based on a mean reverting diffusion process, the Vasicek model [vasicek1977equilibrium], to simulate data, in order to ultimately reach a model-free and off-policy RL setting.

  • The definition of a digital function to estimate the future probability of an event. The aim is to widen the application perspectives of MQLV by using a characteristic terminal function that is usable for a decision making process in retail banking, such as the estimation of the default probability of a client.

  • The first application of Q-learning to determine the clients' optimal policy of money management in retail banking. MQLV leverages the clients' aggregated financial transactions to define the optimal policy of money management, targeting the risk estimation of bank loan applications or credit cards.

The paper is structured as follows. In section 2, we review QLBS and the Q-Learning formulations derived by Halperin in [halperin2017qlbs] in the context of the Black, Scholes and Merton model. In section 3, we describe MQLV according to the Q-Learning algorithm that leads to a model-free and off-policy setting. We highlight experimental results in section 4. We discuss related works in section 5 and we conclude in section 6 by addressing promising directions for future work.

2 Background

We denote by $A_t \in \mathcal{A}$ the action taken at time $t$ for a given state $X_t \in \mathcal{X}$ and by $R_{t+1}$ the immediate reward. The ongoing state is denoted by $X_t \in \mathcal{X}$ and the stochastic diffusion process by $S_t \in \mathcal{S}$ at time $t$. The discount factor that trades off the importance of immediate and later rewards is expressed by $\gamma \in [0;1]$.

We recall that a policy is a mapping from states to probabilities of selecting each possible action [sutton2018reinforcement]. Following the notations of [halperin2017qlbs], the policy $\pi$ such that

$$\pi : \{0, \ldots, T-1\} \times \mathcal{X} \to \mathcal{A} \tag{1}$$

maps at time $t$ the current state $X_t = x_t$ into the action $a_t \in \mathcal{A}$.

$$a_t = \pi(t, x_t) \tag{2}$$

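The value of a state $x$ under a policy $\pi$, denoted by $v_\pi(x)$ when starting in $x$ and following $\pi$ thereafter, is called the state-value function for policy $\pi$.

$$v_\pi(x) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, X_t = x\right] \tag{3}$$

The action-value function, $q_\pi(x, a)$, for policy $\pi$ defines the value of taking action $a$ in state $x$ under a policy $\pi$ as the expected return starting from $x$, taking the action $a$, and thereafter following policy $\pi$.

$$q_\pi(x, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, X_t = x, A_t = a\right] \tag{4}$$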

The optimal policy, $\pi_t^*$, is the policy that maximizes the state-value function.

$$\pi_t^*(X_t) = \arg\max_{\pi} V_t^{\pi}(X_t) \tag{5}$$

The optimal state-value function, $V_t^*$, satisfies the Bellman optimality equation such that

$$V_t^*(X_t) = \mathbb{E}_t^{\pi^*}\left[R_t\big(X_t, u_t = \pi_t^*(X_t), X_{t+1}\big) + \gamma V_{t+1}^*(X_{t+1})\right]. \tag{6}$$

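The Bellman equation for the action-value function, the Q-function, is defined as

$$Q_t^{\pi}(x, a) = \mathbb{E}_t\left[R_t(X_t, a_t, X_{t+1}) \mid X_t = x, a_t = a\right] + \gamma\, \mathbb{E}_t^{\pi}\left[V_{t+1}^{\pi}(X_{t+1}) \mid X_t = x\right]. \tag{7}$$

The optimal action-value function, $Q_t^*$, is obtained for the optimal policy with

$$\pi_t^* = \arg\max_{\pi} Q_t^{\pi}(x, a). \tag{8}$$

The optimal state-value and action-value functions are connected by the following system of equations.

$$\begin{cases} V_t^* = \max_{a} Q_t^*(x, a) \\ Q_t^* = \mathbb{E}_t\left[R_t(X_t, a, X_{t+1})\right] + \gamma\, \mathbb{E}_t\left[V_{t+1}^*(X_{t+1}) \mid X_t = x\right] \end{cases} \tag{9}$$

Therefore, we can obtain the Bellman optimality equation.

$$Q_t^*(x, a) = \mathbb{E}_t\left[R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^*(X_{t+1}, a_{t+1}) \,\middle|\, X_t = x, a_t = a\right] \tag{10}$$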

Using the Robbins-Monro update [robbins1985stochastic], the update rule for the optimal Q-function with on-line Q-learning on the data point $(X_t^{(n)}, a_t^{(n)}, R_t^{(n)}, X_{t+1}^{(n)})$ is expressed by the following equation, with $\alpha$ a constant step-size parameter.

$$Q_t^{*,k+1}(X_t, a_t) = (1 - \alpha^k)\, Q_t^{*,k}(X_t, a_t) + \alpha^k \left[R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^{*,k}(X_{t+1}, a_{t+1})\right] \tag{11}$$
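As a minimal sketch of the update rule (11), the following Python function applies one Robbins-Monro step to a tabular, time-indexed Q-function on a single transition. The dictionary layout `q[t][(x, a)]`, the function name `q_update` and the toy numbers are assumptions made for illustration; they are not taken from the MQLV code.

```python
def q_update(q, t, x, a, reward, x_next, actions, alpha, gamma):
    """One on-line Q-learning step for a finite-horizon, time-indexed table.

    q[t][(x, a)] holds the current estimate of Q_t(x, a); missing entries
    are treated as 0. Returns the updated value of Q_t(x, a).
    """
    # Greedy value of the next state under the current estimate of Q_{t+1}.
    next_table = q.get(t + 1, {})
    best_next = max(next_table.get((x_next, b), 0.0) for b in actions)
    target = reward + gamma * best_next

    # Convex combination of the old estimate and the new target.
    table = q.setdefault(t, {})
    old = table.get((x, a), 0.0)
    table[(x, a)] = (1.0 - alpha) * old + alpha * target
    return table[(x, a)]


# Toy usage: two actions, one known row of Q_1, an empty row for Q_0.
q = {1: {("s1", 0): 1.0, ("s1", 1): 2.0}}
new_value = q_update(q, t=0, x="s0", a=1, reward=0.5, x_next="s1",
                     actions=[0, 1], alpha=0.1, gamma=0.9)
```

In this toy case the target is 0.5 + 0.9 × 2.0 = 2.3, so the first update moves the estimate from 0 to 0.1 × 2.3.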

3 Algorithm

We describe, in this section, how to derive a general recursive formulation for the optimal action. It is equivalent to an optimal hedge under a financial framework such as, for instance, portfolio or personal finance optimization. We additionally present the formulation of the action-value function, the Q-function. Both the optimal hedge and the Q-function follow the assumption of a continuous space scenario generated by the Vasicek model with Monte Carlo simulation.

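By relying on the financial framework established in [halperin2017qlbs], we consider a mean reverting diffusion process, also known as the Vasicek model [vasicek1977equilibrium].

$$dS_t = \kappa (b - S_t)\, dt + \sigma\, dB_t \tag{12}$$

The term $\kappa$ is the speed of mean reversion, $b$ the long term mean level, $\sigma$ the volatility and $B_t$ the Brownian motion. The solution of the stochastic equation is equal to

$$S_t = S_0 e^{-\kappa t} + b\,(1 - e^{-\kappa t}) + \sigma e^{-\kappa t} \int_0^t e^{\kappa s}\, dB_s. \tag{13}$$

Therefore, we define a new time-uniform state variable, i.e. without a drift, as

$$\begin{cases} S_t = X_t + S_0 e^{-\kappa t} + b\,(1 - e^{-\kappa t}) \\ \text{with } X_t = S_t - \left[S_0 e^{-\kappa t} + b\,(1 - e^{-\kappa t})\right] = \sigma e^{-\kappa t} \int_0^t e^{\kappa s}\, dB_s. \end{cases} \tag{14}$$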
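The solution (13) implies that, over a step of length $\Delta t$, $S_{t+\Delta t}$ is Gaussian with mean $S_t e^{-\kappa \Delta t} + b(1 - e^{-\kappa \Delta t})$ and variance $\sigma^2 (1 - e^{-2\kappa \Delta t}) / (2\kappa)$. The sketch below uses this exact transition to generate Monte Carlo paths and then removes the drift as in (14). The function names and the choice of 2,000 paths are illustrative assumptions, and the sketch requires $\kappa > 0$; it is not the generator of the published repository.

```python
import math
import random


def simulate_vasicek(s0, kappa, b, sigma, maturity, n_steps, n_paths, seed=0):
    """Simulate Vasicek paths with the exact Gaussian transition (kappa > 0)."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    decay = math.exp(-kappa * dt)
    # Standard deviation of the exact transition over one time step.
    step_std = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * kappa))
    paths = []
    for _ in range(n_paths):
        path = [s0]
        for _ in range(n_steps):
            mean = path[-1] * decay + b * (1.0 - decay)
            path.append(mean + step_std * rng.gauss(0.0, 1.0))
        paths.append(path)
    return paths


def drift(s0, kappa, b, t):
    """Deterministic part of the solution (13)."""
    return s0 * math.exp(-kappa * t) + b * (1.0 - math.exp(-kappa * t))


# Parameters of the experimental setup; the number of paths is reduced here.
S0, KAPPA, B, SIGMA, T, N_STEPS = 1.0, 0.01, 1.0, 0.15, 0.5, 5
paths = simulate_vasicek(S0, KAPPA, B, SIGMA, T, N_STEPS, n_paths=2000)

# Drift-free state variable X_t of equation (14), path by path.
dt = T / N_STEPS
x_paths = [[s - drift(S0, KAPPA, B, i * dt) for i, s in enumerate(p)]
           for p in paths]
```

The exact transition avoids the discretization bias of an Euler scheme, which matters when only a few time steps are used.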

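Instead of estimating the price of a vanilla option as proposed in [halperin2017qlbs], we are interested in estimating the future probability of an event using the Q-learning algorithm and a digital function. First, we define the corresponding terminal condition with the following equation

$$Q_T^*(X_T, a_T = 0) = -\Pi_T - \lambda\, \mathrm{Var}\left[\Pi_T(X_T)\right] \tag{15}$$

where $\Pi_T$ is the digital function at time $t = T$, defined for a strike value $K$ such that

$$\Pi_T = \mathbb{1}_{S_T \geq K} = \begin{cases} 1 & \text{if } S_T \geq K \\ 0 & \text{otherwise} \end{cases} \tag{16}$$

and the second term, $\lambda\, \mathrm{Var}[\Pi_T(X_T)]$, is a regularization term with $\lambda \to 0^{+}$. We use a backward loop to determine the value of $\Pi_t$ for $t = T-1, \ldots, 0$.

$$\Pi_t = \gamma\,(\Pi_{t+1} - a_t \Delta S_t) \quad \text{with} \quad \Delta S_t = S_{t+1} - \frac{S_t}{\gamma} = S_{t+1} - e^{r \Delta t} S_t \tag{17}$$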
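The terminal condition (16) and the backward recursion (17) can be sketched for a single path as follows. The action sequence is taken as given here (in MQLV it is the learned optimal action), and the function names, the toy path and the zero rate are illustrative assumptions.

```python
import math


def digital_payoff(s_terminal, strike):
    """Discrete Heaviside step of equation (16)."""
    return 1.0 if s_terminal >= strike else 0.0


def backward_portfolio(path, actions, strike, r, dt):
    """Roll Pi_t backward along one path with equation (17).

    path has n_steps + 1 values, actions has n_steps values a_0..a_{T-1}.
    """
    gamma = math.exp(-r * dt)
    n_steps = len(path) - 1
    pi = [0.0] * (n_steps + 1)
    pi[n_steps] = digital_payoff(path[-1], strike)
    for t in range(n_steps - 1, -1, -1):
        delta_s = path[t + 1] - path[t] / gamma
        pi[t] = gamma * (pi[t + 1] - actions[t] * delta_s)
    return pi


path = [1.00, 1.02, 0.99, 1.03]
pi_no_action = backward_portfolio(path, [0.0, 0.0, 0.0], strike=1.0, r=0.0, dt=0.1)
pi_half = backward_portfolio(path, [0.5, 0.5, 0.5], strike=1.0, r=0.0, dt=0.1)
```

With zero actions and a zero rate, the recursion simply carries the terminal digital value back to $t = 0$; with a constant action of 0.5 the value at $t = 0$ is reduced by half of the total move of the path.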

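Following the definition of the equations (6) and (17), we express the one-step time dependent random reward with respect to the cross-sectional information $\mathcal{F}_t$ as follows.

$$R_t(X_t, a_t, X_{t+1}) = \gamma a_t \Delta S_t(X_t, X_{t+1}) - \lambda\, \mathrm{Var}\left[\Pi_t \mid \mathcal{F}_t\right] \quad \text{with} \quad \mathrm{Var}\left[\Pi_t \mid \mathcal{F}_t\right] = \gamma^2\, \mathbb{E}_t\left[\hat{\Pi}_{t+1}^2 - 2 a_t \Delta \hat{S}_t \hat{\Pi}_{t+1} + a_t^2 \Delta \hat{S}_t^2\right] \tag{18}$$

The term $\Delta \bar{S}_t$ is defined such that $\Delta \bar{S}_t = \frac{1}{N} \sum_{k=1}^{N} \Delta S_t^k$, $\Delta \hat{S}_t = \Delta S_t - \Delta \bar{S}_t$ and $\hat{\Pi}_{t+1} = \Pi_{t+1} - \bar{\Pi}_{t+1}$ with $\bar{\Pi}_{t+1} = \frac{1}{N} \sum_{k=1}^{N} \Pi_{t+1}^k$. Because of the regularizer term, the expected reward $R_t$ is quadratic in $a_t$ and has a finite solution. Therefore, we inject the one-step time dependent random reward equation (18) into the Bellman optimality equation (10) to obtain the following Q-learning update, $Q_t^*$, and the optimal action, $a_t^*$, to be solved within a backward loop $t = T-1, \ldots, 0$.

$$\begin{aligned} Q_t^*(X_t, a_t) &= \gamma\, \mathbb{E}_t\left[Q_{t+1}^*(X_{t+1}, a_{t+1}^*) + a_t \Delta S_t\right] - \lambda\, \mathrm{Var}\left[\Pi_t \mid \mathcal{F}_t\right] \\ a_t^*(X_t) &= \mathbb{E}_t\left[\Delta \hat{S}_t \hat{\Pi}_{t+1} + \frac{1}{2 \lambda \gamma} \Delta S_t\right] \left[\mathbb{E}_t\left[(\Delta \hat{S}_t)^2\right]\right]^{-1} \end{aligned} \tag{19}$$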
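The closed form of the optimal action in (19) can be checked directly from the reward (18). The short derivation below is a sketch that uses only the quantities defined above and relies on the fact that the term in $Q_{t+1}$ of (19) does not depend on $a_t$, so that maximizing the Q-function over $a_t$ amounts to maximizing the expected reward.

```latex
\begin{aligned}
\mathbb{E}_t[R_t]
  &= \gamma a_t \mathbb{E}_t[\Delta S_t]
   - \lambda \gamma^2 \mathbb{E}_t\!\left[\hat{\Pi}_{t+1}^2
   - 2 a_t \Delta\hat{S}_t \hat{\Pi}_{t+1}
   + a_t^2 \Delta\hat{S}_t^2\right], \\
\frac{\partial\, \mathbb{E}_t[R_t]}{\partial a_t}
  &= \gamma\, \mathbb{E}_t[\Delta S_t]
   + 2 \lambda \gamma^2\, \mathbb{E}_t[\Delta\hat{S}_t \hat{\Pi}_{t+1}]
   - 2 \lambda \gamma^2 a_t\, \mathbb{E}_t[\Delta\hat{S}_t^2] = 0, \\
a_t^*
  &= \frac{\mathbb{E}_t\!\left[\Delta\hat{S}_t \hat{\Pi}_{t+1}
   + \frac{1}{2 \lambda \gamma} \Delta S_t\right]}
          {\mathbb{E}_t\!\left[(\Delta\hat{S}_t)^2\right]}.
\end{aligned}
```

The second derivative, $-2\lambda\gamma^2\, \mathbb{E}_t[\Delta\hat{S}_t^2]$, is negative for $\lambda > 0$, so the stationary point is a maximum.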

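We refer to [halperin2017qlbs] for further details about the analytical solution, $a_t^*$, of the Q-learning update (19). Our approach uses the $N$ Monte Carlo paths simultaneously to determine the optimal action $a^*$ and the optimal action-value function $Q^*$ to learn the policy $\pi$. Thus, we do not need an explicit conditioning on $X_t$ at time $t$. We assume a set of basis functions $\{\Phi_n(x)\}$ on which the optimal action $a_t^*(X_t)$ and the optimal action-value function, $Q_t^*(X_t, a_t^*)$, can be expanded.

$$a_t^*(X_t) = \sum_{n}^{M} \phi_{nt} \Phi_n(X_t) \quad \text{and} \quad Q_t^*(X_t, a_t^*) = \sum_{n}^{M} \omega_{nt} \Phi_n(X_t) \tag{20}$$

The coefficients $\phi$ and $\omega$ are computed recursively backward in time $t = T-1, \ldots, 0$. Subsequently, we define the minimization problem to evaluate $\phi_{nt}$.

$$G_t(\phi) = \sum_{k=1}^{N} \left[-\sum_{n}^{M} \phi_{nt} \Phi_n(X_t^k) \Delta S_t^k + \gamma \lambda \left(\hat{\Pi}_{t+1}^k - \sum_{n}^{M} \phi_{nt} \Phi_n(X_t^k) \Delta \hat{S}_t^k\right)^2\right] \tag{21}$$

The equation (21) leads to the following set of linear equations for $n = 1, \ldots, M$.

$$\begin{cases} A_{nm}^{(t)} = \sum_{k=1}^{N} \Phi_n(X_t^k) \Phi_m(X_t^k) (\Delta \hat{S}_t^k)^2 \\ B_n^{(t)} = \sum_{k=1}^{N} \Phi_n(X_t^k) \left[\hat{\Pi}_{t+1}^k \Delta \hat{S}_t^k + \frac{1}{2 \gamma \lambda} \Delta S_t^k\right] \end{cases} \quad \text{with} \quad \sum_{m}^{M} A_{nm}^{(t)} \phi_{mt} = B_n^{(t)} \tag{22}$$

Therefore, the coefficients of the optimal action $a_t^*(X_t)$ are determined by

$$\phi_t^* = A_t^{-1} B_t. \tag{23}$$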
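The linear system (22) and its solution (23) can be sketched in plain Python for one time step. The two-function basis $\{1, x\}$, the helper names and the toy cross-section are assumptions chosen only to keep the example small; the basis family used by the authors is not specified in this section.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def optimal_action_coefficients(x_t, delta_s, pi_next, basis, gamma, lam):
    """Build A_t and B_t of equation (22) and return phi_t of equation (23)."""
    n = len(x_t)
    ds_bar = sum(delta_s) / n
    pi_bar = sum(pi_next) / n
    ds_hat = [d - ds_bar for d in delta_s]   # centred increments
    pi_hat = [p - pi_bar for p in pi_next]   # centred digital values
    feats = [[f(x) for f in basis] for x in x_t]
    m = len(basis)
    a_mat = [[sum(feats[k][i] * feats[k][j] * ds_hat[k] ** 2 for k in range(n))
              for j in range(m)] for i in range(m)]
    b_vec = [sum(feats[k][i] * (pi_hat[k] * ds_hat[k]
                                + delta_s[k] / (2.0 * gamma * lam))
                 for k in range(n)) for i in range(m)]
    return solve_linear(a_mat, b_vec)


def optimal_action(x, phi, basis):
    """Evaluate the expansion (20) of the optimal action at state x."""
    return sum(c * f(x) for c, f in zip(phi, basis))


# Toy cross-section of six paths; lam is set very large only so that the
# result is easy to check by hand (phi is then close to [2, 0]).
basis = [lambda x: 1.0, lambda x: x]
x_t = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
delta_s = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03]
pi_next = [2.0 * d for d in delta_s]
phi = optimal_action_coefficients(x_t, delta_s, pi_next, basis,
                                  gamma=1.0, lam=1e9)
```

In the toy data the next-step values are exactly twice the increments, so the regression part of (22) recovers a constant action of 2 once the $\frac{1}{2\gamma\lambda}\Delta S_t$ term is made negligible.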

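Hereinafter, we use Fitted Q Iteration (FQI) [hasselt2010double, murphy2005generalization] to evaluate the coefficients $\omega$. The optimal action-value function, $Q^*(X_t, a_t)$, is represented in its matrix form according to the basis function expansion of the equation (20), where $A_t = (1, a_t, \frac{1}{2} a_t^2)^T$ denotes here the vector of action monomials.

$$Q_t^*(X_t, a_t) = \left(1, a_t, \tfrac{1}{2} a_t^2\right) \begin{pmatrix} W_{11}(t) & W_{12}(t) & \cdots & W_{1M}(t) \\ W_{21}(t) & W_{22}(t) & \cdots & W_{2M}(t) \\ W_{31}(t) & W_{32}(t) & \cdots & W_{3M}(t) \end{pmatrix} \begin{pmatrix} \Phi_1(X_t) \\ \vdots \\ \Phi_M(X_t) \end{pmatrix} = A_t^T W_t \Phi(X_t) = A_t^T U_W(t, X_t) \tag{24}$$

Based on the least-square optimization problem, the coefficients $W_t$ are determined by backward recursion for $t = T-1, \ldots, 0$ as follows

$$\mathcal{L}_t(W_t) = \sum_{k=1}^{N} \left(R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^*(X_{t+1}, a_{t+1}) - W_t \Psi_t(X_t, a_t)\right)^2 \quad \text{with} \quad W_t \Psi_t(X_t, a_t) + \epsilon \xrightarrow[\epsilon \to 0]{} R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^*(X_{t+1}, a_{t+1}) \tag{25}$$

for which we derive the following set of linear equations.

$$M_n^{(t)} = \sum_{k=1}^{N} \Psi_n(X_t^k, a_t^k) \left[\eta \left(R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1} \in \mathcal{A}} Q_{t+1}^*(X_{t+1}, a_{t+1})\right)\right] \quad \text{with} \quad \eta \sim B(N, p) \tag{26}$$

The term $B(N, p)$ represents the binomial distribution for $N$ samples with probability $p$. It plays the role of a dropout function when evaluating the matrix $M_t$, to compensate for the well-known drawback of the Q-learning algorithm, namely the overestimation of the Q-function values. The matrix $S_t$ is the matrix of the normal equations of the least-square problem (25), $S_{nm}^{(t)} = \sum_{k=1}^{N} \Psi_n(X_t^k, a_t^k) \Psi_m(X_t^k, a_t^k)$. We finally reach the definition of the optimal weights to determine the optimal action $a^*$.

$$W_t^* = S_t^{-1} M_t \tag{27}$$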
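One possible reading of the fit (25)-(27) is sketched below for a single time step. Two points are assumptions rather than statements of the source: the feature vector $\Psi$ is taken as the flattened outer product of $(1, a, \frac{1}{2}a^2)$ with the basis values, and the dropout term $\eta$ is applied as an independent 0/1 mask with keep probability `p` on the regression target of each path. The regression targets $R_t + \gamma \max Q_{t+1}^*$ are assumed to be precomputed, and all names are hypothetical.

```python
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def features(x, a, basis):
    """Psi(x, a): (1, a, a^2 / 2) times each basis value, flattened row-wise."""
    action_vec = (1.0, a, 0.5 * a * a)
    return [u * f(x) for u in action_vec for f in basis]


def fit_q_weights(xs, acts, targets, basis, p=1.0, seed=0):
    """Return the flattened weights W_t = S_t^{-1} M_t of equation (27)."""
    rng = random.Random(seed)
    psi = [features(x, a, basis) for x, a in zip(xs, acts)]
    dim = len(psi[0])
    # eta_k = 1 keeps the target of path k, eta_k = 0 drops it (assumption).
    eta = [1.0 if rng.random() < p else 0.0 for _ in targets]
    s_mat = [[sum(row[i] * row[j] for row in psi) for j in range(dim)]
             for i in range(dim)]
    m_vec = [sum(row[i] * e * y for row, e, y in zip(psi, eta, targets))
             for i in range(dim)]
    return solve_linear(s_mat, m_vec)


# Toy check on a 5 x 5 grid of (state, action) pairs with known weights.
basis = [lambda x: 1.0, lambda x: x]
grid = [(-1.0 + 0.5 * i, -1.0 + 0.5 * j) for i in range(5) for j in range(5)]
xs = [g[0] for g in grid]
acts = [g[1] for g in grid]
true_w = [0.5, 2.0, -1.0, 0.0, 0.5, 0.0]
targets = [sum(w * f for w, f in zip(true_w, features(x, a, basis)))
           for x, a in grid]
w_full = fit_q_weights(xs, acts, targets, basis, p=1.0)
w_masked = fit_q_weights(xs, acts, targets, basis, p=0.8)
```

With `p=1.0` the fit reduces to ordinary least squares and recovers the weights exactly; with `p<1` only $M_t$ is thinned while $S_t$ is unchanged, which on average shrinks the fitted Q-values, in line with the stated goal of countering overestimation.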

The proposed model does not require any assumption on the dynamics of the time series, neither transition probabilities nor policy or reward functions. It is an off-policy model-free approach. The computation of the optimal policy, the optimal action and the optimal Q-function that leads to the future event probabilities is summed up in algorithm 1.

Algorithm 1: Q-learning to evaluate the optimal policy of money management

Data: time series of maturity $T$, either from generated or true data
Result: optimal Q-function $Q^*$, optimal action $a^*$, value of digital function $\Pi$
begin
    /* Condition at T */
    $a_T^*(X_T) = 0$
    $Q_T(X_T, a_T) = -\Pi_T = -\mathbb{1}_{S_T \geq K}$ using equation (16)
    $Q_T^*(X_T, a_T^*) = Q_T(X_T, a_T)$
    /* Backward loop */
    for $t \leftarrow T-1$ to $0$ do
        /* Evaluate the coefficients $\phi$ */
        compute $A_t$, $B_t$ using equation (22)
        $\phi_t^* \leftarrow A_t^{-1} B_t$
        /* Evaluate the coefficients $\omega$ */
        compute $S_t$, $M_t$ using equation (26)
        $W_t^* \leftarrow S_t^{-1} M_t$
        $a_t^*(X_t) = \sum_{n}^{M} \phi_{nt}^* \Phi_n(X_t)$
        $Q^*(X_t, a_t) = A_t^T W_t^* \Phi(X_t)$
    /* Compute the digital function value to estimate the event probability at t = 0 */
    print $\Pi_0 = \mathrm{mean}(Q_0^*)$
    return

4 Experiments

We empirically evaluate the performance of MQLV. We initially highlight the similarities between historical payment transactions and Vasicek generated transactions. We then underline the MQLV’s capabilities to learn the optimal policy of money management based on the estimation of future event probabilities in comparison to the closed formula of [black1973pricing, merton1973theory], hereinafter denoted by BSM’s closed formula. We rely on synthetic data sets because of the privacy and the confidentiality issues of the retail banking data sets.

Data Availability and Data Description. One of our contributions is to bring an RL framework designed for retail banking. However, none of the original retail banking data sets can be released publicly because of the highly sensitive information they contain. We therefore show the similarities between a small sample of anonymized transactions and Vasicek generated transactions [vasicek1977equilibrium]. We then use the Vasicek mean reverting stochastic diffusion process to generate larger synthetic data sets similar to the original retail banking data sets. The mean reverting dynamic is particularly interesting since it reflects a wide range of retail banking transactions including the credit card transactions, the savings history or the clients' spending. Three different data sets were generated to avoid any bias that could have been introduced by using only one data set. We choose to differentiate the number of Monte Carlo paths between the data sets to assess the influence of the sampling size on the results. The first, second and third data sets contain respectively 20,000, 30,000 and 40,000 paths. We publicly release the generated data sets to ensure the reproducibility of the experiments; the code and the data sets are available at https://github.com/dagrate/MQLV.

Experimental Setup and Code Availability. In our experiments, we generate synthetic data sets using the Vasicek model with a parameter $S_0 = 1.0$ corresponding to the value of the time series at $t = 0$, a maturity of six months $T = 0.5$, a speed reversion $a = 0.01$, a long term mean $b = 1$ and a volatility $\sigma = 0.15$. Because the choice of the parameters of the Vasicek model does not have any influence on the results of the Q-learning approach, the numbers were fixed such that any limitations of the methodology would be quickly observed. The number of time steps is fixed to 5. We additionally use different strike values for the experiments, explicitly mentioned in the Results and Discussions subsection. The simulations were performed on a computer with 16GB of RAM, an Intel i7 CPU and a Tesla K80 GPU accelerator. To ensure the reproducibility of the experiments, the code is available at https://github.com/dagrate/MQLV.

Results and Discussions about MQLV. As aforementioned, we cannot publicly release an anonymized transactions data set because of privacy, confidentiality and regulatory issues. We consequently highlight the similarities between the dynamic of a small sample of anonymized transactions and Vasicek generated transactions for one client [santandercreditcards] in figure 1. The financial transactions in retail banking are periodic and often fluctuate around a long term mean, reflecting the frequency and the amounts of the spending habits of the clients. The similar dynamic of the original and the generated transactions is highlighted by the small RMSE of 0.03. We also performed a least square calibration of the Vasicek parameters to assess the model's plausibility. We can observe in table 1 that the Vasicek parameters have the same magnitude and, therefore, it supports the hypothesis that the Vasicek model could be used to generate synthetic transactions.

Figure: Samples of original and Vasicek generated transactions for one client. The two samples oscillate around a long term mean of 1 and have a similar pattern, highlighted by the small RMSE of 0.03 in table 1.

Table 1: RMSE between the samples of original transactions and generated Vasicek transactions of figure 1. We also calibrated the Vasicek parameters according to the original transactions to validate the model's plausibility.

Description                   Value
RMSE                          0.0335
Vasicek speed reversion a     0.5444
Vasicek long term mean b      0.9001
Vasicek volatility σ          0.2185
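A least square calibration of this kind can be sketched by regressing each observation on the previous one, since the exact Vasicek transition is a first-order autoregression. The sketch below is an assumption about the procedure, not the authors' calibration code: the time step `dt`, the function names and the synthetic series used for the check are all hypothetical, and the formulas require a fitted slope strictly between 0 and 1.

```python
import math
import random


def calibrate_vasicek(series, dt):
    """Estimate (speed reversion, long term mean, volatility) by OLS.

    Fits S_{i+1} = intercept + slope * S_i + noise and inverts the exact
    Vasicek transition; valid only when 0 < slope < 1.
    """
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_var = sum((yi - intercept - slope * xi) ** 2
                    for xi, yi in zip(x, y)) / (n - 2)
    speed = -math.log(slope) / dt
    mean_level = intercept / (1.0 - slope)
    vol = math.sqrt(resid_var * 2.0 * speed / (1.0 - slope ** 2))
    return speed, mean_level, vol


def rmse(u, v):
    """Root mean squared error between two equally long samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


# Synthetic check: generate a long series with the values of table 1 and
# verify that the calibration recovers them approximately.
rng = random.Random(1)
true_speed, true_mean, true_vol, dt = 0.5444, 0.9001, 0.2185, 1.0
decay = math.exp(-true_speed * dt)
step_std = true_vol * math.sqrt((1.0 - decay ** 2) / (2.0 * true_speed))
series = [true_mean]
for _ in range(20000):
    series.append(series[-1] * decay + true_mean * (1.0 - decay)
                  + step_std * rng.gauss(0.0, 1.0))
speed, mean_level, vol = calibrate_vasicek(series, dt)
```

The same `rmse` helper can be applied to an original sample and a generated sample of equal length to obtain a figure comparable to the first row of table 1.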

We present the core of our contribution in the following experiment. We aim at learning the optimal policy of money management. It is particularly interesting for bank loan applications where the differences between a client's spending policy and the optimal policy can be compared. We show that MQLV is capable of accurately evaluating the probability of a default event using a digital function, which highlights the learning of the optimal policy of money management. Indeed, if the MQLV's learned policy is different from the optimal policy, then the probabilities of default events are not accurate. In figure 1, the estimation of future event probabilities for different strike values is represented. We rely on the BSM's closed formula for the vanilla option pricing [black1973pricing, merton1973theory] to approximate the digital function values [hull2003options]. We used, therefore, the BSM's values as reference values to cross-validate the MQLV's values. MQLV achieves a close representation of the event probabilities for the different strike values in figure 1. The curves of both the MQLV and the BSM's approaches are similar, with an RMSE of 1.5016. This result highlights that the learned Q-learning policy of MQLV is sufficiently close to the optimal policy to compute event probabilities almost identical to the probabilities of the BSM's formula approximation.

Figure 1: Event probability values calculated by MQLV and BSM’s closed formula approximation for 40,000 Monte Carlo paths with Vasicek parameters a=0.01,b=1 and σ=0.15. The BSM’s closed formula approximation values are used as reference values. The event probabilities of MQLV are close to the BSM’s values with a total RMSE of 1.502. It illustrates that MQLV is able to learn the optimal policy leading to accurate event probabilities.

We gathered quantitative results in table 2 for a deeper analysis of the MQLV's results. The event probability values are listed for the three data sets. We chose a set of parameters for the Vasicek model such that our configuration is free of any time-dependency. We therefore expect a probability value of 50% at a threshold of 1 because the standard deviation of the generated data sets is only induced by the standard deviation of the normal distribution used to simulate the Brownian motion. Surprisingly, the MQLV values at a strike of 1 are closer to 50% than the BSM's values for all the data sets. We can conclude, subsequently, that, for our configuration, MQLV is capable of learning the optimal policy of money management, which is reflected by the accurate evaluation of the event probabilities.

Table 2: Valuation differences of the digital values for event probabilities according to different strikes between the BSM's closed formula approximation and MQLV. Given our time-uniform configuration, the event probability values should be close to 50% for a strike value of 1. The MQLV values are close to the theoretical target of 50% at a strike of 1, highlighting the MQLV's capabilities to learn the optimal policy. The BSM's closed formula approximation slightly underestimates the probability values.

Data Set   Number of Paths   Strike Values   BSM's Approx. Values (%)   MQLV Values (%)   Absolute Difference
1          20,000            0.92            76.810                     77.098            0.288
1          20,000            0.98            55.447                     57.920            2.473
1          20,000            1.00            47.867                     50.235            2.368
1          20,000            1.02            40.509                     42.865            2.356
2          30,000            0.92            76.810                     76.953            0.143
2          30,000            0.98            55.447                     57.760            2.313
2          30,000            1.00            47.867                     50.043            2.176
2          30,000            1.02            40.509                     42.744            2.235
3          40,000            0.92            76.810                     77.047            0.237
3          40,000            0.98            55.447                     57.491            2.044
3          40,000            1.00            47.867                     49.924            2.057
3          40,000            1.02            40.509                     42.713            2.204

Table 3: Event probabilities for data sets generated with different Vasicek parameters a and σ. The parameter b remains unchanged to keep a configuration free of any time-dependency and to facilitate the explainability of the results. We can deduce that MQLV is able to learn the optimal policy because the MQLV's probabilities are close to the theoretical target of 50% at a strike of 1. MQLV is also more accurate than BSM's formula.

Parameters a; b; σ   Number of Paths   Strike Values   BSM's Approx. Values (%)   MQLV Values (%)   Absolute Difference
0.01; 1; 0.10        50,000            0.98            59.856                     61.223            1.366
0.01; 1; 0.10        50,000            1.00            48.562                     50.001            1.439
0.01; 1; 0.10        50,000            1.02            37.596                     39.044            1.447
0.01; 1; 0.30        50,000            0.98            49.558                     53.647            4.089
0.01; 1; 0.30        50,000            1.00            45.767                     49.997            4.230
0.01; 1; 0.30        50,000            1.02            42.088                     46.194            4.106
0.10; 1; 0.15        50,000            0.98            55.447                     57.540            2.093
0.10; 1; 0.15        50,000            1.00            47.867                     50.015            2.148
0.10; 1; 0.15        50,000            1.02            40.509                     42.638            2.129
0.30; 1; 0.15        50,000            0.98            55.447                     57.586            2.139
0.30; 1; 0.15        50,000            1.00            47.867                     50.022            2.155
0.30; 1; 0.15        50,000            1.02            40.509                     42.542            2.033

We chose to generate three new data sets with new Vasicek parameters a and σ to underline the potential of MQLV and the universality of the results. In table 3, we computed the event probabilities for different strikes for the newly generated data sets. The parameter b remains unchanged since we want to keep a configuration free of any time-dependency. We notice that MQLV is capable of estimating a probability of 50% for a strike of 1, which can only be obtained if MQLV is able to learn the optimal policy. We also observe that the BSM's approximation leads to a lower accuracy. We showed in this experiment that our model-free and off-policy RL approach, MQLV, is able to learn the optimal policy, reflected by the accurate probability values, independently of the data sets considered and of the Vasicek parameters.

Limitations of the BSM's closed formula used for cross validation. In our experiments, we observed, surprisingly, that the BSM's closed formula approximation underestimates the event probability values. In our configuration, the volatility is the only parameter playing a significant role in the generation of the time series. The Brownian motion is simulated with a standard normal distribution, which is symmetric around its mean of 0, so the generated values are equally likely to end above or below the long term mean. The event probability at a strike of 1 should, therefore, be equal to 0.5. The BSM's closed formula did not, however, lead to a probability of 0.5 but to slightly smaller values because of the limits of its theoretical framework [black1973pricing, merton1973theory], which is built on the geometric Brownian motion rather than on a mean reverting process. Hence, we observed that MQLV was more accurate than the BSM's closed formula in our configuration.
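The gap can be illustrated with a short sketch under stated assumptions. The source does not give the exact expression used for the BSM approximation; here we assume it is the standard risk-neutral probability $N(d_2)$ of finishing at or above the strike under a geometric Brownian motion, with a zero rate. The Vasicek probability uses the Gaussian law implied by the solution (13). Function names are hypothetical.

```python
import math
from statistics import NormalDist


def bsm_digital_probability(s0, strike, sigma, maturity, r=0.0):
    """N(d2): probability that a GBM ends at or above the strike (assumed form)."""
    d2 = ((math.log(s0 / strike) + (r - 0.5 * sigma ** 2) * maturity)
          / (sigma * math.sqrt(maturity)))
    return NormalDist().cdf(d2)


def vasicek_digital_probability(s0, strike, kappa, b, sigma, maturity):
    """Probability that a Vasicek process ends at or above the strike (kappa > 0)."""
    decay = math.exp(-kappa * maturity)
    mean = s0 * decay + b * (1.0 - decay)
    std = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * kappa))
    return 1.0 - NormalDist(mean, std).cdf(strike)


# Parameters of the experimental setup, strike equal to the long term mean.
bsm_atm = bsm_digital_probability(s0=1.0, strike=1.0, sigma=0.15, maturity=0.5)
vasicek_atm = vasicek_digital_probability(s0=1.0, strike=1.0, kappa=0.01,
                                          b=1.0, sigma=0.15, maturity=0.5)
```

Under these assumptions the log-normal model carries a $-\frac{1}{2}\sigma^2 T$ correction that pushes the at-the-money probability slightly below 0.5, whereas the Gaussian Vasicek law is symmetric around the long term mean and gives 0.5.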

5 Related Work

The foundations of modern reinforcement learning described in [sutton1984temporal, williams1987class] established the theoretical framework to learn good policies for sequential decision problems by proposing a formulation of the cumulative future reward signal. The Q-learning algorithm introduced in [watkins1989learning] is one of the cornerstones of all recent reinforcement learning publications. However, the convergence of the Q-Learning algorithm was only established several years later. It was shown that the Q-Learning algorithm with non-linear function approximators [tsitsiklis1997analysis] and with off-policy learning [baird1995residual] could provoke a divergence of the Q-network. Therefore, the reinforcement learning community focused on linear function approximators [tsitsiklis1997analysis] to ensure convergence.

The emergence of neural networks and deep learning [goodfellow2016deep] contributed to address the use of reinforcement learning with neural networks. At an early stage, deep auto-encoders were used to extract feature spaces to solve reinforcement learning tasks [lange2010deep]. Then, thanks to the release of the Atari 2600 emulator [bellemare2013arcade], a public data set was available, answering the needs of the RL community for larger simulations. The Atari emulator allowed a proper performance benchmark of the different reinforcement learning algorithms and offered the possibility to test various architectures. The Atari games were used to introduce the concept of deep reinforcement learning [mnih2013playing, mnih2015human]. The authors used a convolutional neural network trained with a variant of Q-learning to successfully learn control policies directly from high dimensional sensory inputs. They reached human-level performance on many of the Atari games. Shortly after, deep reinforcement learning was challenged by double Q-Learning within a deep reinforcement learning framework [van2016deep]. The double Q-Learning algorithm was initially introduced in [hasselt2010double] in a tabular setting. The double deep Q-Learning gave more accurate estimates and led to much higher scores than those observed in [mnih2013playing, mnih2015human]. Consequently, ongoing work aims to further improve the results of the double deep Q-learning algorithms through different variants. In [dabney2018implicit], the authors used a quantile regression to approximate the full quantile function for the state-action return distribution, leading to a large class of risk-sensitive policies. It allowed them to further improve the scores on the Atari 2600 games simulator. Similarly, a new algorithm, called C51, which applies the Bellman equation to the learning of the approximate value distribution, was designed in [bellemare2017distributional]. They showed state-of-the-art results on the Atari 2600 emulator.

Other publications meanwhile focused on model-free policies and actor-critic frameworks. Stochastic policies were trained in [wawrzynski2013autonomous] with a replay buffer to avoid divergence. It was shown in [silver2014deterministic] that deterministic policy gradients (DPG) exist, even in a model-free environment. Subsequently, the DPG approach was extended in [balduzzi2015compatible] using a deviator network. Continuous control policies were learned using backpropagation, introducing the Stochastic Value Gradient SVG(0) and SVG(1) in [heess2015learning]. Recently, Deep Deterministic Policy Gradient (DDPG) was presented in [lillicrap2015continuous] to learn competitive policies using an actor-critic model-free algorithm based on the DPG that operates over continuous action spaces.

6 Conclusion

We introduced Modified Q-Learning for Vasicek or MQLV, a new model-free and off-policy reinforcement learning approach capable of evaluating an optimal policy of money management based on the aggregated transactions of the clients. MQLV is part of a banking strategy that looks to minimize the customer churn by including more transparency and more personalization in the decision process related to bank loan applications or credit card limits. It relies on a digital function to estimate the future probability of an event such as a payment default. We discuss its relation with the Bellman optimality equation and the Q-learning update. We conducted experiments on synthetic data sets because of the privacy and confidentiality issues related to the retail banking data sets. The generated data sets followed a mean reverting stochastic diffusion process, the Vasicek model, simulating retail banking data sets such as transaction payments. Our experiments showed the performance of MQLV with respect to the BSM’s closed formula for vanilla options. We also highlighted that MQLV is able to determine an optimal policy, an optimal Q-function, optimal actions and optimal states reflected by accurate probabilities. Surprisingly, we observed that MQLV led to more accurate event probabilities than the popular BSM’s formula.

Future work will address the creation of a fully anonymized data set illustrating the retail banking daily transactions in compliance with privacy, confidentiality and regulatory requirements. We will also evaluate the MQLV's performance for data sets that violate the Vasicek assumptions. We, furthermore, observed that the Q-learning update could underestimate the real probability values for simulations involving a small temporal discretization such as $\Delta t = 200$. Preliminary results showed that it is caused by the error of the basis function approximator. We will address this point in future research. Finally, we will extend the Q-learning update to other schemes for improved accuracy and incorporate a deep learning framework.

References