MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning
Abstract
Reinforcement learning has become one of the best approaches to train a computer game emulator capable of human-level performance. In a reinforcement learning approach, an optimal value function is learned across a set of actions, or decisions, that lead to a set of states giving different rewards, with the objective of maximizing the overall reward. A policy assigns to each state-action pair an expected return. We call an optimal policy a policy for which the value function is optimal. QLBS, Q-Learner in the Black-Scholes(-Merton) Worlds, applies the reinforcement learning concepts, and notably the popular Q-learning algorithm, to the financial stochastic model of Black, Scholes and Merton. It is, however, specifically optimized for the geometric Brownian motion and vanilla options. Its range of application is, therefore, limited to vanilla option pricing within the financial markets. We propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement learning approach that determines the optimal policy of money management based on the aggregated financial transactions of the clients. It unlocks new frontiers to establish personalized credit card limits or to assess bank loan applications, targeting the retail banking industry. MQLV extends the simulation to mean-reverting stochastic diffusion processes and uses a digital function, a Heaviside step function expressed in its discrete form, to estimate the probability of a future event such as a payment default. In our experiments, we first show the similarities between a set of historical financial transactions and Vasicek-generated transactions and then underline the potential of MQLV on generated Monte Carlo simulations. Finally, MQLV is the first Q-learning, Vasicek-based methodology addressing transparent decision-making processes in retail banking.
Keywords:
Q-Learning, Monte Carlo, Payment Transactions
1 Introduction
A major goal of the reinforcement learning (RL) and machine learning (ML) communities is to build efficient representations of the current environment to solve complex tasks. In RL, an agent relies on multiple sensory inputs and past experience to derive a set of plausible actions to handle a new situation [mnih2013playing]. While the initial idea behind RL is not new [sutton1984temporal, watkins1989learning, williams1987class], significant progress has been achieved recently by combining neural networks and Deep Learning (DL) with RL. The progress of DL [krizhevsky2012imagenet, sermanet2013pedestrian] has allowed the development of a novel agent combining RL with a class of deep artificial neural networks [mnih2013playing, mnih2015human], resulting in the Deep Q-Network (DQN). The Q refers to the Q-learning algorithm introduced in [watkins1992q]. It is an incremental method that successively improves its evaluations of the quality of the state-action pairs. The DQN approach achieves human-level performance on Atari video games using unprocessed pixels as inputs. In [van2016deep], deep RL with double Q-learning was proposed to challenge the DQN approach while trying to reduce the overestimation of the action values, a well-known drawback of the Q-learning and DQN methodologies. The extension of the DQN approach from discrete to continuous action domains, directly from raw pixel inputs, was successfully achieved for various simulated tasks [lillicrap2015continuous].
Nonetheless, most of the proposed models focused on gaming and computer game simulation, and very few addressed the financial world. In QLBS [halperin2017qlbs], an RL approach is applied to the Black, Scholes and Merton (BSM) financial framework for derivatives [black1973pricing, merton1973theory], a cornerstone of modern quantitative finance. In the BSM model, the dynamics of a stock are defined as following a Geometric Brownian Motion (GBM) to estimate the price of a vanilla option on that stock [wilmott2013paul]. A vanilla option is an option that gives the holder the right to buy or sell the underlying asset, a stock, at maturity for a certain price, the strike price. QLBS is one of the first approaches to propose a complete RL framework for finance. As mentioned by the author, a certain number of topics are, however, not covered in the approach. For instance, it is specifically designed for vanilla options and does not address any other type of financial application. Additionally, the initial generated paths rely on the popular GBM, but there exist a significant number of other popular stochastic models depending on the market dynamics [hull2003options].
In this work, we describe an RL approach tailored for personal recommendation in retail banking regarding money management, to be used for loan applications or credit card limits. The method is part of a banking strategy that tries to reduce customer churn in a competitive retail banking market. We rely on the Q-learning algorithm and on a mean-reverting diffusion process to address this topic. It ultimately leads to a fitted Q-iteration update and a model-free and off-policy setting. The diffusion process reflects the time series observed in retail banking such as transaction payments or credit card transactions. Such data is, however, strictly confidential and protected by the regulators and, therefore, cannot be released publicly. Furthermore, we introduce a new terminal digital function, $\mathrm{\Pi}$, defined as a Heaviside step function in its discrete form for a discrete variable $n\in \mathbb{R}$. The digital function is at the core of our approach for retail banking since it can evaluate the future probability of an event including, for instance, the future default probability of a client based on their spending. Our method converges to an optimal policy, and to optimal sets of actions and states, respectively the spending and the available money. Retail banks can, consequently, determine the optimal policy of money management based on the aggregated financial transactions of the clients. The banks are able to compare the MQLV’s optimal policy with the individual policy of each client. It contributes to an unbiased decision-making process while offering transparency to the client. Our main contributions are summarized below:

• A new RL framework called MQLV, Modified Q-Learning for Vasicek, extending the initial QLBS framework [halperin2017qlbs]. MQLV uses the theoretical foundation of RL and Q-learning to build a financial RL framework based on a mean-reverting diffusion process, the Vasicek model [vasicek1977equilibrium], to simulate data, in order to ultimately reach a model-free and off-policy RL setting.

• The definition of a digital function to estimate the future probability of an event. The aim is to widen the application perspectives of MQLV by using a characteristic terminal function that is usable for a decision-making process in retail banking, such as the estimation of the default probability of a client.

• The first application of Q-learning to determine the clients’ optimal policy of money management in retail banking. MQLV leverages the clients’ aggregated financial transactions to define the optimal policy of money management, targeting the risk estimation of bank loan applications or credit cards.
The paper is structured as follows. In section 2, we review QLBS and the Q-learning formulations derived by Halperin in [halperin2017qlbs] in the context of the Black, Scholes and Merton model. In section 3, we describe MQLV according to the Q-learning algorithm that leads to a model-free and off-policy setting. We highlight experimental results in section 4. We discuss related work in section 5 and we conclude in section 6 by addressing promising directions for future work.
2 Background
We denote by ${A}_{t}\in \mathcal{A}$ the action taken at time $t$ for a given state ${X}_{t}\in \mathcal{X}$ and by ${R}_{t+1}$ the immediate reward. The stochastic diffusion process at time $t$ is denoted by ${S}_{t}\in \mathcal{S}$. The discount factor that trades off the importance of immediate and later rewards is expressed by $\gamma \in [0,1]$.
We recall a policy is a mapping from states to probabilities of selecting each possible action [sutton2018reinforcement]. By following the notations of [halperin2017qlbs], the policy $\pi $ such that
$$\pi :\{0,\dots ,T-1\}\times \mathcal{X}\to \mathcal{A}$$  (1) 
maps at time $t$ the current state ${X}_{t}={x}_{t}$ into the action ${a}_{t}\in \mathcal{A}$.
$${a}_{t}=\pi (t,{x}_{t})$$  (2) 
The value of a state $x$ under a policy $\pi $, denoted by ${v}_{\pi}(x)$, is the expected return when starting in $x$ and following $\pi $ thereafter. It is called the state-value function for policy $\pi $.
$${v}_{\pi}(x)={\mathbb{E}}_{\pi}\left[\sum _{k=0}^{\infty}{\gamma}^{k}{R}_{t+k+1}\mid {X}_{t}=x\right]$$  (3) 
The action-value function ${q}_{\pi}(x,a)$ for policy $\pi $ defines the value of taking action $a$ in state $x$ under a policy $\pi $ as the expected return starting from $x$, taking the action $a$, and thereafter following policy $\pi $.
$${q}_{\pi}(x,a)={\mathbb{E}}_{\pi}\left[\sum _{k=0}^{\infty}{\gamma}^{k}{R}_{t+k+1}\mid {X}_{t}=x,{A}_{t}=a\right]$$  (4) 
The optimal policy, ${\pi}_{t}^{*}$, is the policy that maximizes the state-value function.
$${\pi}_{t}^{*}({X}_{t})=\mathrm{arg}\underset{\pi}{\mathrm{max}}{V}_{t}^{\pi}({X}_{t})$$  (5) 
The optimal state-value function, ${V}_{t}^{*}$, satisfies the Bellman optimality equation such that
$${V}_{t}^{*}({X}_{t})={\mathbb{E}}_{t}^{{\pi}^{*}}[{R}_{t}({X}_{t},{u}_{t}={\pi}_{t}^{*}({X}_{t}),{X}_{t+1})+\gamma {V}_{t+1}^{*}({X}_{t+1})].$$  (6) 
The Bellman equation for the action-value function, the Q-function, is defined as
$${Q}_{t}^{\pi}(x,a)={\mathbb{E}}_{t}\left[{R}_{t}({X}_{t},{a}_{t},{X}_{t+1})\mid {X}_{t}=x,{a}_{t}=a\right]+\gamma {\mathbb{E}}_{t}^{\pi}\left[{V}_{t+1}^{\pi}({X}_{t+1})\mid {X}_{t}=x\right].$$  (7) 
The optimal action-value function, ${Q}_{t}^{*}$, is obtained for the optimal policy with
$${\pi}_{t}^{*}=\mathrm{arg}\underset{\pi}{\mathrm{max}}{Q}_{t}^{\pi}(x,a).$$  (8) 
The optimal state-value and action-value functions are connected by the following system of equations.
$$\begin{cases}{V}_{t}^{*}(x)=\underset{a}{\max}\,{Q}_{t}^{*}(x,a)\\ {Q}_{t}^{*}(x,a)={\mathbb{E}}_{t}\left[{R}_{t}({X}_{t},a,{X}_{t+1})\right]+\gamma {\mathbb{E}}_{t}\left[{V}_{t+1}^{*}({X}_{t+1})\mid {X}_{t}=x\right]\end{cases}$$  (9) 
Therefore, we can obtain the Bellman optimality equation.
$${Q}_{t}^{*}(x,a)={\mathbb{E}}_{t}\left[{R}_{t}({X}_{t},{a}_{t},{X}_{t+1})+\gamma \underset{{a}_{t+1}\in \mathcal{A}}{\max}{Q}_{t+1}^{*}({X}_{t+1},{a}_{t+1})\mid {X}_{t}=x,{a}_{t}=a\right]$$  (10) 
Using the Robbins-Monro update [robbins1985stochastic], the update rule for the optimal Q-function with online Q-learning on the data point $({X}_{t}^{(n)},{a}_{t}^{(n)},{R}_{t}^{(n)},{X}_{t+1}^{(n)})$ is expressed by the following equation, with $\alpha $ a step-size parameter.
$$\begin{aligned}{Q}_{t}^{*,k+1}({X}_{t},{a}_{t})=\;&(1-{\alpha}^{k}){Q}_{t}^{*,k}({X}_{t},{a}_{t})\\ &+{\alpha}^{k}\left[{R}_{t}({X}_{t},{a}_{t},{X}_{t+1})+\gamma \underset{{a}_{t+1}\in \mathcal{A}}{\max}{Q}_{t+1}^{*,k}({X}_{t+1},{a}_{t+1})\right]\end{aligned}$$  (11) 
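The update rule of equation (11) can be illustrated with a minimal tabular sketch. It assumes discrete, integer-indexed states and actions and a time-indexed table, which is a simplification made only for illustration; the function name and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np

def q_learning_update(q, t, x, a, reward, x_next, alpha, gamma):
    """Apply one Robbins-Monro step of equation (11) to a tabular Q-function.

    q has shape (T + 1, n_states, n_actions); q[t + 1] stores the values
    of the next time step (assumption: discrete states and actions).
    """
    # Bootstrapped target: immediate reward plus discounted best next value.
    target = reward + gamma * np.max(q[t + 1, x_next])
    # Convex combination of the old estimate and the new target.
    q[t, x, a] = (1.0 - alpha) * q[t, x, a] + alpha * target
    return q[t, x, a]

# Hypothetical toy problem: 3 time steps, 2 states, 2 actions.
q = np.zeros((4, 2, 2))
q[3, 1] = [1.0, 2.0]  # values at the last time step for state 1
new_value = q_learning_update(
    q, t=2, x=0, a=1, reward=0.5, x_next=1, alpha=0.1, gamma=0.9
)
```

The step size is passed as a plain number here; a decreasing schedule can be supplied by the caller at each iteration.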
3 Algorithm
We describe, in this section, how to derive a general recursive formulation for the optimal action. It is equivalent to an optimal hedge under a financial framework such as, for instance, portfolio or personal finance optimization. We additionally present the formulation of the action-value function, the Q-function. Both the optimal hedge and the Q-function follow the assumption of a continuous space scenario generated by the Vasicek model with Monte Carlo simulation.
By relying on the financial framework established in [halperin2017qlbs], we consider a mean reverting diffusion process, also known as the Vasicek model [vasicek1977equilibrium].
$$d{S}_{t}=\kappa (b-{S}_{t})dt+\sigma d{B}_{t}$$  (12) 
The term $\kappa $ is the speed of mean reversion, $b$ the long-term mean level, $\sigma $ the volatility and ${B}_{t}$ the Brownian motion. The solution of the stochastic equation is equal to
$${S}_{t}={S}_{0}{e}^{-\kappa t}+b(1-{e}^{-\kappa t})+\sigma {e}^{-\kappa t}{\int}_{0}^{t}{e}^{\kappa s}d{B}_{s}.$$  (13) 
Therefore, we define a new time-uniform state variable, i.e. without a drift, as
$$\begin{cases}{S}_{t}={X}_{t}+{S}_{0}{e}^{-\kappa t}+b(1-{e}^{-\kappa t})\\ \text{with } {X}_{t}={S}_{t}-\left[{S}_{0}{e}^{-\kappa t}+b(1-{e}^{-\kappa t})\right]=\sigma {e}^{-\kappa t}{\int}_{0}^{t}{e}^{\kappa s}d{B}_{s}\end{cases}.$$  (14) 
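As an illustration of equations (12) to (14), the following sketch generates Monte Carlo paths of the Vasicek process and removes the deterministic drift to obtain the state variable. It uses the exact Gaussian transition implied by equation (13), which is one possible discretization and not necessarily the one used by the authors; the function names are hypothetical, and the parameter values are those reported in the experimental setup of section 4.

```python
import numpy as np

def simulate_vasicek(s0, kappa, b, sigma, maturity, n_steps, n_paths, seed=0):
    """Simulate Vasicek paths with the exact one-step Gaussian transition."""
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    decay = np.exp(-kappa * dt)
    # Standard deviation of the stochastic integral over one time step.
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * kappa))
    paths = np.empty((n_paths, n_steps + 1))
    paths[:, 0] = s0
    for i in range(n_steps):
        noise = rng.standard_normal(n_paths)
        paths[:, i + 1] = paths[:, i] * decay + b * (1.0 - decay) + step_std * noise
    return paths

def to_state_variable(paths, s0, kappa, b, maturity):
    """Subtract the deterministic part of equation (13) to get X_t of (14)."""
    n_steps = paths.shape[1] - 1
    times = np.linspace(0.0, maturity, n_steps + 1)
    drift = s0 * np.exp(-kappa * times) + b * (1.0 - np.exp(-kappa * times))
    return paths - drift

# Parameters taken from the experimental setup (first data set).
paths = simulate_vasicek(s0=1.0, kappa=0.01, b=1.0, sigma=0.15,
                         maturity=0.5, n_steps=5, n_paths=20_000)
states = to_state_variable(paths, s0=1.0, kappa=0.01, b=1.0, maturity=0.5)
```

The resulting `states` array starts at zero on every path and has zero mean at every time step, which is what "without a drift" refers to.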
Instead of estimating the price of a vanilla option as proposed in [halperin2017qlbs], we are interested in estimating the future probability of an event using the Q-learning algorithm and a digital function. First, we define the corresponding terminal condition with the following equation
$${Q}_{T}^{*}({X}_{T},{a}_{T}=0)=-{\mathrm{\Pi}}_{T}-\lambda Var\left[{\mathrm{\Pi}}_{T}({X}_{T})\right]$$  (15) 
where ${\mathrm{\Pi}}_{T}$ is the digital function at time $t=T$ defined such that
$${\mathrm{\Pi}}_{T}={\mathbb{1}}_{{S}_{T}\ge K}=\begin{cases}1&\text{if } {S}_{T}\ge K\\ 0&\text{otherwise}\end{cases}$$  (16) 
and the second term, $\lambda Var\left[{\mathrm{\Pi}}_{T}({X}_{T})\right]$, is a regularization term with a small positive parameter $\lambda \in {\mathbb{R}}^{+}$, $\lambda \ll 1$. We use a backward loop to determine the value of ${\mathrm{\Pi}}_{t}$ for $t=T-1,\dots ,0$.
$${\mathrm{\Pi}}_{t}=\gamma \left({\mathrm{\Pi}}_{t+1}-{a}_{t}\mathrm{\Delta}{S}_{t}\right)\quad \text{with}\quad \mathrm{\Delta}{S}_{t}={S}_{t+1}-\frac{{S}_{t}}{\gamma}={S}_{t+1}-{e}^{r\mathrm{\Delta}t}{S}_{t}$$  (17) 
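The terminal condition of equation (16) and the backward recursion of equation (17) can be sketched as follows. The array layout (one row per Monte Carlo path, one column per time step) and the function names are assumptions made for illustration, and the actions are taken as given inputs rather than learned.

```python
import numpy as np

def digital_payoff(s_terminal, strike):
    """Digital function of equation (16): 1 if S_T >= K, 0 otherwise."""
    return (s_terminal >= strike).astype(float)

def backward_digital(paths, actions, strike, gamma):
    """Backward loop of equation (17).

    paths:   (n_paths, n_steps + 1) simulated values S_t
    actions: (n_paths, n_steps) actions a_t applied on each path
    """
    n_paths, n_cols = paths.shape
    pi = np.zeros((n_paths, n_cols))
    pi[:, -1] = digital_payoff(paths[:, -1], strike)
    for t in range(n_cols - 2, -1, -1):
        # Discounted increment: Delta S_t = S_{t+1} - S_t / gamma.
        delta_s = paths[:, t + 1] - paths[:, t] / gamma
        pi[:, t] = gamma * (pi[:, t + 1] - actions[:, t] * delta_s)
    return pi

# Hypothetical two-path, one-step example with a unit action.
toy_paths = np.array([[1.0, 1.1], [1.0, 0.9]])
toy_actions = np.ones((2, 1))
toy_pi = backward_digital(toy_paths, toy_actions, strike=1.0, gamma=1.0)
```

With zero actions and no discounting, the mean of the first column reduces to the fraction of paths that end above the strike.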
Following the definition of the equations (6) and (17), we express the one-step time-dependent random reward with respect to the cross-sectional information ${\mathcal{F}}_{t}$ as follows.
$$\begin{aligned}{R}_{t}({X}_{t},{a}_{t},{X}_{t+1})&=\gamma {a}_{t}\mathrm{\Delta}{S}_{t}({X}_{t},{X}_{t+1})-\lambda Var\left[{\mathrm{\Pi}}_{t}\mid {\mathcal{F}}_{t}\right]\\ \text{with}\quad Var\left[{\mathrm{\Pi}}_{t}\mid {\mathcal{F}}_{t}\right]&={\gamma}^{2}{\mathbb{E}}_{t}\left[{\widehat{\mathrm{\Pi}}}_{t+1}^{2}-2{a}_{t}\mathrm{\Delta}{\widehat{S}}_{t}{\widehat{\mathrm{\Pi}}}_{t+1}+{a}_{t}^{2}\mathrm{\Delta}{\widehat{S}}_{t}^{2}\right]\end{aligned}$$  (18) 
The term $\mathrm{\Delta}{\overline{S}}_{t}$ is the sample mean over the $N$ Monte Carlo paths, $\mathrm{\Delta}{\overline{S}}_{t}=\frac{1}{N}\sum _{k=1}^{N}\mathrm{\Delta}{S}_{t}^{k}$, with $\mathrm{\Delta}{\widehat{S}}_{t}=\mathrm{\Delta}{S}_{t}-\mathrm{\Delta}{\overline{S}}_{t}$, and similarly ${\widehat{\mathrm{\Pi}}}_{t+1}={\mathrm{\Pi}}_{t+1}-{\overline{\mathrm{\Pi}}}_{t+1}$ with ${\overline{\mathrm{\Pi}}}_{t+1}=\frac{1}{N}\sum _{k=1}^{N}{\mathrm{\Pi}}_{t+1}^{k}$. Because of the regularization term, the expected reward ${R}_{t}$ is quadratic in ${a}_{t}$ and has a finite maximizer. Therefore, we inject the one-step time-dependent random reward of equation (18) into the Bellman optimality equation (10) to obtain the following Q-learning update, ${Q}^{\ast}$, and the optimal action, ${a}^{\ast}$, to be solved within a backward loop $\forall t=T-1,\dots ,0$.
$$\begin{aligned}{Q}_{t}^{\ast}({X}_{t},{a}_{t})&=\gamma {\mathbb{E}}_{t}\left[{Q}_{t+1}^{\ast}({X}_{t+1},{a}_{t+1}^{\ast})+{a}_{t}\mathrm{\Delta}{S}_{t}\right]-\lambda Var\left[{\mathrm{\Pi}}_{t}\mid {\mathcal{F}}_{t}\right]\\ {a}_{t}^{\ast}({X}_{t})&={\mathbb{E}}_{t}\left[\mathrm{\Delta}{\widehat{S}}_{t}{\widehat{\mathrm{\Pi}}}_{t+1}+\frac{1}{2\lambda \gamma}\mathrm{\Delta}{S}_{t}\right]{\left[{\mathbb{E}}_{t}\left[{\left(\mathrm{\Delta}{\widehat{S}}_{t}\right)}^{2}\right]\right]}^{-1}\end{aligned}$$  (19) 
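As a reading aid, the expression of the optimal action in equation (19) can be recovered in one step. The sketch assumes that ${Q}_{t+1}^{\ast}$ does not depend on ${a}_{t}$ and that ${a}_{t}$ is known given ${\mathcal{F}}_{t}$, so that it can be taken out of the conditional expectation. Substituting the variance of equation (18) into the first line of equation (19) and setting the derivative with respect to ${a}_{t}$ to zero gives

$$\frac{\partial {Q}_{t}^{\ast}}{\partial {a}_{t}}=\gamma {\mathbb{E}}_{t}\left[\mathrm{\Delta}{S}_{t}\right]+2\lambda {\gamma}^{2}{\mathbb{E}}_{t}\left[\mathrm{\Delta}{\widehat{S}}_{t}{\widehat{\mathrm{\Pi}}}_{t+1}\right]-2\lambda {\gamma}^{2}{a}_{t}{\mathbb{E}}_{t}\left[{\left(\mathrm{\Delta}{\widehat{S}}_{t}\right)}^{2}\right]=0.$$

Dividing by $2\lambda {\gamma}^{2}$ and solving for ${a}_{t}$ yields the second line of equation (19). The second derivative, $-2\lambda {\gamma}^{2}{\mathbb{E}}_{t}[{(\mathrm{\Delta}{\widehat{S}}_{t})}^{2}]$, is negative for $\lambda >0$, so this stationary point is a maximum.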
We refer to [halperin2017qlbs] for further details about the analytical solution, ${a}^{\ast}$, of the Q-learning update (19). Our approach uses the $N$ Monte Carlo paths simultaneously to determine the optimal action ${a}^{*}$ and the optimal action-value function ${Q}^{*}$ to learn the policy ${\pi}^{\ast}$. Thus, we do not need an explicit conditioning on ${X}_{t}$ at time $t$. We assume a set of basis functions $\{{\mathrm{\Phi}}_{n}(x)\}$ on which the optimal action ${a}_{t}^{*}({X}_{t})$ and the optimal action-value function, ${Q}_{t}^{*}({X}_{t},{a}_{t}^{*})$, can be expanded.
$${a}_{t}^{*}({X}_{t})=\sum _{n=1}^{M}{\varphi}_{nt}{\mathrm{\Phi}}_{n}({X}_{t})\quad \text{and}\quad {Q}_{t}^{*}({X}_{t},{a}_{t}^{*})=\sum _{n=1}^{M}{\omega}_{nt}{\mathrm{\Phi}}_{n}({X}_{t})$$  (20) 
The coefficients $\varphi $ and $\omega $ are computed recursively backward in time $\forall t=T-1,\dots ,0$. Subsequently, we define the minimization problem to evaluate ${\varphi}_{nt}$.
$${G}_{t}(\varphi )=\sum _{k=1}^{N}\left[-\sum _{n=1}^{M}{\varphi}_{nt}{\mathrm{\Phi}}_{n}({X}_{t}^{k})\mathrm{\Delta}{S}_{t}^{k}+\gamma \lambda {\left({\widehat{\mathrm{\Pi}}}_{t+1}^{k}-\sum _{n=1}^{M}{\varphi}_{nt}{\mathrm{\Phi}}_{n}({X}_{t}^{k})\mathrm{\Delta}{\widehat{S}}_{t}^{k}\right)}^{2}\right]$$  (21) 
Setting the derivative of equation (21) with respect to each coefficient to zero leads to the following set of linear equations $\forall n=1,\dots ,M$.
$$\begin{cases}{A}_{nm}^{(t)}=\sum _{k=1}^{N}{\mathrm{\Phi}}_{n}({X}_{t}^{k}){\mathrm{\Phi}}_{m}({X}_{t}^{k}){\left(\mathrm{\Delta}{\widehat{S}}_{t}^{k}\right)}^{2}\\ {B}_{n}^{(t)}=\sum _{k=1}^{N}{\mathrm{\Phi}}_{n}({X}_{t}^{k})\left[{\widehat{\mathrm{\Pi}}}_{t+1}^{k}\mathrm{\Delta}{\widehat{S}}_{t}^{k}+\frac{1}{2\gamma \lambda}\mathrm{\Delta}{S}_{t}^{k}\right]\end{cases}\quad \text{with}\quad \sum _{m=1}^{M}{A}_{nm}^{(t)}{\varphi}_{mt}={B}_{n}^{(t)}$$  (22) 
Therefore, the coefficients of the optimal action ${a}_{t}^{*}({X}_{t})$ are determined by
$${\varphi}_{t}^{*}={A}_{t}^{-1}{B}_{t}.$$  (23) 
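The linear system of equations (22) and (23) can be assembled and solved for one time step as in the sketch below. The function name and the small ridge term added to the matrix for numerical stability are assumptions introduced for illustration and are not part of the derivation above; the choice of basis functions is left to the caller.

```python
import numpy as np

def optimal_action_coefficients(basis, delta_s, pi_next, gamma, lam, ridge=1e-8):
    """Solve A phi = B (equations (22)-(23)) for one time step.

    basis:   (N, M) matrix with entries Phi_n(X_t^k)
    delta_s: (N,) increments Delta S_t^k
    pi_next: (N,) values Pi_{t+1}^k
    ridge:   assumed stabilizer, set to 0.0 for the plain formula
    """
    # De-meaned quantities across the N Monte Carlo paths.
    delta_s_hat = delta_s - delta_s.mean()
    pi_hat = pi_next - pi_next.mean()
    # A_nm = sum_k Phi_n Phi_m (Delta S_hat)^2
    a_mat = (basis * (delta_s_hat**2)[:, None]).T @ basis
    # B_n = sum_k Phi_n [Pi_hat * Delta S_hat + Delta S / (2 gamma lambda)]
    b_vec = basis.T @ (pi_hat * delta_s_hat + delta_s / (2.0 * gamma * lam))
    a_mat = a_mat + ridge * np.eye(basis.shape[1])
    return np.linalg.solve(a_mat, b_vec)

# Hypothetical example with a single constant basis function.
toy_basis = np.ones((3, 1))
toy_phi = optimal_action_coefficients(
    toy_basis,
    delta_s=np.array([0.1, -0.1, 0.3]),
    pi_next=np.array([1.0, 0.0, 1.0]),
    gamma=1.0, lam=0.5, ridge=0.0,
)
```

The optimal action on each path is then recovered as `basis @ phi`, following the expansion of equation (20).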
Hereinafter, we use Fitted Q Iteration (FQI) [hasselt2010double, murphy2005generalization] to evaluate the coefficients $\omega $. The optimal action-value function, ${Q}^{*}({X}_{t},{a}_{t})$, is represented in its matrix form according to the basis function expansion of equation (20).
$$\begin{aligned}{Q}_{t}^{*}({X}_{t},{a}_{t})&=\left(1,{a}_{t},\frac{1}{2}{a}_{t}^{2}\right)\begin{pmatrix}{W}_{11}(t)&{W}_{12}(t)&\dots &{W}_{1M}(t)\\ {W}_{21}(t)&{W}_{22}(t)&\dots &{W}_{2M}(t)\\ {W}_{31}(t)&{W}_{32}(t)&\dots &{W}_{3M}(t)\end{pmatrix}\begin{pmatrix}{\mathrm{\Phi}}_{1}({X}_{t})\\ \vdots \\ {\mathrm{\Phi}}_{M}({X}_{t})\end{pmatrix}\\ &={A}_{t}^{T}{W}_{t}\mathrm{\Phi}({X}_{t})={A}_{t}^{T}{U}_{W}(t,{X}_{t})\end{aligned}$$  (24) 
Based on the least-squares optimization problem, the coefficients ${W}_{t}$ are determined by backward recursion $\forall t=T-1,\dots ,0$ as follows
$$\begin{aligned}{\mathcal{L}}_{t}({W}_{t})&=\sum _{k=1}^{N}{\left({R}_{t}({X}_{t},{a}_{t},{X}_{t+1})+\gamma \underset{{a}_{t+1}\in \mathcal{A}}{\max}{Q}_{t+1}^{*}({X}_{t+1},{a}_{t+1})-{W}_{t}{\mathrm{\Psi}}_{t}({X}_{t},{a}_{t})\right)}^{2}\\ &\text{with}\quad {W}_{t}\mathrm{\Psi}({X}_{t},{a}_{t})+\epsilon \underset{\epsilon \to 0}{\longrightarrow}{R}_{t}({X}_{t},{a}_{t},{X}_{t+1})+\gamma \underset{{a}_{t+1}\in \mathcal{A}}{\max}{Q}_{t+1}^{*}({X}_{t+1},{a}_{t+1})\end{aligned}$$  (25) 
for which we derive the following set of linear equations.
$$\begin{cases}{M}_{n}^{(t)}=\sum _{k=1}^{N}{\mathrm{\Psi}}_{n}({X}_{t}^{k},{a}_{t}^{k})\left[\eta \left({R}_{t}({X}_{t},{a}_{t},{X}_{t+1})+\gamma \underset{{a}_{t+1}\in \mathcal{A}}{\max}{Q}_{t+1}^{*}({X}_{t+1},{a}_{t+1})\right)\right]\\ \text{with}\quad \eta \sim B(N,p)\end{cases}$$  (26) 
The term $B(N,p)$ represents the binomial distribution for $N$ samples with probability $p$. It plays the role of a dropout function when evaluating the matrix ${M}_{t}$ to compensate for the well-known drawback of the Q-learning algorithm, namely the overestimation of the Q-function values. We finally reach the definition of the optimal weights to determine the optimal action ${a}^{\ast}$.
$${W}_{t}^{*}={S}_{t}^{-1}{M}_{t}$$  (27) 
Here ${S}_{t}$ denotes, as in [halperin2017qlbs], the matrix with entries ${S}_{nm}^{(t)}=\sum _{k=1}^{N}{\mathrm{\Psi}}_{n}({X}_{t}^{k},{a}_{t}^{k}){\mathrm{\Psi}}_{m}({X}_{t}^{k},{a}_{t}^{k})$, and not the diffusion process of equation (12).
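One fitted Q iteration step, in the spirit of equations (24) to (27), can be sketched as follows. Several choices are assumptions made for illustration: the feature map is built as the flattened outer product of $(1,{a}_{t},\frac{1}{2}{a}_{t}^{2})$ with the basis functions, the dropout variable $\eta$ is read as an independent 0/1 mask per path with keep probability $p$, and a small ridge term is added before solving. The function names are hypothetical.

```python
import numpy as np

def state_action_features(basis, actions):
    """Build Psi(X, a) as the flattened outer product of (1, a, a^2/2) and Phi(X).

    basis:   (N, M) matrix with entries Phi_n(X_t^k)
    actions: (N,) actions a_t^k
    Returns an (N, 3 * M) matrix, row-major in the layout of W_t in equation (24).
    """
    action_vec = np.stack(
        [np.ones_like(actions), actions, 0.5 * actions**2], axis=1
    )
    return (action_vec[:, :, None] * basis[:, None, :]).reshape(len(actions), -1)

def fitted_q_step(psi, rewards, q_next_max, gamma, keep_prob=1.0, ridge=1e-8, seed=0):
    """Least-squares solution W = S^{-1} M with a dropout mask on the targets."""
    rng = np.random.default_rng(seed)
    # Assumed reading of eta: one Bernoulli(keep_prob) draw per Monte Carlo path.
    eta = rng.binomial(1, keep_prob, size=len(rewards))
    target = rewards + gamma * q_next_max
    s_mat = psi.T @ psi + ridge * np.eye(psi.shape[1])
    m_vec = psi.T @ (eta * target)
    return np.linalg.solve(s_mat, m_vec)
```

The returned vector can be reshaped to `(3, M)` to recover the matrix ${W}_{t}$; with `keep_prob=1.0` the step reduces to ordinary least squares on the Bellman targets.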
The proposed model does not require any assumption on the dynamics of the time series, neither transition probabilities nor policy or reward functions. It is an off-policy, model-free approach. The computation of the optimal policy, the optimal action and the optimal Q-function that leads to the future event probabilities is summed up in algorithm 1.
4 Experiments
We empirically evaluate the performance of MQLV. We first highlight the similarities between historical payment transactions and Vasicek-generated transactions. We then underline MQLV’s capability to learn the optimal policy of money management based on the estimation of future event probabilities, in comparison to the closed formula of [black1973pricing, merton1973theory], hereinafter denoted by BSM’s closed formula. We rely on synthetic data sets because of the privacy and confidentiality issues of the retail banking data sets.
Data Availability and Data Description
One of our contributions is to bring an RL framework designed for retail banking. However, none of the original data sets can be released publicly because of the highly sensitive information they contain. We therefore show the similarities between a small sample of anonymized transactions and Vasicek-generated transactions [vasicek1977equilibrium]. We then use the Vasicek mean-reverting stochastic diffusion process to generate larger synthetic data sets similar to the original retail banking data sets. The mean-reverting dynamic is particularly interesting since it reflects a wide range of retail banking transactions including credit card transactions, the savings history or the clients’ spending. Three different data sets were generated to avoid any bias that could have been introduced by using only one data set. We chose to vary the number of Monte Carlo paths between the data sets to assess the influence of the sample size on the results. The first, second and third data sets contain respectively 20,000, 30,000 and 40,000 paths. We publicly release the synthetic data sets to ensure the reproducibility of the experiments. The code and the data sets are available at https://github.com/dagrate/MQLV.
Experimental Setup and Code Availability
In our experiments, we generate synthetic data sets using the Vasicek model with a parameter ${S}_{0}=1.0$ corresponding to the value of the time series at $t=0$, a maturity of six months $T=0.5$, a speed of mean reversion $a=0.01$ (denoted $\kappa $ in equation (12)), a long-term mean $b=1$ and a volatility $\sigma =0.15$. Because the choice of the parameters of the Vasicek model does not have any influence on the results of the Q-learning approach, the numbers were fixed such that any limitations of the methodology would be quickly observed. The number of time steps is fixed to 5. We additionally use different strike values for the experiments, explicitly mentioned in the Results and Discussions subsection. The simulations were performed on a computer with 16GB of RAM, an Intel i7 CPU and a Tesla K80 GPU accelerator. To ensure the reproducibility of the experiments, the code is available at https://github.com/dagrate/MQLV.
Results and Discussions about MQLV
As mentioned above, we cannot publicly release a full anonymized transaction data set because of privacy, confidentiality and regulatory issues. We consequently highlight the similarities between the dynamics of a small sample of anonymized transactions and Vasicek-generated transactions for one client [santandercreditcards] in figure 1. The financial transactions in retail banking are periodic and often fluctuate around a long-term mean, reflecting the frequency and the amounts of the spending habits of the clients. The similar dynamics of the original and the generated transactions are highlighted by the small RMSE of 0.03. We also performed a least-squares calibration of the Vasicek parameters to assess the model’s plausibility. We can observe in table 1 that the calibrated Vasicek parameters have the same order of magnitude, which supports the hypothesis that the Vasicek model could be used to generate synthetic transactions.
Table 1

| Description | Value |
| --- | --- |
| RMSE | 0.0335 |
| Vasicek speed of mean reversion $a$ | 0.5444 |
| Vasicek long-term mean $b$ | 0.9001 |
| Vasicek volatility $\sigma $ | 0.2185 |
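The least-squares calibration mentioned above is not detailed in the text. One common way to perform it, given here only as a hedged sketch and not as the authors' routine, is to regress each observation on the previous one and to map the slope, intercept and residual spread back to the Vasicek parameters through the exact transition implied by equation (13). The function name and the assumption of equally spaced observations are hypothetical.

```python
import numpy as np

def calibrate_vasicek(series, dt):
    """Estimate (speed of mean reversion, long-term mean, volatility).

    Assumes equally spaced observations with time step dt and fits
    S_{i+1} = intercept + slope * S_i + noise by ordinary least squares.
    Requires a fitted slope strictly between 0 and 1.
    """
    x, y = series[:-1], series[1:]
    slope, intercept = np.polyfit(x, y, 1)
    # slope = exp(-kappa * dt) and intercept = b * (1 - slope).
    kappa = -np.log(slope) / dt
    b = intercept / (1.0 - slope)
    residuals = y - (intercept + slope * x)
    # Residual variance equals sigma^2 * (1 - slope^2) / (2 * kappa).
    sigma = residuals.std(ddof=2) * np.sqrt(2.0 * kappa / (1.0 - slope**2))
    return kappa, b, sigma
```

Applied to a single client's transaction series, such a routine would return three numbers comparable to the parameter rows of table 1, under the stated assumptions.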
We present the core of our contribution in the following experiment. We aim at learning the optimal policy of money management. It is particularly interesting for bank loan applications, where a client’s spending policy can be compared with the optimal policy. We show that MQLV is capable of accurately evaluating the probability of a default event using a digital function, which indicates that the optimal policy of money management has been learned. Indeed, if MQLV’s learned policy differed from the optimal policy, the probabilities of default events would not be accurate. In figure 1, the estimation of future event probabilities for different strike values is represented. We rely on the BSM’s closed formula for vanilla option pricing [black1973pricing, merton1973theory] to approximate the digital function values [hull2003options]. We therefore used the BSM’s values as reference values to cross-validate the MQLV’s values. MQLV achieves a close representation of the event probabilities for the different strike values in figure 1. The curves of the MQLV and the BSM approaches are similar, with an RMSE of 1.5016. This result highlights that the learned Q-learning policy of MQLV is sufficiently close to the optimal policy to compute event probabilities almost identical to the probabilities of the BSM’s formula approximation.
We gathered quantitative results in table 2 for a deeper analysis of the MQLV’s results. The event probability values are listed for the three data sets. We chose a set of parameters for the Vasicek model such that our configuration is free of any time dependency. We therefore expect a probability value of 50% at a threshold of 1, because the dispersion of the generated data sets is only induced by the standard deviation of the normal distribution used to simulate the Brownian motion. Surprisingly, the MQLV values at a strike of 1 are closer to 50% than the BSM’s values for all the data sets. We can therefore conclude that, for our configuration, MQLV is capable of learning the optimal policy of money management, which is reflected by the accurate evaluation of the event probabilities.
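The expectation of a 50% probability at a threshold of 1 can be checked directly from equation (13): the terminal value ${S}_{T}$ of a Vasicek process is Gaussian, so the probability that ${S}_{T}\ge K$ has a closed form. The sketch below is an independent sanity check under that property and is not part of the MQLV algorithm; the function name is hypothetical and the parameter values are those of the experimental setup.

```python
import math

def vasicek_exceedance_probability(s0, kappa, b, sigma, maturity, strike):
    """Closed-form P(S_T >= K) for the Vasicek process of equation (12)."""
    # Mean and variance of the Gaussian terminal value S_T.
    mean = s0 * math.exp(-kappa * maturity) + b * (1.0 - math.exp(-kappa * maturity))
    variance = sigma**2 * (1.0 - math.exp(-2.0 * kappa * maturity)) / (2.0 * kappa)
    z = (mean - strike) / math.sqrt(variance)
    # Standard normal cumulative distribution evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

probability_at_one = vasicek_exceedance_probability(
    s0=1.0, kappa=0.01, b=1.0, sigma=0.15, maturity=0.5, strike=1.0
)
```

With ${S}_{0}=b=1$ the mean of ${S}_{T}$ equals 1 for any maturity, so the probability at a strike of 1 is 0.5 regardless of the volatility or of the speed of mean reversion.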
Table 2

| Data Set | Number of Paths | Strike Values | BSM’s Approx. Values (%) | MQLV Values (%) | Absolute Difference |
| --- | --- | --- | --- | --- | --- |
| 1 | 20,000 | 0.92 | 76.810 | 77.098 | 0.288 |
| 1 | 20,000 | 0.98 | 55.447 | 57.920 | 2.473 |
| 1 | 20,000 | 1.00 | 47.867 | 50.235 | 2.368 |
| 1 | 20,000 | 1.02 | 40.509 | 42.865 | 2.356 |
| 2 | 30,000 | 0.92 | 76.810 | 76.953 | 0.143 |
| 2 | 30,000 | 0.98 | 55.447 | 57.760 | 2.313 |
| 2 | 30,000 | 1.00 | 47.867 | 50.043 | 2.176 |
| 2 | 30,000 | 1.02 | 40.509 | 42.744 | 2.235 |
| 3 | 40,000 | 0.92 | 76.810 | 77.047 | 0.237 |
| 3 | 40,000 | 0.98 | 55.447 | 57.491 | 2.044 |
| 3 | 40,000 | 1.00 | 47.867 | 49.924 | 2.057 |
| 3 | 40,000 | 1.02 | 40.509 | 42.713 | 2.204 |
Table 3

| Parameters $a;b;\sigma $ | Number of Paths | Strike Values | BSM’s Approx. Values (%) | MQLV Values (%) | Absolute Difference |
| --- | --- | --- | --- | --- | --- |
| 0.01; 1; 0.10 | 50,000 | 0.98 | 59.856 | 61.223 | 1.366 |
| 0.01; 1; 0.10 | 50,000 | 1.00 | 48.562 | 50.001 | 1.439 |
| 0.01; 1; 0.10 | 50,000 | 1.02 | 37.596 | 39.044 | 1.447 |
| 0.01; 1; 0.30 | 50,000 | 0.98 | 49.558 | 53.647 | 4.089 |
| 0.01; 1; 0.30 | 50,000 | 1.00 | 45.767 | 49.997 | 4.230 |
| 0.01; 1; 0.30 | 50,000 | 1.02 | 42.088 | 46.194 | 4.106 |
| 0.10; 1; 0.15 | 50,000 | 0.98 | 55.447 | 57.540 | 2.093 |
| 0.10; 1; 0.15 | 50,000 | 1.00 | 47.867 | 50.015 | 2.148 |
| 0.10; 1; 0.15 | 50,000 | 1.02 | 40.509 | 42.638 | 2.129 |
| 0.30; 1; 0.15 | 50,000 | 0.98 | 55.447 | 57.586 | 2.139 |
| 0.30; 1; 0.15 | 50,000 | 1.00 | 47.867 | 50.022 | 2.155 |
| 0.30; 1; 0.15 | 50,000 | 1.02 | 40.509 | 42.542 | 2.033 |
We chose to generate new data sets with different Vasicek parameters $a$ and $\sigma $ to underline the potential of MQLV and the generality of the results. In table 3, we computed the event probabilities for different strikes for the newly generated data sets. The parameter $b$ remains unchanged since we want to keep a configuration free of any time dependency. We notice that MQLV is capable of estimating a probability of 50% for a strike of 1, which can only be obtained if MQLV is able to learn the optimal policy. We also observe that the BSM’s approximation leads to a lower accuracy. We showed in this experiment that our model-free and off-policy RL approach, MQLV, is able to learn the optimal policy, as reflected by the accurate probability values, independently of the data sets considered and of the Vasicek parameters.
Limitations of the BSM’s closed formula used for cross validation
In our experiments, we observed, surprisingly, that the BSM’s closed formula approximation underestimates the event probability values. The volatility is the only parameter playing a significant role in the generation of the time series. The Brownian motion is simulated with a standard normal distribution, which is symmetric around its mean, and therefore the event probability at a strike equal to the long-term mean should be equal to 0.5. The BSM’s closed formula did not, however, lead to a probability of 0.5 but to slightly smaller values because of the limits of its theoretical framework [black1973pricing, merton1973theory]. Hence, we observed that MQLV was more accurate than the BSM’s closed formula in our configuration.
5 Related Work
The foundations of modern reinforcement learning described in [sutton1984temporal, williams1987class] established the theoretical framework to learn good policies for sequential decision problems by proposing a formulation of the cumulative future reward signal. The Q-learning algorithm introduced in [watkins1989learning] is one of the cornerstones of all recent reinforcement learning publications. However, the question of the convergence of the Q-learning algorithm was only settled several years later. It was shown that the Q-learning algorithm with nonlinear function approximators [tsitsiklis1997analysis] and off-policy learning [baird1995residual] could cause the Q-network to diverge. Therefore, the reinforcement learning community focused on linear function approximators [tsitsiklis1997analysis] to ensure convergence.
The emergence of neural networks and deep learning [goodfellow2016deep] made it possible to combine reinforcement learning with neural networks. At an early stage, deep autoencoders were used to extract feature spaces to solve reinforcement learning tasks [lange2010deep]. Then, with the release of the Atari 2600 emulator [bellemare2013arcade], a public benchmark became available, answering the needs of the RL community for larger simulations. The Atari emulator allowed a proper performance benchmark of the different reinforcement learning algorithms and offered the possibility to test various architectures. The Atari games were used to introduce the concept of deep reinforcement learning [mnih2013playing, mnih2015human]. The authors used a convolutional neural network trained with a variant of Q-learning to successfully learn control policies directly from high-dimensional sensory inputs. They reached human-level performance on many of the Atari games. Shortly after, this approach was challenged by double Q-learning within a deep reinforcement learning framework [van2016deep]. The double Q-learning algorithm was initially introduced in [hasselt2010double] in a tabular setting. Double deep Q-learning gave more accurate estimates and led to much higher scores than those observed in [mnih2013playing, mnih2015human]. Consequently, ongoing work aims to further improve the results of the double deep Q-learning algorithms through different variants. In [dabney2018implicit], the authors used a quantile regression to approximate the full quantile function for the state-action return distribution, leading to a large class of risk-sensitive policies. It allowed them to further improve the scores on the Atari 2600 games simulator. Similarly, a new algorithm, called C51, which applies Bellman’s equation to the learning of the approximate value distribution, was designed in [bellemare2017distributional]. The authors showed state-of-the-art results on the Atari 2600 emulator.
Other publications meanwhile focused on model-free policies and the actor-critic framework. Stochastic policies were trained in [wawrzynski2013autonomous] with a replay buffer to avoid divergence. It was shown in [silver2014deterministic] that deterministic policy gradients (DPG) exist, even in a model-free environment. Subsequently, the DPG approach was extended in [balduzzi2015compatible] using a deviator network. Continuous control policies were learned using backpropagation, introducing the Stochastic Value Gradient SVG(0) and SVG(1), in [heess2015learning]. Recently, Deep Deterministic Policy Gradient (DDPG) was presented in [lillicrap2015continuous] to learn competitive policies using an actor-critic model-free algorithm based on the DPG that operates over continuous action spaces.
6 Conclusion
We introduced Modified Q-Learning for Vasicek, or MQLV, a new model-free and off-policy reinforcement learning approach capable of evaluating an optimal policy of money management based on the aggregated transactions of the clients. MQLV is part of a banking strategy that looks to minimize customer churn by including more transparency and more personalization in the decision process related to bank loan applications or credit card limits. It relies on a digital function to estimate the future probability of an event such as a payment default. We discussed its relation with the Bellman optimality equation and the Q-learning update. We conducted experiments on synthetic data sets because of the privacy and confidentiality issues related to retail banking data sets. The generated data sets followed a mean-reverting stochastic diffusion process, the Vasicek model, simulating retail banking data such as transaction payments. Our experiments showed the performance of MQLV with respect to the BSM’s closed formula for vanilla options. We also highlighted that MQLV is able to determine an optimal policy, an optimal Q-function, optimal actions and optimal states, as reflected by accurate probabilities. Surprisingly, we observed that MQLV led to more accurate event probabilities than the popular BSM’s formula.
Future work will address the creation of a fully anonymized data set illustrating retail banking daily transactions while ensuring privacy, confidentiality and regulatory compliance. We will also evaluate the performance of MQLV on data sets that violate the Vasicek assumptions. We, furthermore, observed that the Q-learning update could underestimate the real probability values for simulations involving a small temporal discretization such as $\mathrm{\Delta}t=200$. Preliminary results suggest that this is caused by the error of the basis function approximator. We will address this point in future research. Finally, we will extend the Q-learning update to other schemes for improved accuracy and incorporate a deep learning framework.