Variational Quantum Circuits and Deep Reinforcement Learning

  • 2019-06-30 15:35:07
  • Samuel Yen-Chi Chen, Hsi-Sheng Goan
  • 0

Abstract

Recently, machine learning has prevailed in many academia and industrialapplications. At the same time, quantum computing, once seen as not realizable,has been brought to markets by several tech giants. However, these machines arenot fault-tolerant and can not execute very deep circuits. Therefore, it isurgent to design suitable algorithms and applications implementable on thesemachines. In this work, we demonstrate a novel approach which appliesvariational quantum circuits to deep reinforcement learning. With the proposedmethod, we can implement famous deep reinforcement learning algorithms such asexperience replay and target network with variational quantum circuits. In thisframework, with appropriate information encoding scheme, the possible quantumadvantage is the number of circuit parameters with $poly(\log{} N)$ compared to$poly(N)$ in conventional neural network where $N$ is the dimension of inputvectors. Such an approach can be deployed on near-term noisy intermediate-scalequantum machines.

 

Quick Read (beta)

Variational Quantum Circuits and Deep Reinforcement Learning

Samuel Yen-Chi Chen
Department of Physics
National Taiwan University
Taipei, Taiwan
[email protected]
&Hsi-Sheng Goan
Department of Physics
National Taiwan University
Taipei, Taiwan
[email protected]
Abstract

Recently, machine learning has prevailed in many academia and industrial applications. At the same time, quantum computing, once seen as not realizable, has been brought to markets by several tech giants. However, these machines are not fault-tolerant and can not execute very deep circuits. Therefore, it is urgent to design suitable algorithms and applications implementable on these machines. In this work, we demonstrate a novel approach which applies variational quantum circuits to deep reinforcement learning. With the proposed method, we can implement famous deep reinforcement learning algorithms such as experience replay and target network with variational quantum circuits. In this framework, with appropriate information encoding scheme, the possible quantum advantage is the number of circuit parameters with poly(logN) compared to poly(N) in conventional neural network where N is the dimension of input vectors. Such an approach can be deployed on near-term noisy intermediate-scale quantum machines.

 

Variational Quantum Circuits and Deep Reinforcement Learning


 
Samuel Yen-Chi Chen Department of Physics National Taiwan University Taipei, Taiwan [email protected] Hsi-Sheng Goan Department of Physics National Taiwan University Taipei, Taiwan [email protected]


Keywords Variational Quantum Circuit   Quantum Circuit Learning   Deep Reinforcement Learning   Machine Learning

1 Introduction

Machine Learning has grown to a vast subject in the recent decade and has already prevailed in many academic fields and has successful commercial applications. It has been successful in computer vision [32] [36] [39] , natural language processing [34], financial trading [12] [15] [28] [29] [33] and even the game of Go[30] which is previously considered formidable for machines. It also helped basics physics research such as quantum many-body physics [4] [8] [9] [16] [21] [24] [38] [37] , phase-transitions [7], quantum control [1] [14] and quantum error correction [2] [22]. In the meantime, quantum computing, previously thought to be not realizable, has been brought to the market by several startups and large tech companies. However, large circuits still cannot be deployed on these machines due to the lack of quantum error correction. Therefore, it is urgent to develop algorithms and applications which can be implemented on such devices. In view of the continuously increasing demand for machine learning applications in the near future and also the quest for more advanced artificial intelligence, this research is targeted at reinforcement learning, which is referred to an agent interacting with the environment to gain the knowledge of surroundings and then derive the policy of decision-making accordingly [35]. This kind of learning is the form which can be regarded as intelligence. The present work focuses on the applications of variational quantum circuits in reinforcement learning, especially in action-value function approximation, which is currently based on deep neural networks [19] [20]. With appropriate circuit design and encoding scheme, it is possible to have quantum advantages with less parameters compared to conventional neural networks because of quantum entanglement. Such circuits will be implementable on currently commercially available quantum computers which are Noisy Intermediate-Scale Quantum machines(NISQ) and will show invaluable applications in both academia and industries.

2 Methods

2.1 Reinforcement Learning

Reinforcement learning is a machine learning paradigm which an agent interacting with an environment over a number of discrete time steps[35]. At each time step t, the agent receives a state or observation st and then chooses an action at from the set of possible actions 𝒜 according to its policy π, where the policy is a function mapping the state st to action at. After executing the action at, the agent receives the state of next time step st+1 and a scalar reward rt. The process continues until the agent reach the terminal state. An episode is defined as an agent starting from a randomly selected initial state and following the above-mentioned process until a terminal state.

The return Rt=t=tTγt-trt is the total discounted return from time step t, where γ is the discount factor lies in (0,1]. The discount factor controls the importance of future rewards. For large γ, the agent weights future reward more and vice versa. The agent is to be trained to maximize the expected return from each state st. The action-value function or Q-value function Qπ(s,a)=𝔼[Rt|st=s,a] is the expected return for selecting action a in state s following the policy π. The optimal action value function Q*(s,a)=maxπQπ(s,a) gives the maximal action-value across all possible policy. The value of state s under policy π is Vπ(s)=𝔼[Rt|st=s], it is the expected return by following policy π from state s.

2.1.1 Deep Q-Learning

The action-value function Q(s,a) can be explicitly represented by a two dimensional table with total number of entries s×a. However, when the state space or the action space is large or even continuous the tabular method is unfeasible. In this situation, the action-value function is represented with function approximators such as neural networks[19][20].

Using neural networks as function approximators to represent the Q-value function has been studied extensively [19][20] and has already gained great success in a variety of tasks such as playing video games. In this setting, the action-value function Q(s,a;θ) is parameterized by θ, the θ can be derived by a series of iterations from a variety of optimization methods adapted from other machine learning tasks. The simplest form is the Q-learning. In this method, the goal is to directly approximate the optimal action-value function Q*(s,a) with the loss function

L(θ)=𝔼[(rt+γmaxaQ(s,a;θ-)-Q(s,a;θ))2] (1)

The target is rt+γmaxaQ(s,a;θ-) and the prediction is Q(s,a;θ), where s is the state encountered after playing action at at state s.

Reinforcement learning is known to be hard to converge or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value function [20]. There are several reasons to blame: The first is the correlation among the states or observations. It is obvious that each state or observation on a trajectory will be strongly correlated to each other. With this kind of training input, which is in general not iid sampling, the Q function may change dramatically, and therefore change the policy at relatively large scale. The second reason may lie in the fact that there is also large correlation between the action-value Q and the target values rt+γmaxaQ(s,a). In supervised learning, the targets are labels for the given inputs, and these labels will not change from time to time. However, under the setting of deep reinforcement learning, the target changes accordingly with the Q, causing the Q to be chasing an unstationary target.

The deep Q-learning presented in the work [20] addressed these issues through the following two mechanisms: Experience replay: To perform experience replay, we store each transition the agent encounters. The transition is stored as a tuple in the following form: (st,at,rt,st+1) at each time step t. When updating Q-learning parameters, randomly samples a batch of experiences from the replay memory and then perform gradient descent with the following loss function: L(θ)=𝔼replaybatch[(r+γmaxaQ(s,a;θ-)-Q(s,a;θ))2]. The key importance of experience replay is to lower the correlation of inputs for training the Q-function.
Target Network: θ- is the paramater of the target network and these parameters are only updated every C steps. This setting helps stabilizing the Q-value function training since the target is relatively stationary compared to the action-value function.

2.2 Variational Quantum Circuits

Variational quantum circuit is a kind of quantum circuits with tunable parameters subject to iterative optimization. These parameters can be seen as the weights in artificial neural networks. Due to the lack of quantum error correction in the near-term quantum devices, these variational circuits can relatively be robust against noises since these deviations are absorbed into the parameters during iterative optimization process. Previous work [13] [18] [26] have already demonstrate the capability of variational circuits representing function approximators, classifiers and even modeling quantum-many-body physics which are intractable on classical computers. For example, in the work [18], it has been shown that variational quantum circuits can be used to approximate analytical functions f(x) and such kind of quantum circuits are in general hard to be simulated by classical computers, which implied more expressive power than classical function approximators like neural network.

2.3 Variational Quantum Deep Q Learning

{algorithm}

[t] \[email protected]@algorithmic \StateInitialize replay memory 𝒟 to capacity N \StateInitialize action-value function circuit Q with random parameters \Forepisode =1,M \StateInitialise state s1 \Fort=1,T \StateWith probability ϵ select a random action at \Stateotherwise select at=maxaQ*(st,a;θ) \StateExecute action at in emulator and observe reward rt and next state st+1 \StateStore transition (st,at,rt,st+1) in 𝒟 \StateSample random minibatch of transitions (sj,aj,rj,sj+1) from 𝒟 \StateSet yj={rjfor terminal sj+1rj+γmaxaQ(sj+1,a;θ)for non-terminal sj+1 \StatePerform a gradient descent step on (yj-Q(sj,aj;θ))2 \EndFor\EndFor Variational Quantum Deep Q Learning In this work, we explore the possibility of expanding the capability of variational circuits to represent the action-value function. With the possible quantum advantages of having less parameters needed than conventional neural network, which is promising for modeling more complex environments. For target network, we construct two sets of circuit parameters while with the same circuit architecture. The target circuit is updated every 20 steps. For experience replay, the replay memory is set with the length of 80 and the training batch size is 5. The optimization in this work needs to calculate gradients of expectation values of quantum measurements, which can be done with the same circuit architecture and slightly different parameters [25]. We encode the state with computational basis encoding. In the FrozenLake environment, there are total 16 states, thus it needs 4 qubits to represent all states. For a general discussion about different encoding scheme, please refer to [27]

2.3.1 Computational Basis Encoding

For a general n-qubit state, it can be written as

|ψ=(q1,q2,,qn){0,1}ncq1,,qn|q1|q2|q3|qn

where cq1,,qn is the amplitude of each quantum state and each qn{0,1}.

The square of the amplitude is the probability of measurement with the output |q1|q2|q3|qn and the total probability should sum to 1.

(q1,q2,,qn){0,1}n||cq1,,qn||2=1

For the states in the environment considered in this work, each is labeled with a integer number in the range from 0 to 15. Each decimal number is first converted into a binary number and then encoded in the quantum states. For example, the state observed by the agent, say, 12, is first converted to binary number 1100. It is denoted the binary number as b1b2b3b4, therefore, the state indexed by the number 12 will be denoted as b1=1,b2=1,b3=0,b4=0. The quantum state for this can be denoted as |1|1|0|0. Next step is to construct the appropriate rotation to encode this state into the quantum circuit Fig1. The rotation angles are:

θi =π×bi
ϕi =π×bi

where i represents the index of each qubit, in this work, the total number of qubits is 4 therefore the index is the set {1,2,3,4}.

2.4 Numerical Simulation

We set up the experiment as follows Fig1. The quantum circuit as shown in the figure is numerically simulated with the software package PennyLane [3], in order to accelerate the simulation, the linear algebra manipulation is helped with the well-known package PyTorch [23].

The testing environment for all reinforcement learning agent is provided by OpenAI Gym[6], in this work, we choose a simple environment FrozenLake to implement the proof-of-concept experiments. The optimization is chosen to be RMSprop with learning rate=0.01,alpha=0.99 and eps=10-8, which is used widely in deep reinforcement learning. The batch-size for the experience replay is 5. The ϵ-greedy strategy used in this work is the following:

ϵϵepisode100+1

with initial ϵ=1.0 for encouraging more exploration in early episodes and shifting to more exploitation in later episodes.

Location Reward
HOLE -0.2
GOAL +1.0
OTHER -0.01
Table 1: The list of reward in our testing environment FrozenLake. Also note that the environment is non-slippery. Setting of this not only encourage the agent to achieve the goal but also encourage it to select the shortest path.
\Qcircuit @C=1em @R=1em \lstick | 0 ⟩ & \gateR_x(θ_1) & \gateR_z(ϕ_1) & \ctrl1 & \qw& \qw& \gateR(α_1, β_1, γ_1) & \meter\qw\lstick | 0 ⟩ & \gateR_x(θ_2) & \gateR_z(ϕ_2) & \targ& \ctrl1 & \qw& \gateR(α_2, β_2, γ_2) & \meter\qw\lstick | 0 ⟩ & \gateR_x(θ_3) & \gateR_z(ϕ_3) & \qw& \targ& \ctrl1 & \gateR(α_3, β_3, γ_3) & \meter\qw\lstick | 0 ⟩ & \gateR_x(θ_4) & \gateR_z(ϕ_4) & \qw& \qw& \targ& \gateR(α_4, β_4, γ_4) & \meter\gategroup1447.7em–\qw
Figure 1: Circuit architecture for the deep reinforcement learning. This is the circuit architecture for deep Q learning. The parameters labeled θ and ϕ are for state preparation and are not for iterative optimization. The CNOT gates are used to entangle states. Parameters labeled α, β and γ are the ones for optimization. Note that the grouped box may repeat several times thus increase the number of parameters. In this work, the grouped circuit repeat two times and therefore the total number of parameters subject to optimization is 4×3×2=24

3 Results

3.1 Numerical Simulations

In the numerical simulation, we ran 500 episodes and the agent converge to reward 0.9 after about 200 episodes. It is noted that, however, several sub-optimal results occurred. This phenomenon is probably due to the ϵ-greedy policy selection.
For the numerical simulation code and the parameters, please refer to the GitHub repository:
https://github.com/ycchen1989/VariationalQuantumCircuitsAndDeepReinforcementLearning

Figure 2: Variational quantum circuit for Deep Q Learning In 500 iterations, we can see that the agent reach the optimal policy with total reward about 0.9 after about 200 iterations.

4 Discussions

4.1 Relevant Works

For a general review in the field quantum machine learning, please refer to [10] and [11]. It is noted that some relevant works concerning quantum reinforcement learning is around. For example, [5] [17], in these works, a framework called projective simulation is proposed. The key concept in this setting is that the agent has a memory to keep the transition history and before executing each action, the agent will simulate several possible outcomes according to historical data in the memory. It is conceptually related to the well-known Monte-Carlo Tree Search in [30] [31] and it is interesting to investigate the quantum counterparts and possible quantum advantages.

4.2 Other RL Environments

In this work, we use a simple testing environment for the proof-of-concept experiment. The reasons are the following. First, for more complex reinforcement learning benchmark environments, for example, Atari games, the inputs are raw pixels. In these situations, we need much more qubits to conduct the simulations. Although it can be encoded with computational basis encoding or amplitude encoding, the number of qubits needed is intractable for numerical simulations on classical computers and also exceed commercially available quantum devices. It is possible to investigate these complex RL environments with the same setting proposed in this work when large quantum machines are released.

4.3 Other Encoding Scheme

Computational basis encoding in this work still not harvest the full quantum advantages compared to amplitude encoding. Although currently limited by the simulator, we still demonstrate the feasibility of applying quantum circuits to deep reinforcement learning problems. To gain the promising logarithmic less parameters with quantum circuits, amplitude encoding should be used. In that case, the complexity of parameters is poly(logN) while at the same input dimension N, conventional neural network is poly(N), therefore, it is promising to investigate the possibility of applying these circuits to represent the high-dimensional data or the more complex environments.

5 Conclusions

This is the first demonstration of variational quantum circuits to approximate the deep Q-value function with experience replay and target network. Under appropriate encoding method, the possible quantum advantage is poly(logN) compared to conventional neural network with complexity poly(N).

6 Acknowledgements

Thanks to Prof. Yun-Cheng Tsai for constructive discussion and great support. Thanks to Cheng-Lin Hong for helping on setting computational clusters.

References