Abstract
We propose a regression algorithm that utilizes a learned dictionaryoptimized for sparse inference on DWave quantum annealer. In this regressionalgorithm, we concatenate the independent and dependent variables as ancombined vector, and encode the highorder correlations between them into adictionary optimized for sparse reconstruction. On a test dataset, thedependent variable is initialized to its average value and then a sparsereconstruction of the combined vector is obtained in which the dependentvariable is typically shifted closer to its true value, as in a standardinpainting or denoising task. Here, a quantum annealer, which can presumablyexploit a fully entangled initial state to better explore the complex energylandscape, is used to solve the highly nonconvex sparse coding optimizationproblem. The regression algorithm is demonstrated for a lattice quantumchromodynamics simulation data using a DWave 2000Q quantum annealer and goodprediction performance is achieved. The regression test is performed using sixdifferent values for the number of fully connected logical qubits, between 20and 64, the latter being the maximum that can be embedded on the DWave 2000Q.The scaling results indicate that a larger number of qubits gives betterprediction accuracy, the best performance being comparable to the bestclassical regression algorithms reported so far.
Quick Read (beta)
A regression algorithm for accelerated lattice QCD that exploits sparse inference on the DWave quantum annealer
Abstract
We propose a regression algorithm that utilizes a learned dictionary optimized for sparse inference on DWave quantum annealer. In this regression algorithm, we concatenate the independent and dependent variables as an combined vector, and encode the highorder correlations between them into a dictionary optimized for sparse reconstruction. On a test dataset, the dependent variable is initialized to its average value and then a sparse reconstruction of the combined vector is obtained in which the dependent variable is typically shifted closer to its true value, as in a standard inpainting or denoising task. Here, a quantum annealer, which can presumably exploit a fully entangled initial state to better explore the complex energy landscape, is used to solve the highly nonconvex sparse coding optimization problem. The regression algorithm is demonstrated for a lattice quantum chromodynamics simulation data using a DWave 2000Q quantum annealer and good prediction performance is achieved. The regression test is performed using six different values for the number of fully connected logical qubits, between 20 and 64, the latter being the maximum that can be embedded on the DWave 2000Q. The scaling results indicate that a larger number of qubits gives better prediction accuracy, the best performance being comparable to the best classical regression algorithms reported so far.
I Introduction
Sparse coding refers to a class of unsupervised learning algorithms for finding an optimized set of bases vectors, or dictionary, for accurately reconstructing inputs drawn from any given dataset using the fewest number of nonzero coefficients. Sparse coding explains the selforganizing response properties of simple cells in the mammalian primary visual cortex olshausen96; olshausen_sparse_1997, and has been successfully applied in various fields including image classification yang2009; Coates:2011:IEV:3104482.3104598, image compression yijing, and compressed sensing Tao_Information; Donoho_Compressed. Optimizing a dictionary $\mathit{\varphi}\in {\mathbb{R}}^{M\times {N}_{q}}$ for a given dataset and inferring optimal sparse representations ${\bm{a}}^{(k)}\in {\mathbb{R}}^{{N}_{q}}$ of input data ${\mathbf{X}}^{(k)}\in {\mathbb{R}}^{M}$ involves finding solutions of the following minimization problem:
$\underset{\mathit{\varphi}}{\mathrm{min}}{\displaystyle \sum _{k}}\underset{{\bm{a}}^{(k)}}{\mathrm{min}}\left[{\displaystyle \frac{1}{2}}{{\mathbf{X}}^{(k)}\mathit{\varphi}{\bm{a}}^{(k)}}^{2}+\lambda {\bm{a}}_{0}\right],$  (1) 
where $k$ is the index of the input data, and $\lambda $ is the sparsity penalty parameter. Because of the ${L}_{0}$norm, the minimization problem falls into a NPhard complexity class with multiple local minima Natarajan:1995:SAS:207985.207987.
Recently, we developed a mapping of the ${\bm{a}}^{(k)}$optimization in Eq. (1) to the quadratic unconstrained binary optimization (QUBO) problem that can be solved on an quantum annealer and demonstrated its feasibility on the DWave systems Nguyen:2016; 8123653; Nguyen2018ImageCU. The quantum processing unit of the DWave systems realizes the quantum Ising spin system in a transverse field and finds the lowest or the nearlowest energy states of the classical Ising model,
$$  (2) 
using quantum annealing PhysRevE.58.5355; Finnila_1994; DWAVE. Here ${s}_{i}=\pm 1$ is the binary spin variable, ${h}_{i}$ and ${J}_{ij}$ are the qubit biases and coupling strengths that can be controlled by a user, and optimization for the Ising model is isomorphic to a QUBO problem with ${a}_{i}=({s}_{i}+1)/2$. By mapping the sparse coding to a QUBO structure, the sparse coefficients are restricted to binary variables ${a}_{i}\in \{0,1\}$. Despite the restriction, it was able to provide good sparse representation for the the MNIST lecunmnisthandwrittendigit2010; Nguyen:2016; Nguyen2018ImageCU and CIFAR10 cifar10; 8123653 images.
II Regression algorithm using sparse coding on DWave 2000Q
II.1 Regression model
Consider $N$ sets of training data ${\{{\mathbf{X}}^{(i)},{y}^{(i)}\}}_{i=1}^{N}$, and $M$ sets of the test data ${\{{\mathbf{X}}^{(j)}\}}_{j=1}^{M}$, where ${\mathbf{X}}^{(i)}\equiv \{{x}_{1}^{(i)},{x}_{2}^{(i)},\mathrm{\dots},{x}_{D}^{(i)}\}$ is an input vector known as the independent variable, and ${y}^{(i)}$ is an output variable known as the dependent variable. A regression model $F$ can be built by learning correlations between the input and output variables on the training dataset, so that it can make predictions $\widehat{y}$ of $y$ for an unseen input data $\mathbf{X}$ as
$F(\mathbf{X})=\widehat{y}\approx y.$  (3) 
Such regression model can be built using the sparse coding learning implemented on a quantum annealer described below.

•
Pretraining

(1)
Normalize ${x}_{d}^{(i)}$ and ${y}^{(i)}$ so that their standard deviations becomes comparable. One possible choice is rescaling the data to have a zero mean and a unit variance using the sample mean and sample variance of the training dataset. This procedure is an essential step for the regression algorithm as it makes the reconstruction error for each component comparable.

(2)
Using $\mathbf{X}$ in the test dataset ($M$) or those in the combined training and test datasets ($N+M$), perform sparse coding training and obtain the dictionary $\mathit{\varphi}$ for $\mathbf{X}$.

(1)

•
Training

(3)
Concatenate the input and output variables of the training dataset and build the concatenated vectors ${\stackrel{~}{\mathbf{X}}}^{(i)}\equiv \{{x}_{1}^{(i)},{x}_{2}^{(i)},\mathrm{\dots},{x}_{D}^{(i)},{y}^{(i)}\}$. Extend the dictionary matrix $\mathit{\varphi}\in {\mathbb{R}}^{D\times {N}_{q}}$ obtained in the pretraining to ${\stackrel{~}{\mathit{\varphi}}}_{o}\in {\mathbb{R}}^{(D+1)\times {N}_{q}}$, filling up the new elements by zeros.

(4)
Taking ${\stackrel{~}{\mathbf{X}}}^{(i)}$ as the input signal and ${\stackrel{~}{\mathit{\varphi}}}_{o}$ as an initial guess of the dictionary, perform sparse coding training on the training dataset and obtain the dictionary $\stackrel{~}{\mathit{\varphi}}$. Through this procedure, $\stackrel{~}{\mathit{\varphi}}$ will encode the correlation between ${x}_{d}^{(i)}$ and ${y}^{(i)}$.

(3)

•
Prediction

(5)
For the test dataset, for which only ${\mathbf{X}}^{(j)}$ is given, build a vector ${\stackrel{~}{\mathbf{X}}}_{o}^{(j)}\equiv \{{x}_{1}^{(j)},{x}_{2}^{(j)},\mathrm{\dots},{x}_{D}^{(j)},{\overline{y}}^{(j)}\}$, where ${\overline{y}}^{(j)}$ is an initial guess of ${y}^{(j)}$. One possible choice of ${\overline{y}}^{(j)}$ is the average value of ${y}^{(i)}$ in the training dataset.

(6)
Using the dictionary $\stackrel{~}{\mathit{\varphi}}$ obtained in (4), find sparse representation ${\bm{a}}^{(j)}$ for ${\stackrel{~}{\mathbf{X}}}_{o}^{(j)}$ and calculate reconstruction as ${\stackrel{~}{\mathbf{X}}}^{\prime (j)}=\stackrel{~}{\mathit{\varphi}}{\bm{a}}^{(j)}$. This replaces the outlier components, including ${\overline{y}}^{(j)}$, in ${\stackrel{~}{\mathbf{X}}}_{o}^{(j)}$ by the values that can be described by $\stackrel{~}{\mathit{\varphi}}$.

(7)
After inversenormalization, the $(D+1)$’th component of ${\stackrel{~}{\mathbf{X}}}^{\prime (j)}$ is the prediction of ${y}^{(j)}$: ${({\stackrel{~}{\mathbf{X}}}^{\prime (j)})}_{D+1}={\widehat{y}}^{(j)}\approx {y}^{(j)}$.

(5)
In this regression model, $D$ should be sufficiently large so that initial guess of the dependent variable ${\overline{y}}_{j}$ does not bias the reconstruction. This procedure can be extended to predict multiple variables by increasing the dimension of $y$, in exchange for prediction accuracy.
II.2 Sparse coding on DWave quantum annealer
The ${\bm{a}}^{(k)}$optimization of the sparse coding problem in Eq. (1), can be mapped onto the DWave problem in Eq. (2), by the following transformations Nguyen:2016; 8123653; Nguyen2018ImageCU:
${h}_{i}$  $={({\mathit{\varphi}}^{T}\mathbf{X}+(\lambda +{\displaystyle \frac{1}{2}}))}_{i},$  (4)  
${J}_{ij}$  $={({\mathit{\varphi}}^{T}\mathit{\varphi})}_{ij},$  
${s}_{i}$  $=2{a}_{i}1.$ 
Here the qubitqubit coupling $\bm{J}$ shares similarity with the lateral neuronneuron inhibition in the locally competitive algorithm rozell.06c, and the constant $\lambda $ makes the solution sparse by acting a constant field forcing the qubits to stay in ${a}_{i}=0$ (${s}_{i}=1$) state. By performing the quantum annealing for a given dictionary $\mathit{\varphi}$ and input data vector $\mathbf{X}$ with the transformations given in Eq. (4), one can obtain the optimal sparse representation $\bm{a}$. On the DWave systems, however, computing graph, known as the Chimera architecture, has limited qubit connectivity by design, while the sparse coding problem mapped to the QUBO form requires a fully connected qubit graph. Therefore, it requires an additional step called embedding, which translates the fullyconnected logical qubits to the partiallyconnected physical qubits on the Chimera structure, as described in Fig. 1.
The DWave 2000Q hardware consists of $2048$ quantum bits (qubits) arranged into $16\times 16$ unit cells of $8$ physical qubits forming the Chimera structure. After accounting the less than $3\%$ of defects that occurs during the calibration process, as temperature is cooled down to a superconducting threshold, the machine have approximately $2000$ qubits with larger than $6000$ local spinspin interactions. Using these qubits and connections, embedding an arbitrary QUBO problem onto the DWave 2000Q typically allows no more than ${N}_{q}=64$ fully connected logical qubits.
III Application to lattice QCD
QCD is a theory of quarks and gluons, which are the fundamental particles composing hadrons such as pions and protons, and their interactions. It is a part of the Standard Model of particle physics, and the theory has been demonstrated by large class of experiments over the decades Patrignani:2016xqp; Greensite:2011zz. Lattice QCD is an discrete formulation of QCD on an Euclidean space time lattice, which allows us to solve lowenergy QCD problems using computer simulations by carrying out the Feynman path integration using Monte Carlo methods Wilson:1974sk; Creutz:1980zw.
In lattice QCD simulations, large number of observables are calculated over an ensemble of the Gibbs samples of gluon fields, called the lattices, and computational cost for calculating those observables are expensive in modern simulations. However, the observables’ fluctuations over the statistical samples of the lattices are correlated as they share the same background lattice. By exploiting the correlation between them, in Ref. Yoon:2018krb, boosted decision tree (BDT) regression algorithm was able to replace the computationally expensive direct calculation of some observables by the computationally cheap machine learning predictions of them from other observables.
In this section, we apply the regression algorithm proposed in Section II to the lattice QCD simulation data used for the calculation of the chargeparity (CP) symmetry violating phase ${\alpha}_{\text{CPV}}$ of the neutron Yoon:2017tag; Pospelov:2005pr. Here we consider three types of observables: (1) twopoint correlation functions of neutrons calculated without CP violating (CPV) interactions ${C}_{\text{2pt}}$, (2) ${\gamma}_{5}$projected twopoint correlation functions of neutrons calculated without CPV interactions ${C}_{\text{2pt}}^{P}$, and (3) ${\gamma}_{5}$projected twopoint correlation functions of neutrons calculated with CPV interactions ${C}_{\text{2pt}}^{P,\text{CPV}}$, and the phase ${\alpha}_{\text{CPV}}$ is extracted from the imaginary part of ${C}_{\text{2pt}}^{P,\text{CPV}}$. Those observables are calculated at multiple values of the nucleon source and sink separations in Euclidean time direction $t$.
III.1 Method
Our goal of the regression problem is to predict the imaginary part of ${C}_{\text{2pt}}^{P,\text{CPV}}$ at $t=10a$ from the real and imaginary parts of the twopoint correlation functions calculated without CPV interactions, ${C}_{\text{2pt}}$ and ${C}_{\text{2pt}}^{P}$, at $t=8a,9a,10a,11a,$ and $12a$, where $a$ is the lattice spacing. It forms a problem with single value of output variable ($y$) and 20 values (two observables, real/imag, 5 timeslices) of the input variables ($\mathbf{X}$). In this application, we use 15616 data points of of these observables measured in Refs. Bhattacharya:2016oqm; Bhattacharya:2016rrc divided into 6976 training data and 8640 test data. Using this datasets, we follow the regression procedure proposed in Section II to predict $y$ of the test dataset that contains around 9K data points.
The procedure can be summarized as following. First, we standardize the total data using the mean and variance of the training dataset for normalization. Then, we perform the pretraining and obtained $\mathit{\varphi}$ for the 20 elements of $\mathbf{X}$ only using the test dataset. After appending the $y$ to $\mathbf{X}$ as the 21st element in the training dataset, we perform the sparse coding dictionary learning and update $\mathit{\varphi}$ to encode correlation between $\mathbf{X}$ and $y$. For prediction, input vectors $\mathbf{X}$ in the test dataset are augmented to dimension of 21 vectors by appending the average value of $y$, which is 0 after standardization. Finally, sparse coefficients $\bm{a}$ for the augmented input vectors are calculated with the fixed dictionary $\mathit{\varphi}$ obtained above, and predictions of $y$ are estimated by taking the 21st element of the reconstructed vectors on the test dataset.
Note that a sparse coding problem solves for the sparsest representation $\bm{a}$ and the dictionary $\mathit{\varphi}$, simultaneously, by minimizing Eq. (1). First, our optimization for $\bm{a}$ is performed using the DWave 2000Q at a given $\mathit{\varphi}$, whose initial guess is given, in general, by random numbers or via imprinting technique. Then, the optimization for $\mathit{\varphi}$ is performed on classical CPUs. The latter step is an offline learning for the fixed values of $\bm{a}$ obtained using DWave 2000Q. In the offline learning procedure, $\mathit{\varphi}$ is learned using the a batch stochastic gradient descent (SGD) algorithm:
$$\mathit{\varphi}:=\mathit{\varphi}\eta \frac{\partial {E}_{b}}{\partial \mathit{\varphi}},$$  (5) 
where ${E}_{b}=\frac{1}{{n}_{b}}{\sum}_{i=1}^{{n}_{b}}{E}_{i}$ with ${E}_{i}$ is the sparse coding energy function for a given input data given in Eq. (1), and $\eta $ is the learning rate. Batchlearning is used with the batch size of ${n}_{b}=50$. We repeat the iterative update of the quantum DWave inference for $\bm{a}$ and SGD learning for $\mathit{\varphi}$ a convergence is attained. On average, we find the convergence after $4$ or $5$ iterations. In this study, we use the SAPI2 python client libraries sapi for implementing DWave operations.
The sparsity of the sparse representation $\bm{a}$ associated with the sparsity penalty parameter $\lambda $ is calculated by the ratio of nonzero elements in $\bm{a}$. In this study, $\lambda $ is tuned to the values that make the average sparsity about 20%, because we find that the 20% of sparsity provides an optimal prediction performance, after examining a few different values of $\lambda $. Although the prediction performance could be further optimized by an extensive parameter search, such as that performed in Ref. Kenyon_phasetransition, the procedure is computationally expensive so ignored in this proofofprinciple study.
Note that the definition of the overcompleteness is not straightforward for the DWave inferred sparse coding because the input signal $\mathbf{X}$ may have arbitrary real numbers, while the sparse coefficients $\bm{a}$ could have only binary numbers of 0 or 1. Ignoring the subtlety, for simplicity, the overcompleteness $\gamma $ for the input signal of dimension 20 (or 21 for extended vectors) can be calculated by $\gamma ={N}_{q}/20$.
III.2 Results
Examples of the reconstruction and prediction from the randomly chosen test data points are visualized in Fig. 2. In the plot, first 20 elements are the input variables, and the element 21 is the output of the prediction algorithm. As one can see, the reconstruction of the $21$st element, which was 0 on their in initial guess, are successfully shifted close to their ground truth, as expected.
In order to investigate the prediction accuracy for different ${N}_{q}$, we explore the prediction algorithm with six different numbers of qubits ${N}_{q}=20,29,38,47,55$ and $64$, which corresponds to $\gamma \approx 1\sim 3$. Note that the larger ${N}_{q}$ implies the more difficult optimization problem, and ${N}_{q}=64$ is the maximum number of logical qubits that can be embedded onto the DWave 2000Q. In Fig. 3, we show the distribution of the normalized original data of the dependent variable ${y}^{(i)}$ and its prediction error ${\mathrm{\Delta}}^{(i)}$ defined by the difference between the ground truth ${y}^{(i)}$ and its prediction ${\widehat{y}}^{(i)}$: ${\mathrm{\Delta}}^{(i)}={y}^{(i)}{\widehat{y}}^{(i)}$. It is clearly demonstrated that (1) the prediction error is much smaller than the fluctuation of the original data, (2) the prediction error is sharply distributed near 0, which indicates no obvious bias in the prediction, and (3) the prediction error tends to be smaller when ${N}_{q}$ becomes larger.
To evaluate the prediction quality, the recovery of the $21$st element in the extended input vector, quantitatively, we calculate the ratio of the standard deviations of the prediction error and that of the original data: $Q\equiv \sigma (\mathrm{\Delta})/\sigma (y)$. $Q$ converges to $0$ when the prediction is precise, and $Q\ge 1$ indicates no prediction for a statistical data. Note that this definition of the prediction quality does not account for the bias of the prediction because the bias for the prediction of a statistical data can be removed by following the procedure introduced in Ref. Yoon:2018krb based on the variance reduction technique for lattice QCD calculations Bali:2009hu; Blum:2012uh.
Fig. 4 shows the prediction error $Q$ as a function of the number of qubits. It is clearly demonstrated that the systematic decrease of the prediction error as ${N}_{q}$ is increased. Although no theory explaining the scaling is known, we find that the scaling roughly follows the exponential decay ansatz ${Q}_{\mathrm{\infty}}+B\cdot \mathrm{exp}[C\cdot {N}_{q}]$. By fitting the ansatz to the data points, an asymptotic value of the prediction quality is obtained as ${Q}_{\mathrm{\infty}}\approx 0.18$ or $0.23$ for ${N}_{q}\to \mathrm{\infty}$, depending on whether we include ${N}_{q}=20$ data point or not in the fit. For a comparison, regression algorithms provided by the scikitlearn library scikitlearn on a classical computer are investigated for the same dataset, and BDT regression algorithm showed the best prediction performance with $Q=0.15$.
Pretraining is demonstrated to lower the prediction error of this regression algorithm, significantly. When performed the prediction with ${N}_{q}=64$ qubits without the pretraining procedure, we find that $Q=0.34$, while it becomes $Q=0.254$ with the pretraining. Without the pretraining, furthermore, we find that the required number of iterative updates of the DWave inference for $\bm{a}$ and SGD learning for $\mathit{\varphi}$ is increased to about 10 iterations.
IV Conclusion
In this paper, we proposed a regression algorithm using sparse coding dictionary learning that can be implemented on a quantum annealer, based on the formulation of a regression as an inpainting problem. A pretraining technique is introduced to improved the prediction quality. The procedure is described in Section II.1. The regression algorithm was numerically demonstrated using a set of lattice QCD simulation observables and was able to predict the correlation function calculated in presence of the CPV interactions from those calculated without the CPV interaction. The regression experiment is carried out using the DWave 2000Q quantum annealer with minor embedding technique in order to obtain fullyconnected logical qubits. The study is performed for six different values of the number of qubits between 20 and 64, and it showed systematic decrease of the prediction error as the number of qubits is increased (see Fig. 4). With larger number of qubits and elaborately tuned sparsity parameter, we expect further improved performance in future.