A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer

  • 2019-11-14 17:47:19
  • Nga T. T. Nguyen, Garrett T. Kenyon, Boram Yoon
  • 2

Abstract

We propose a regression algorithm that utilizes a learned dictionaryoptimized for sparse inference on D-Wave quantum annealer. In this regressionalgorithm, we concatenate the independent and dependent variables as ancombined vector, and encode the high-order correlations between them into adictionary optimized for sparse reconstruction. On a test dataset, thedependent variable is initialized to its average value and then a sparsereconstruction of the combined vector is obtained in which the dependentvariable is typically shifted closer to its true value, as in a standardinpainting or denoising task. Here, a quantum annealer, which can presumablyexploit a fully entangled initial state to better explore the complex energylandscape, is used to solve the highly non-convex sparse coding optimizationproblem. The regression algorithm is demonstrated for a lattice quantumchromodynamics simulation data using a D-Wave 2000Q quantum annealer and goodprediction performance is achieved. The regression test is performed using sixdifferent values for the number of fully connected logical qubits, between 20and 64, the latter being the maximum that can be embedded on the D-Wave 2000Q.The scaling results indicate that a larger number of qubits gives betterprediction accuracy, the best performance being comparable to the bestclassical regression algorithms reported so far.

 

Quick Read (beta)

A regression algorithm for accelerated lattice QCD that exploits sparse inference on the D-Wave quantum annealer

Nga T.T. Nguyen [email protected] Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA    Garrett T. Kenyon [email protected] Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA    Boram Yoon [email protected] Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
Abstract

We propose a regression algorithm that utilizes a learned dictionary optimized for sparse inference on D-Wave quantum annealer. In this regression algorithm, we concatenate the independent and dependent variables as an combined vector, and encode the high-order correlations between them into a dictionary optimized for sparse reconstruction. On a test dataset, the dependent variable is initialized to its average value and then a sparse reconstruction of the combined vector is obtained in which the dependent variable is typically shifted closer to its true value, as in a standard inpainting or denoising task. Here, a quantum annealer, which can presumably exploit a fully entangled initial state to better explore the complex energy landscape, is used to solve the highly non-convex sparse coding optimization problem. The regression algorithm is demonstrated for a lattice quantum chromodynamics simulation data using a D-Wave 2000Q quantum annealer and good prediction performance is achieved. The regression test is performed using six different values for the number of fully connected logical qubits, between 20 and 64, the latter being the maximum that can be embedded on the D-Wave 2000Q. The scaling results indicate that a larger number of qubits gives better prediction accuracy, the best performance being comparable to the best classical regression algorithms reported so far.

I Introduction

Sparse coding refers to a class of unsupervised learning algorithms for finding an optimized set of bases vectors, or dictionary, for accurately reconstructing inputs drawn from any given dataset using the fewest number of non-zero coefficients. Sparse coding explains the self-organizing response properties of simple cells in the mammalian primary visual cortex olshausen96; olshausen_sparse_1997, and has been successfully applied in various fields including image classification yang2009; Coates:2011:IEV:3104482.3104598, image compression yijing, and compressed sensing Tao_Information; Donoho_Compressed. Optimizing a dictionary ϕM×Nq for a given dataset and inferring optimal sparse representations 𝒂(k)Nq of input data 𝐗(k)M involves finding solutions of the following minimization problem:

minϕkmin𝒂(k)[12||𝐗(k)-ϕ𝒂(k)||2+λ||𝒂||0], (1)

where k is the index of the input data, and λ is the sparsity penalty parameter. Because of the L0-norm, the minimization problem falls into a NP-hard complexity class with multiple local minima Natarajan:1995:SAS:207985.207987.

Recently, we developed a mapping of the 𝒂(k)-optimization in Eq. (1) to the quadratic unconstrained binary optimization (QUBO) problem that can be solved on an quantum annealer and demonstrated its feasibility on the D-Wave systems Nguyen:2016; 8123653; Nguyen2018ImageCU. The quantum processing unit of the D-Wave systems realizes the quantum Ising spin system in a transverse field and finds the lowest or the near-lowest energy states of the classical Ising model,

H(𝒉,𝑱,𝒔)=iNqhisi+i<jNqJijsisj, (2)

using quantum annealing PhysRevE.58.5355; Finnila_1994; DWAVE. Here si=±1 is the binary spin variable, hi and Jij are the qubit biases and coupling strengths that can be controlled by a user, and optimization for the Ising model is isomorphic to a QUBO problem with ai=(si+1)/2. By mapping the sparse coding to a QUBO structure, the sparse coefficients are restricted to binary variables ai{0,1}. Despite the restriction, it was able to provide good sparse representation for the the MNIST lecun-mnisthandwrittendigit-2010; Nguyen:2016; Nguyen2018ImageCU and CIFAR-10 cifar10; 8123653 images.

In this paper, we propose a regression algorithm using the sparse coding on D-Wave 2000Q in Section II and apply the algorithm to a prediction of quantum chromodynamics (QCD) simulation observable in Section III.

II Regression algorithm using sparse coding on D-Wave 2000Q

II.1 Regression model

Consider N sets of training data {𝐗(i),y(i)}i=1N, and M sets of the test data {𝐗(j)}j=1M, where 𝐗(i){x1(i),x2(i),,xD(i)} is an input vector known as the independent variable, and y(i) is an output variable known as the dependent variable. A regression model F can be built by learning correlations between the input and output variables on the training dataset, so that it can make predictions y^ of y for an unseen input data 𝐗 as

F(𝐗)=y^y. (3)

Such regression model can be built using the sparse coding learning implemented on a quantum annealer described below.

  • Pre-training

    • (1)

      Normalize xd(i) and y(i) so that their standard deviations becomes comparable. One possible choice is rescaling the data to have a zero mean and a unit variance using the sample mean and sample variance of the training dataset. This procedure is an essential step for the regression algorithm as it makes the reconstruction error for each component comparable.

    • (2)

      Using 𝐗 in the test dataset (M) or those in the combined training and test datasets (N+M), perform sparse coding training and obtain the dictionary ϕ for 𝐗.

  • Training

    • (3)

      Concatenate the input and output variables of the training dataset and build the concatenated vectors 𝐗~(i){x1(i),x2(i),,xD(i),y(i)}. Extend the dictionary matrix ϕD×Nq obtained in the pre-training to ϕ~o(D+1)×Nq, filling up the new elements by zeros.

    • (4)

      Taking 𝐗~(i) as the input signal and ϕ~o as an initial guess of the dictionary, perform sparse coding training on the training dataset and obtain the dictionary ϕ~. Through this procedure, ϕ~ will encode the correlation between xd(i) and y(i).

  • Prediction

    • (5)

      For the test dataset, for which only 𝐗(j) is given, build a vector 𝐗~o(j){x1(j),x2(j),,xD(j),y¯(j)}, where y¯(j) is an initial guess of y(j). One possible choice of y¯(j) is the average value of y(i) in the training dataset.

    • (6)

      Using the dictionary ϕ~ obtained in (4), find sparse representation 𝒂(j) for 𝐗~o(j) and calculate reconstruction as 𝐗~(j)=ϕ~𝒂(j). This replaces the outlier components, including y¯(j), in 𝐗~o(j) by the values that can be described by ϕ~.

    • (7)

      After inverse-normalization, the (D+1)’th component of 𝐗~(j) is the prediction of y(j): (𝐗~(j))D+1=y^(j)y(j).

In this regression model, D should be sufficiently large so that initial guess of the dependent variable y¯j does not bias the reconstruction. This procedure can be extended to predict multiple variables by increasing the dimension of y, in exchange for prediction accuracy.

II.2 Sparse coding on D-Wave quantum annealer

The 𝒂(k)-optimization of the sparse coding problem in Eq. (1), can be mapped onto the D-Wave problem in Eq. (2), by the following transformations Nguyen:2016; 8123653; Nguyen2018ImageCU:

hi =(-ϕT𝐗+(λ+12))i, (4)
Jij =(ϕTϕ)ij,
si =2ai-1.

Here the qubit-qubit coupling 𝑱 shares similarity with the lateral neuron-neuron inhibition in the locally competitive algorithm rozell.06c, and the constant λ makes the solution sparse by acting a constant field forcing the qubits to stay in ai=0 (si=-1) state. By performing the quantum annealing for a given dictionary ϕ and input data vector 𝐗 with the transformations given in Eq. (4), one can obtain the optimal sparse representation 𝒂. On the D-Wave systems, however, computing graph, known as the Chimera architecture, has limited qubit connectivity by design, while the sparse coding problem mapped to the QUBO form requires a fully connected qubit graph. Therefore, it requires an additional step called embedding, which translates the fully-connected logical qubits to the partially-connected physical qubits on the Chimera structure, as described in Fig. 1.

Figure 1: A subset (1/64) of the Chimera structure of the D-Wave 2000Q consisting of 32 qubits (circles) arranged in 2×2 matrix of unit cells of 8 qubits. The qubits within a unit cell have relatively dense connections, while the interactions between the unit cells can be made through the sparse connections in their edges. This figure also shows an example of embedding 6 fully-connected logical qubits (numbers from 1 to 6 inside 14 circles) onto the D-Wave chimera using 14 physical qubits, in which red edges indicate bipartite couplings between qubits while blue edges indicate chained qubits. After such embedding, for example, the logical qubit 1 is mapped to two physical qubits tiled from one qubit in the top right and one in the bottom right unit cell, while the logical qubit 2 mapped to three physical qubits tiled from two qubits in the top left and one qubit in the top right unit cell, and so forth.

The D-Wave 2000Q hardware consists of 2048 quantum bits (qubits) arranged into 16×16 unit cells of 8 physical qubits forming the Chimera structure. After accounting the less than 3% of defects that occurs during the calibration process, as temperature is cooled down to a superconducting threshold, the machine have approximately 2000 qubits with larger than 6000 local spin-spin interactions. Using these qubits and connections, embedding an arbitrary QUBO problem onto the D-Wave 2000Q typically allows no more than Nq=64 fully connected logical qubits.

III Application to lattice QCD

QCD is a theory of quarks and gluons, which are the fundamental particles composing hadrons such as pions and protons, and their interactions. It is a part of the Standard Model of particle physics, and the theory has been demonstrated by large class of experiments over the decades Patrignani:2016xqp; Greensite:2011zz. Lattice QCD is an discrete formulation of QCD on an Euclidean space time lattice, which allows us to solve low-energy QCD problems using computer simulations by carrying out the Feynman path integration using Monte Carlo methods Wilson:1974sk; Creutz:1980zw.

In lattice QCD simulations, large number of observables are calculated over an ensemble of the Gibbs samples of gluon fields, called the lattices, and computational cost for calculating those observables are expensive in modern simulations. However, the observables’ fluctuations over the statistical samples of the lattices are correlated as they share the same background lattice. By exploiting the correlation between them, in Ref. Yoon:2018krb, boosted decision tree (BDT) regression algorithm was able to replace the computationally expensive direct calculation of some observables by the computationally cheap machine learning predictions of them from other observables.

In this section, we apply the regression algorithm proposed in Section II to the lattice QCD simulation data used for the calculation of the charge-parity (CP) symmetry violating phase αCPV of the neutron Yoon:2017tag; Pospelov:2005pr. Here we consider three types of observables: (1) two-point correlation functions of neutrons calculated without CP violating (CPV) interactions C2pt, (2) γ5-projected two-point correlation functions of neutrons calculated without CPV interactions C2ptP, and (3) γ5-projected two-point correlation functions of neutrons calculated with CPV interactions C2ptP,CPV, and the phase αCPV is extracted from the imaginary part of C2ptP,CPV. Those observables are calculated at multiple values of the nucleon source and sink separations in Euclidean time direction t.

III.1 Method

Our goal of the regression problem is to predict the imaginary part of C2ptP,CPV at t=10a from the real and imaginary parts of the two-point correlation functions calculated without CPV interactions, C2pt and C2ptP, at t=8a,9a,10a,11a, and 12a, where a is the lattice spacing. It forms a problem with single value of output variable (y) and 20 values (two observables, real/imag, 5 timeslices) of the input variables (𝐗). In this application, we use 15616 data points of of these observables measured in Refs. Bhattacharya:2016oqm; Bhattacharya:2016rrc divided into 6976 training data and 8640 test data. Using this datasets, we follow the regression procedure proposed in Section II to predict y of the test dataset that contains around 9K data points.

The procedure can be summarized as following. First, we standardize the total data using the mean and variance of the training dataset for normalization. Then, we perform the pre-training and obtained ϕ for the 20 elements of 𝐗 only using the test dataset. After appending the y to 𝐗 as the 21st element in the training dataset, we perform the sparse coding dictionary learning and update ϕ to encode correlation between 𝐗 and y. For prediction, input vectors 𝐗 in the test dataset are augmented to dimension of 21 vectors by appending the average value of y, which is 0 after standardization. Finally, sparse coefficients 𝒂 for the augmented input vectors are calculated with the fixed dictionary ϕ obtained above, and predictions of y are estimated by taking the 21st element of the reconstructed vectors on the test dataset.

Figure 2: Original, or ground truth, data (blue circles) and the reconstruction from the missing-21-element data using D-Wave 2000Q with Nq=64 (red squares) for two randomly chosen data points. Here the 21-element is the dependent variable of the prediction, whose initial value before the reconstruction is given by 0.

Note that a sparse coding problem solves for the sparsest representation 𝒂 and the dictionary ϕ, simultaneously, by minimizing Eq. (1). First, our optimization for 𝒂 is performed using the D-Wave 2000Q at a given ϕ, whose initial guess is given, in general, by random numbers or via imprinting technique. Then, the optimization for ϕ is performed on classical CPUs. The latter step is an offline learning for the fixed values of 𝒂 obtained using D-Wave 2000Q. In the offline learning procedure, ϕ is learned using the a batch stochastic gradient descent (SGD) algorithm:

ϕ:=ϕ-ηEbϕ, (5)

where Eb=1nbi=1nbEi with Ei is the sparse coding energy function for a given input data given in Eq. (1), and η is the learning rate. Batch-learning is used with the batch size of nb=50. We repeat the iterative update of the quantum D-Wave inference for 𝒂 and SGD learning for ϕ a convergence is attained. On average, we find the convergence after 4 or 5 iterations. In this study, we use the SAPI2 python client libraries sapi for implementing D-Wave operations.

The sparsity of the sparse representation 𝒂 associated with the sparsity penalty parameter λ is calculated by the ratio of nonzero elements in 𝒂. In this study, λ is tuned to the values that make the average sparsity about 20%, because we find that the 20% of sparsity provides an optimal prediction performance, after examining a few different values of λ. Although the prediction performance could be further optimized by an extensive parameter search, such as that performed in Ref. Kenyon_phasetransition, the procedure is computationally expensive so ignored in this proof-of-principle study.

Note that the definition of the overcompleteness is not straightforward for the D-Wave inferred sparse coding because the input signal 𝐗 may have arbitrary real numbers, while the sparse coefficients 𝒂 could have only binary numbers of 0 or 1. Ignoring the subtlety, for simplicity, the overcompleteness γ for the input signal of dimension 20 (or 21 for extended vectors) can be calculated by γ=Nq/20.

III.2 Results

Examples of the reconstruction and prediction from the randomly chosen test data points are visualized in Fig. 2. In the plot, first 20 elements are the input variables, and the element 21 is the output of the prediction algorithm. As one can see, the reconstruction of the 21st element, which was 0 on their in initial guess, are successfully shifted close to their ground truth, as expected.

Figure 3: Distribution of the prediction error Δ(i) of the 21st element plotted against the distribution of the ground truth for different numbers of qubits Nq=20,29,38,47,55, and 64. The narrower width of the prediction error indicates the better prediction. Standard deviations of the prediction errors for Nq=20,29,38,47,55, and 64 are 0.41, 0.375, 0.319, 0.29, 0.273 and 0.254, respectively. Scaling of the prediction error is summarized in Fig. 4.

In order to investigate the prediction accuracy for different Nq, we explore the prediction algorithm with six different numbers of qubits Nq=20,29,38,47,55 and 64, which corresponds to γ13. Note that the larger Nq implies the more difficult optimization problem, and Nq=64 is the maximum number of logical qubits that can be embedded onto the D-Wave 2000Q. In Fig. 3, we show the distribution of the normalized original data of the dependent variable y(i) and its prediction error Δ(i) defined by the difference between the ground truth y(i) and its prediction y^(i): Δ(i)=y(i)-y^(i). It is clearly demonstrated that (1) the prediction error is much smaller than the fluctuation of the original data, (2) the prediction error is sharply distributed near 0, which indicates no obvious bias in the prediction, and (3) the prediction error tends to be smaller when Nq becomes larger.

To evaluate the prediction quality, the recovery of the 21st element in the extended input vector, quantitatively, we calculate the ratio of the standard deviations of the prediction error and that of the original data: Qσ(Δ)/σ(y). Q converges to 0 when the prediction is precise, and Q1 indicates no prediction for a statistical data. Note that this definition of the prediction quality does not account for the bias of the prediction because the bias for the prediction of a statistical data can be removed by following the procedure introduced in Ref. Yoon:2018krb based on the variance reduction technique for lattice QCD calculations Bali:2009hu; Blum:2012uh.

Fig. 4 shows the prediction error Q as a function of the number of qubits. It is clearly demonstrated that the systematic decrease of the prediction error as Nq is increased. Although no theory explaining the scaling is known, we find that the scaling roughly follows the exponential decay ansatz Q+Bexp[-CNq]. By fitting the ansatz to the data points, an asymptotic value of the prediction quality is obtained as Q0.18 or 0.23 for Nq, depending on whether we include Nq=20 data point or not in the fit. For a comparison, regression algorithms provided by the scikit-learn library scikit-learn on a classical computer are investigated for the same dataset, and BDT regression algorithm showed the best prediction performance with Q=0.15.

Figure 4: Prediction error Q with the sparse coding regression algorithm implemented on D-Wave 2000Q applied to prediction of the lattice QCD simulation data as a function of Nq (red squares). An exponential ansatz is fitted to the data points for all Nq (blue solid line) and those excluding Nq=20 (green dashed line).

Pre-training is demonstrated to lower the prediction error of this regression algorithm, significantly. When performed the prediction with Nq=64 qubits without the pre-training procedure, we find that Q=0.34, while it becomes Q=0.254 with the pre-training. Without the pre-training, furthermore, we find that the required number of iterative updates of the D-Wave inference for 𝒂 and SGD learning for ϕ is increased to about 10 iterations.

IV Conclusion

In this paper, we proposed a regression algorithm using sparse coding dictionary learning that can be implemented on a quantum annealer, based on the formulation of a regression as an inpainting problem. A pre-training technique is introduced to improved the prediction quality. The procedure is described in Section II.1. The regression algorithm was numerically demonstrated using a set of lattice QCD simulation observables and was able to predict the correlation function calculated in presence of the CPV interactions from those calculated without the CPV interaction. The regression experiment is carried out using the D-Wave 2000Q quantum annealer with minor embedding technique in order to obtain fully-connected logical qubits. The study is performed for six different values of the number of qubits between 20 and 64, and it showed systematic decrease of the prediction error as the number of qubits is increased (see Fig. 4). With larger number of qubits and elaborately tuned sparsity parameter, we expect further improved performance in future.

Acknowledgements.
The sparse coding optimizations were carried out using the D-Wave 2000Q at Los Alamos National Laboratory (LANL). Simulation data used for numerical experiment were generated using the computer facilities at (i)the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; and, (ii) the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725; (iii) the USQCD Collaboration, which are funded by the Office of Science of the U.S. Department of Energy, (iv) Institutional Computing at Los Alamos National Laboratory. This work was supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics under Contract No. 89233218CNA000001. BY also acknowledges support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Nuclear Physics, Scientific Discovery through Advanced Computing (SciDAC) program, and the LANL LDRD program.

References