Abstract
We present the application of Restricted Boltzmann Machines (RBMs) to the task of astronomical image classification using a quantum annealer built by D-Wave Systems. Morphological analysis of galaxies provides critical information for studying their formation and evolution across cosmic timescales. We compress the images using principal component analysis to fit a representation on the quantum hardware. Then, we train RBMs with discriminative and generative algorithms, including contrastive divergence and hybrid generative-discriminative approaches. We compare these methods to Quantum Annealing (QA), Markov Chain Monte Carlo (MCMC) Gibbs Sampling, Simulated Annealing (SA) as well as machine learning algorithms like gradient boosted decision trees. We find that RBMs implemented on D-Wave hardware perform well, and that they show some classification performance advantages on small datasets, but they don't offer a broadly strategic advantage for this task. During this exploration, we analyzed the steps required for Boltzmann sampling with the D-Wave 2000Q, including a study of temperature estimation, and examined the impact of qubit noise by comparing and contrasting the original D-Wave 2000Q to the lower-noise version recently made available. While these analyses ultimately had minimal impact on the performance of the RBMs, we include them for reference.
1 Boltzmann Distributions on the D-Wave Quantum Annealer
In order to train a Boltzmann machine, we need to sample expectation values from a Boltzmann distribution with $\beta $ set to 1. We use the Kolmogorov-Smirnov (KS) test to check the statistical consistency of our sample distribution with that of a Boltzmann distribution.
1.1 Comparisons between initializing sampling with an annealer vs a random bitstring
The raw distribution of states coming from a D-Wave 2000Q is often not close to a Boltzmann distribution with $\beta =1$. It is “colder”, with a higher propensity for producing states at the lowest energy levels. This energy shift may be advantageous in optimization problems, but RBM training relies on being able to sample from a Boltzmann distribution, so post-processing is generally required.
Here, post-processing consists of taking a few steps of Gibbs sampling. To check how many steps are enough, we carry out the KS test after each step and keep taking Gibbs steps until the KS p-value rises above 0.05.
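As an illustration of this stopping rule, the following minimal sketch applies it to a toy $3\times 3$ RBM whose energy distribution can be enumerated exactly. Several pieces are assumptions and not part of the original analysis: the couplings are arbitrary, `cold_start` is a hypothetical stand-in for raw annealer output (exact samples at $\beta =4$), and the p-value threshold of 0.05 is expressed through the asymptotic one-sample KS critical value $1.358/\sqrt{n}$, which is conservative for a discrete energy spectrum.

```python
import itertools
import math
import random
from collections import defaultdict

# Toy 3x3 RBM with 0/1 units; the couplings are arbitrary illustrative values.
W = [[0.8, -0.5, 0.3], [-0.4, 0.9, -0.6], [0.5, -0.7, 0.4]]
b = [0.2, -0.1, 0.3]   # visible biases
c = [-0.3, 0.1, 0.2]   # hidden biases
NV, NH = len(b), len(c)

def energy(v, h):
    """E(v, h) = -b.v - c.h - v^T W h."""
    lin = sum(x * y for x, y in zip(b, v)) + sum(x * y for x, y in zip(c, h))
    return -(lin + sum(v[i] * W[i][j] * h[j] for i in range(NV) for j in range(NH)))

STATES = list(itertools.product((0, 1), repeat=NV + NH))
# Rounding merges energies that differ only by floating-point noise.
ENERGY = {s: round(energy(s[:NV], s[NV:]), 9) for s in STATES}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbs_step(state, rng):
    """One block-Gibbs sweep at beta = 1: sample h given v, then v given h."""
    v = state[:NV]
    h = tuple(int(rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(NV))))
              for j in range(NH))
    v = tuple(int(rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(NH))))
              for i in range(NV))
    return v + h

def ks_statistic(states):
    """KS distance between sampled energies and the exact beta = 1 energy CDF."""
    z = sum(math.exp(-e) for e in ENERGY.values())
    exact, empirical = defaultdict(float), defaultdict(float)
    for e in ENERGY.values():
        exact[e] += math.exp(-e) / z
    for s in states:
        empirical[ENERGY[s]] += 1.0 / len(states)
    f_exact = f_emp = dist = 0.0
    for e in sorted(exact):
        f_exact += exact[e]
        f_emp += empirical[e]
        dist = max(dist, abs(f_emp - f_exact))
    return dist

def cold_start(n, beta, rng):
    """Hypothetical stand-in for raw annealer output: exact samples at beta > 1."""
    weights = [math.exp(-beta * ENERGY[s]) for s in STATES]
    return rng.choices(STATES, weights=weights, k=n)

def steps_to_boltzmann(start_states, rng, max_steps=100):
    """Take Gibbs steps until the KS statistic drops below the 5% critical value."""
    states = list(start_states)
    crit = 1.358 / math.sqrt(len(states))
    history = [ks_statistic(states)]
    while history[-1] > crit and len(history) <= max_steps:
        states = [gibbs_step(s, rng) for s in states]
        history.append(ks_statistic(states))
    return len(history) - 1, history

rng = random.Random(0)
steps_cold, _ = steps_to_boltzmann(cold_start(2000, 4.0, rng), rng)
steps_rand, _ = steps_to_boltzmann([rng.choice(STATES) for _ in range(2000)], rng)
print("Gibbs steps needed (cold start, random start):", steps_cold, steps_rand)
```

The last lines run the same procedure from a cold start and from uniformly random bitstrings, mirroring the comparison made in this section; the step counts for a toy model say nothing about the behaviour on real hardware.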
The advantage of starting the post-processing from D-Wave 2000Q samples is not clear for some of the RBMs shown in Fig. 1, while for others Fig. 1 does show some advantage. In all cases, however, there are regions of couplings for which quite a few steps are needed, as shown in Fig. 1.
1.2 Temperature estimation
It is possible that the D-Wave returns a Boltzmann distribution, but at a temperature that needs to be determined. If we know the effective inverse temperature $\beta_{\text{eff}}$, we can sample from a distribution with $\beta =1$ and couplings $(W,b,c)$ by setting the couplings $J=(W/\beta_{\text{eff}},b/\beta_{\text{eff}},c/\beta_{\text{eff}})$ on the D-Wave. The effective temperature of the D-Wave has been shown to be problem-dependent and different from the physical temperature of the annealer [Amin2015]. In this work, we will follow a modification of the temperature estimation recipe proposed in [Benedetti2016].
The algorithm proceeds as follows:
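The rescaling can be made explicit in one line. The symbols $p_{\text{DW}}$, $E_J$ and $E_A$ are introduced here only for illustration: $p_{\text{DW}}$ is the distribution returned by the annealer, assumed to be exactly Boltzmann at $\beta_{\text{eff}}$, and $E_J$, $E_A$ are the energies of a state $s$ under the programmed couplings $J$ and the RBM couplings $A=(W,b,c)$. Because the energy is linear in the couplings, $J=A/\beta_{\text{eff}}$ gives $E_J(s)=E_A(s)/\beta_{\text{eff}}$, so
\begin{align}
p_{\text{DW}}(s) \propto \exp\left[-\beta_{\text{eff}}E_J(s)\right] = \exp\left[-\beta_{\text{eff}}\frac{E_A(s)}{\beta_{\text{eff}}}\right] = \exp\left[-E_A(s)\right],
\end{align}
which is the desired $\beta =1$ distribution for the couplings $A$.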

1. At each step, take RBM couplings $A=(W,b,c)$. Set the couplings on the D-Wave to $J_1=A/\beta_0$, with $\beta_0$ estimated at the previous step (on the first step, we need to take a guess).

2. Take one set of $n$ samples. We bin the samples into $\lceil \sqrt{2n}\rceil $ bins according to their energy, obtaining probability density estimates $n_1/n$.

3. We want a second set of samples that is different “enough” from the first for distinguishability. Following [Benedetti2016], we take $J_2=xJ_1$, with $x=1+1/(\beta_0\sigma )$, where $\sigma $ is the standard deviation of the first sample. (Footnote 1: [Benedetti2016] suggests transitioning to a $-$ sign in the expression for $x$ once the RBM couplings get large enough. We found that even at late stages, this would result in values of $x$ that are close to zero.)

4. Take a second set of samples and use the same bins as in step 2 to obtain probability density estimates $n_2/n$.

5. Denoting the Ising energy of each state with couplings $A$ as $E$, note that
\begin{align}
\frac{n_2}{n_1} &= \frac{e^{-x\beta_{\text{eff}}E/\beta_0}}{Z_2}\,\frac{Z_1}{e^{-\beta_{\text{eff}}E/\beta_0}} \\
\Rightarrow \log\frac{n_2}{n_1} &= \log\frac{Z_1}{Z_2}+(1-x)\frac{\beta_{\text{eff}}}{\beta_0}E.
\end{align}
With this in mind, we can extract an estimate of $\beta_{\text{eff}}$ from the slope of the linear regression between $\log (n_2/n_1)$ and the bin energies, as exemplified in Fig. 2. In order to reduce noise caused by bins with a small number of samples, we limit the regression to bins with at least five samples in both draws.
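The five steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the code used in this work: `mock_annealer` is a hypothetical stand-in for the hardware that returns exact Boltzmann samples at a hidden inverse temperature `beta_hidden`, the $2\times 2$ RBM couplings are arbitrary, $\sigma $ is taken to be the standard deviation of the first-draw energies under the couplings $A$, and bin centres are used as the bin energies.

```python
import itertools
import math
import random
import statistics

# Toy RBM couplings A = (W, b, c) for 0/1 units; arbitrary illustrative values.
W = [[0.6, -0.4], [-0.5, 0.7]]
b = [0.2, -0.3]
c = [-0.1, 0.4]

def energy(v, h):
    """E(v, h) = -b.v - c.h - v^T W h."""
    lin = sum(x * y for x, y in zip(b, v)) + sum(x * y for x, y in zip(c, h))
    return -(lin + sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2)))

# Energies of all 16 joint states under the couplings A.
ENERGIES = [energy(s[:2], s[2:]) for s in itertools.product((0, 1), repeat=4)]

def mock_annealer(scale, n, beta_hidden, rng):
    """Hypothetical sampler: with programmed couplings J = scale * A, return the
    energies (under A) of n states drawn from exp(-beta_hidden * scale * E)."""
    weights = [math.exp(-beta_hidden * scale * e) for e in ENERGIES]
    return rng.choices(ENERGIES, weights=weights, k=n)

def estimate_beta_eff(beta0, n, beta_hidden, rng, min_count=5):
    # Steps 1-2: program J1 = A / beta0 and draw the first sample.
    e1 = mock_annealer(1.0 / beta0, n, beta_hidden, rng)
    # Step 3: choose x from the spread of the first sample, J2 = x * J1.
    x = 1.0 + 1.0 / (beta0 * statistics.pstdev(e1))
    # Step 4: draw the second sample.
    e2 = mock_annealer(x / beta0, n, beta_hidden, rng)

    # Shared binning with ceil(sqrt(2n)) bins spanning the first sample.
    nbins = math.ceil(math.sqrt(2 * n))
    lo, hi = min(e1), max(e1)
    width = (hi - lo) / nbins

    def histogram(sample):
        counts = [0] * nbins
        for e in sample:
            k = min(max(int((e - lo) / width), 0), nbins - 1)
            counts[k] += 1
        return counts

    n1, n2 = histogram(e1), histogram(e2)
    # Step 5: regress log(n2 / n1) on bin energy, keeping well-populated bins.
    pts = [(lo + (k + 0.5) * width, math.log(n2[k] / n1[k]))
           for k in range(nbins) if n1[k] >= min_count and n2[k] >= min_count]
    mx = statistics.fmean(p[0] for p in pts)
    my = statistics.fmean(p[1] for p in pts)
    slope = (sum((ex - mx) * (y - my) for ex, y in pts)
             / sum((ex - mx) ** 2 for ex, _ in pts))
    # slope = (1 - x) * beta_eff / beta0, solved for beta_eff.
    return slope * beta0 / (1.0 - x)

print(estimate_beta_eff(beta0=2.0, n=50000, beta_hidden=2.5, rng=random.Random(0)))
```

In this toy setting every state is well populated in both draws, so the regression is stable; the estimate can then be fed back as $\beta_0$ for the next training step, as in step 1.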
We can see the results of this temperature estimation procedure throughout training in Fig. 3.
We found some pitfalls in this procedure. Namely, as the couplings of the RBM and therefore the magnitude of the energies involved grow, the distribution of states becomes more and more skewed towards the lower-energy states. This is a desirable outcome of training an RBM. However, this leaves the higher-energy bins with a small number of samples, causing large variance in the estimates of $\log (n_2/n_1)$. In all our training runs, this leads to a step where $\log (n_2/n_1)$ happens to fluctuate to a larger value than usual for some of the higher-energy bins. This causes $\beta_{\text{eff}}$ to be underestimated at that step. The effect compounds in a few training steps, often leading to negative estimates of $\beta_{\text{eff}}$ and a crash of the algorithm.
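The size of this variance can be gauged with a rough estimate. Assuming independent, approximately Poisson-distributed bin counts (an assumption made here for illustration, with $n_1$ and $n_2$ denoting the counts in a single bin), first-order error propagation gives
\begin{align}
\operatorname{Var}\left[\log \frac{n_2}{n_1}\right] \approx \frac{1}{n_1}+\frac{1}{n_2}.
\end{align}
A bin with $n_1=n_2=5$ then has a standard deviation of about $\sqrt{2/5}\approx 0.63$ in $\log (n_2/n_1)$, compared with about $0.063$ for a bin with $n_1=n_2=500$.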
Potential solutions include:

• Some regularization to keep the weights from growing. This successfully kept the temperature estimation routine from crashing, but at the cost of impairing the classifier performance of our RBM. This is to be expected, as a well-trained RBM should strongly separate the energies of different states.

• Only estimating $\beta $ during the initial stages of training. This can be a good solution, since $\beta $ does not seem to change by a large amount during training, as we can see in Fig. 3.
Even without a temperature estimation routine, weights growing too large is a general problem for the algorithm on a QA. If weights grow above the maximum coupling that can be implemented on the D-Wave, we must rescale the weights in order to set coupling constants on the D-Wave. However, discretization of the coupling constants means that if one weight is very large, subtle variations between much smaller weights are lost. Another possible solution would be to turn off weight rescaling, but not let couplings grow beyond what is physically implementable on the D-Wave. Conceptually, this is equivalent to allowing the RBM to learn chains of logical qubits that are strongly coupled. Either of these solutions can impair classifier performance, because an RBM may need very large weights, or a large ratio between some weights, to reproduce the probability distribution of the data.
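The trade-off between rescaling and clipping can be seen in a small numerical sketch. The maximum coupling `j_max` and the discretization step `step` below are hypothetical values chosen for illustration; they are not specifications of the D-Wave hardware.

```python
def program_couplings(weights, j_max=1.0, step=0.05, rescale=True):
    """Map RBM weights to discretized couplings.

    j_max and step are hypothetical illustration values, not hardware specs.
    rescale=True shrinks all weights so the largest fits within j_max;
    rescale=False leaves weights unchanged and clips those beyond j_max.
    """
    largest = max(abs(w) for w in weights)
    if rescale and largest > j_max:
        factor = j_max / largest
        scaled = [w * factor for w in weights]
    else:
        scaled = [max(-j_max, min(j_max, w)) for w in weights]
    # Round every coupling to the nearest multiple of the step.
    return [round(w / step) * step for w in scaled]

weights = [8.0, 0.30, 0.25]
print(program_couplings(weights, rescale=True))   # the two small weights collapse
print(program_couplings(weights, rescale=False))  # small weights kept, large one clipped
```

With rescaling, the two small weights land on the same discretized value and their difference is lost; with clipping, they remain distinct but the large weight is truncated to `j_max`.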
Finally, we have tested whether temperature estimation allows us to take fewer Gibbs steps to reach a Boltzmann distribution. The results of this test are shown in Fig. 1. Once again, the results do not always show a decisive advantage in the number of post-processing steps needed when using temperature estimation.
1.3 Noise and RBMs
D-Wave has recently released a low-noise version of its 2000Q quantum computer, with a claimed enhancement of tunneling rates by a factor of 7.4 [qubits_pres]. It is claimed that this leads to a larger diversity of states returned by the machine, as well as a larger proportion of lower-energy states. In this section, we test whether these lower-noise properties also help us obtain a more Boltzmann-like distribution from the D-Wave output. To do this, we train two $12\times 12$ RBMs using the temperature estimation techniques described in Sec. 1.2. One RBM was trained using the original 2000Q, and the other using the low-noise machine. Every 20 training steps, we compare the distribution obtained using the D-Wave machine with a Boltzmann distribution obtained from analytically calculated energies for the current RBM couplings. To compare the distributions, we use the Kolmogorov-Smirnov statistic, which should be close to zero for samples drawn from the same distribution. We compare the KS values as a function of the RBM weight distribution in each machine.
In Fig. 4, we show the mean of the KS statistic binned as a function of the mean and maximum RBM coupling. We see no advantage from using the lower-noise 2000Q in how Boltzmann-like the returned distributions are. For both machines, samples returned are not far from Boltzmann distributions (with KS statistics below 0.1) for low RBM weights, but the distributions diverge from Boltzmann as the weights grow larger.