Restricted Boltzmann Machines for galaxy morphology classification with a quantum annealer

  • 2019-11-14 17:32:30
  • João Caldeira, Joshua Job, Steven H. Adachi, Brian Nord, Gabriel N. Perdue

Abstract

We present the application of Restricted Boltzmann Machines (RBMs) to the task of astronomical image classification using a quantum annealer built by D-Wave Systems. Morphological analysis of galaxies provides critical information for studying their formation and evolution across cosmic time scales. We compress the images using principal component analysis to fit a representation on the quantum hardware. Then, we train RBMs with discriminative and generative algorithms, including contrastive divergence and hybrid generative-discriminative approaches. We compare these methods to Quantum Annealing (QA), Markov Chain Monte Carlo (MCMC) Gibbs Sampling, Simulated Annealing (SA), as well as machine learning algorithms like gradient boosted decision trees. We find that RBMs implemented on D-Wave hardware perform well, and that they show some classification performance advantages on small datasets, but they don't offer a broadly strategic advantage for this task. During this exploration, we analyzed the steps required for Boltzmann sampling with the D-Wave 2000Q, including a study of temperature estimation, and examined the impact of qubit noise by comparing and contrasting the original D-Wave 2000Q to the lower-noise version recently made available. While these analyses ultimately had minimal impact on the performance of the RBMs, we include them for reference.

 


1 Boltzmann Distributions on the D-Wave Quantum Annealer

In order to train a Boltzmann machine, we need to estimate expectation values by sampling from a Boltzmann distribution with β set to 1. We use the Kolmogorov-Smirnov (KS) test to check the statistical consistency of our sample distribution with that of a Boltzmann distribution.
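As an illustration of the kind of check involved, the following minimal sketch computes a two-sample KS statistic and an asymptotic p-value for two lists of sample energies. It is not the implementation used in this work: the choice of a two-sample test on energies, the function names, and the standard small-sample correction to the asymptotic formula are all assumptions made for the example.

```python
import math

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Advance past all copies of v in both samples (handles tied energies).
        while i < len(a) and a[i] == v:
            i += 1
        while j < len(b) and b[j] == v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_pvalue(d, n_a, n_b, terms=100):
    """Asymptotic p-value from the Kolmogorov distribution (approximate)."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The alternating series converges slowly here; the p-value is ~1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))
```

Here `a` could hold the energies of the samples under test and `b` the energies of reference samples drawn from the target Boltzmann distribution. A small p-value indicates that the two sets are unlikely to come from the same distribution.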

1.1 Comparisons between initializing sampling with an annealer vs a random bitstring

The raw distribution of states coming from a D-Wave 2000Q is often not close to a Boltzmann distribution with β=1. It is “colder”, with a higher propensity for producing states at the lowest energy levels. This energy shift may be advantageous in optimization problems, but RBM training relies on being able to sample from a Boltzmann distribution, so post-processing is generally required.

Here, post-processing consists of taking a few steps of Gibbs sampling. In this section, to check how many steps are enough, we carry out the KS test after each step and keep taking Gibbs steps until the KS p-value rises above 0.05.
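The post-processing loop can be sketched as follows. This is a minimal illustration, not the code used for the results: it assumes units taking values in {0,1} and the energy convention E(v,h) = -v·W·h - b·v - c·h. The helper names and the `accept` callback are hypothetical. The callback stands in for the KS check (for example, "p-value above 0.05"), and the cap of 200 steps matches the one quoted in the caption of Fig. 1.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gibbs_step(v, W, b, c, rng):
    """One block Gibbs step v -> h -> v' for a binary RBM (assumed {0,1} units)."""
    h = [1 if rng.random() < sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
         for j in range(len(c))]
    v_new = [1 if rng.random() < sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
             for i in range(len(b))]
    return v_new, h

def gibbs_until(samples, W, b, c, accept, max_steps=200, seed=0):
    """Apply Gibbs steps to every sample until accept(samples) is true or the cap is hit."""
    rng = random.Random(seed)
    steps = 0
    while steps < max_steps and not accept(samples):
        samples = [gibbs_step(v, W, b, c, rng)[0] for v in samples]
        steps += 1
    return samples, steps
```

The initial `samples` can be either visible-unit states read out from the annealer or random bitstrings, which is the comparison made in this section.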

The advantage of starting the post-processing from D-Wave 2000Q samples is not clear for some of the RBMs shown in Fig. 1, for instance for the 48×48 RBM trained on the low-noise 2000Q. On the other hand, other panels of Fig. 1 show some advantage. In all cases, however, there are regions of couplings for which we need quite a few steps, as shown in Fig. 1.

  • Couplings are obtained by training a 12×12 RBM on the 2000Q with 10 Gibbs steps. On this test, there is some advantage to using a D-Wave, especially after estimating β.

  • Couplings are obtained by training a 12×12 RBM on the low-noise 2000Q with 10 Gibbs steps. On this test, the D-Wave needs fewer steps than starting from a random string, though applying β estimation does not help further.

  • Couplings are obtained by training a 48×48 RBM on the 2000Q with 10 Gibbs steps. In this case, the D-Wave shows some advantage over random strings.

  • Couplings are obtained by training a 48×48 RBM on the low-noise 2000Q with 10 Gibbs steps. For these RBM couplings, starting from a random string or the D-Wave samples does not lead to a significantly different number of Gibbs samples needed.

Figure 1: We present results of the test described in section 1.1, applied to the D-Wave samples after setting the D-Wave couplings to the actual RBM couplings, and to the RBM couplings scaled by an estimated temperature as in section 1.2. Note that the number of Gibbs steps taken until a p-value of 0.05 was reached was capped at 200, and that is why there is some bunching of values at 200 Gibbs steps.

1.2 Temperature estimation

It is possible that the D-Wave returns a Boltzmann distribution, but at a temperature that needs to be determined. If we know the effective inverse temperature β_eff, we can sample from a distribution with β=1 and couplings (W,b,c) by setting the couplings J=(W/β_eff, b/β_eff, c/β_eff) on the D-Wave. The effective temperature of the D-Wave has been shown to be problem-dependent and different from the physical temperature of the annealer [Amin2015]. In this work, we will follow a modification of the temperature estimation recipe proposed in [Benedetti2016].
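The reason the rescaling works can be written out in one line. The sketch below assumes that the annealer samples exactly from a Boltzmann distribution at inverse temperature β_eff and that β_eff does not change when the couplings are rescaled. Since the effective temperature is problem-dependent, this holds only approximately. The notation E_A(s) for the energy of state s under couplings A is introduced here for illustration. Because the energy is linear in the couplings:

```latex
\begin{align}
E_J(s) &= E_{A/\beta_{\text{eff}}}(s) = \frac{E_A(s)}{\beta_{\text{eff}}}, \\
p(s) &\propto e^{-\beta_{\text{eff}} E_J(s)} = e^{-E_A(s)},
\end{align}
```

which is the Boltzmann distribution with β=1 for the RBM couplings A.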

The algorithm proceeds as follows:

  1. At each step, take RBM couplings A=(W,b,c). Set couplings on D-Wave to J1=A/β0, with β0 estimated at the previous step (on the first step, we need to take a guess).

  2. Take one set of n samples. We bin the samples into 2n bins according to their energy, obtaining probability density estimates n1/n.

  3. We want a second set that will provide different “enough” samples for distinguishability. Following [Benedetti2016], we take J2=xJ1, with x=1+1/(β0σ), where σ is the standard deviation of the first sample. (Footnote 1: [Benedetti2016] suggests transitioning to a - sign in the expression for x once the RBM couplings get large enough. We found that even at late stages, this would result in values of x that are close to zero.)

  4. Take a second set of samples and use the same bins as in step 2 to obtain probability density estimates n2/n.

  5. Denoting the Ising energy of each state with couplings A as E, note that

    \begin{align}
    \frac{n_2}{n_1} &= \frac{e^{-x \beta_{\text{eff}} E/\beta_0}}{Z_2} \, \frac{Z_1}{e^{-\beta_{\text{eff}} E/\beta_0}} \\
    \Rightarrow \log\frac{n_2}{n_1} &= \log\frac{Z_1}{Z_2} + (1-x) \frac{\beta_{\text{eff}}}{\beta_0} E.
    \end{align}

    With this in mind, we can extract an estimate of β_eff from the slope of the linear regression between log(n2/n1) and the bin energies, as exemplified in Fig. 2. In order to reduce noise caused by bins with a small number of samples, we limit the regression to bins with at least five samples in both draws.
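The regression in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function names are hypothetical, the binning is assumed to have been done already (the inputs are per-bin energies and per-bin counts from the two draws), and an ordinary least-squares fit is used. Since both draws contain n samples, the ratio of counts equals the ratio of the density estimates.

```python
import math

def second_scale(beta0, sigma):
    """Scale factor x = 1 + 1/(beta0 * sigma) used for the second draw (step 3)."""
    return 1.0 + 1.0 / (beta0 * sigma)

def estimate_beta_eff(bin_energies, counts1, counts2, beta0, x, min_count=5):
    """Estimate beta_eff from the slope of log(n2/n1) versus bin energy.

    Bins with fewer than min_count samples in either draw are ignored.
    """
    pts = [(e, math.log(n2 / n1))
           for e, n1, n2 in zip(bin_energies, counts1, counts2)
           if n1 >= min_count and n2 >= min_count]
    if len(pts) < 2:
        raise ValueError("need at least two well-populated bins")
    mean_e = sum(e for e, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((e - mean_e) ** 2 for e, _ in pts)
    if sxx == 0.0:
        raise ValueError("bin energies must not all be equal")
    sxy = sum((e - mean_e) * (y - mean_y) for e, y in pts)
    slope = sxy / sxx
    # slope = (1 - x) * beta_eff / beta0, so invert for beta_eff.
    return slope * beta0 / (1.0 - x)
```

Since x > 1, the fitted slope is negative whenever β_eff is positive. A slope that fluctuates towards zero or above therefore yields a small or negative estimate.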

We can see the results of this temperature estimation procedure throughout training in Fig. 3.

Figure 2: Linear regression of log(n2/n1) against the bin energies, as described in step 5 of the temperature estimation algorithm, for an example step in training a restricted Boltzmann machine.
Figure 3: Temperature estimates over 70 epochs of training (or 8750 training steps) for a 48×48 RBM, plotted using a rolling average over the last 50 steps. The temperature estimates vary significantly over the first stages of training, and later stabilize. This suggests that the temperature could be estimated less often than at every training step.

We found some pitfalls in this procedure. Namely, as the couplings of the RBM and therefore the magnitude of the energies involved grow, the distribution of states becomes more and more skewed towards the lower energy states. This is a desirable outcome of training an RBM. However, this leaves the higher-energy bins with a small number of samples, causing large variance in the estimates of log(n2/n1). In all our training runs, this leads to a step where log(n2/n1) happens to fluctuate to a larger value than usual for some of the higher-energy bins. Since the regression slope is negative, this flattens the fit and causes β_eff to be underestimated at that step. The effect compounds over a few training steps, often leading to negative estimates of β_eff and a crash of the algorithm.
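A rough estimate shows how quickly the noise grows as a bin empties. It assumes that the two bin counts are independent and approximately Poisson-distributed, which is an assumption made here for illustration and is not taken from the analysis above. A first-order (delta-method) expansion then gives:

```latex
\begin{align}
\mathrm{Var}\!\left[\log\frac{n_2}{n_1}\right] &\approx \frac{1}{n_1} + \frac{1}{n_2}.
\end{align}
```

With n1 = n2 = 5, the minimum allowed in the regression, the standard deviation of log(n2/n1) is about 0.63. With n1 = n2 = 500 it is about 0.063.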

Potential solutions include:

  • some regularization to keep the weights from growing. This successfully kept the temperature estimation routine from crashing, but at the cost of impairing the classifier performance of our RBM. This is to be expected, as a well-trained RBM should strongly separate the energies of different states.

  • only estimating β during the initial stages of training. This can be a good solution, since β does not seem to change by a large amount during training, as we can see in Fig. 3.

Even without a temperature estimation routine, weights that grow too large are a general problem for this algorithm on a quantum annealer. If weights grow above the maximum coupling that can be implemented on the D-Wave, we must rescale the weights in order to set coupling constants on the D-Wave. However, discretization of the coupling constants means that if one weight is very large, subtle variations between much smaller weights are lost. Another possible solution would be to turn off weight rescaling, but not let couplings grow beyond what is physically implementable on the D-Wave. Conceptually, this is equivalent to allowing the RBM to learn chains of logical qubits that are strongly coupled. Either of these solutions can impair classifier performance, because an RBM might simply need very large weights, or a large ratio between some weights, to reproduce the probability distribution of the data.
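The loss of small weights under rescaling can be illustrated with a toy model. Everything in the sketch below is hypothetical: the maximum coupling of 1.0, the uniform discretization step, and the function names are chosen only for illustration and do not describe the actual hardware interface. Weights are passed as a flat list.

```python
def rescale_to_range(weights, max_coupling=1.0):
    """Scale all weights by a common factor so the largest fits within max_coupling."""
    largest = max(abs(w) for w in weights)
    if largest <= max_coupling:
        return list(weights)
    scale = max_coupling / largest
    return [w * scale for w in weights]

def clip_to_range(weights, max_coupling=1.0):
    """Alternative: leave weights unscaled but cap them at the implementable range."""
    return [max(-max_coupling, min(max_coupling, w)) for w in weights]

def quantize(weights, step):
    """Round each weight to the nearest multiple of step (toy model of discretization)."""
    return [round(w / step) * step for w in weights]

weights = [50.0, 0.01, 0.02]
rescaled = quantize(rescale_to_range(weights), step=1.0 / 128)
clipped = clip_to_range(weights)
```

In this toy example, rescaling followed by discretization maps both small weights to the same value, so the difference between them is lost. Clipping keeps the small weights intact but cannot represent the large one.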

Finally, we have tested whether temperature estimation allows us to take fewer Gibbs steps to reach a Boltzmann distribution. The results of this test are shown in Fig. 1. Once again, the results do not always show a decisive advantage in the number of post-processing steps needed when using temperature estimation.

1.3 Noise and RBMs

D-Wave has recently released a low-noise version of its 2000Q quantum computer, which is claimed to enhance tunneling rates by a factor of 7.4 [qubits_pres]. It is claimed that this leads to a larger diversity of states returned by the machine, as well as a larger proportion of lower-energy states. In this section, we test whether these lower-noise properties also help us obtain a more Boltzmann-like distribution from the D-Wave output. To do this, we train two 12×12 RBMs using the temperature estimation techniques described in Sec. 1.2. One of the RBMs was trained using the original 2000Q, and the other using the low-noise machine. Every 20 training steps, we compare the distribution obtained using the D-Wave machine with a Boltzmann distribution obtained from analytically calculated energies for the current RBM couplings. To compare the distributions, we use the Kolmogorov-Smirnov statistic, which should be close to zero for samples drawn from the same distribution. We compare the KS values as a function of the RBM weight distribution in each machine.
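For an RBM of this size, a reference Boltzmann distribution can be computed exactly. The following minimal sketch shows one way to do so, assuming {0,1} units and the energy convention E(v,h) = -v·W·h - b·v - c·h. It is not necessarily the procedure used here. The hidden units are summed out analytically through the free energy, and the visible states are enumerated (2^12 states for 12 visible units). The function names are hypothetical.

```python
import itertools
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def free_energy(v, W, b, c):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + sum_i v_i W_ij)), hidden units summed out."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(softplus(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
                      for j in range(len(c)))
    return -visible_term - hidden_term

def visible_boltzmann(W, b, c):
    """Exact beta=1 Boltzmann probabilities over all visible states of a small RBM."""
    states = list(itertools.product((0, 1), repeat=len(b)))
    f = [free_energy(s, W, b, c) for s in states]
    f_min = min(f)  # subtract the minimum to avoid overflow in exp
    weights = [math.exp(-(fi - f_min)) for fi in f]
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}
```

The resulting probabilities can then be compared with the empirical frequencies of the sampled visible states, for example through the KS statistic.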

In Fig. 4, we show the mean of the KS statistic binned as a function of the mean and maximum RBM coupling. We see no advantage from using the lower-noise 2000Q in how Boltzmann-like the returned distributions are. For both machines, samples returned are not far from Boltzmann distributions (with KS statistics below 0.1) for low RBM weights, but the distributions diverge from Boltzmann as the weights grow larger.

  • KS statistic as a function of maximum RBM coupling after scaling by the effective β.

  • KS statistic as a function of mean RBM coupling after scaling by the effective β.

Figure 4: We compare the KS statistics between Boltzmann samples and D-Wave samples for two 12×12 RBMs, one trained on the original 2000Q and one trained on the low-noise 2000Q. We see no advantage in finding a Boltzmann distribution from using the low-noise 2000Q. Note that each machine was tested using couplings of an RBM trained on that same machine, so the range of tested couplings differs slightly.

We also try a test similar to Fig. 1, initializing the Gibbs steps with samples from either 2000Q machine. The results are shown in Fig. 5.

Figure 5: Couplings are obtained by training a 48×48 RBM on the 2000Q with 10 Gibbs steps. We see no significant difference between using the 2000Q and the low-noise 2000Q.