QubitHD: A Stochastic Acceleration Method for HD Computing-Based Machine Learning

  • 2019-12-02 13:12:54
  • Samuel Bosch, Alexander Sanchez de la Cerda, Mohsen Imani, Tajana Simunic Rosing, Giovanni De Micheli
  • 0

Abstract

Machine Learning algorithms based on Brain-inspired Hyperdimensional (HD)computing imitate cognition by exploiting statistical properties ofhigh-dimensional vector spaces. It is a promising solution for achieving highenergy-efficiency in different machine learning tasks, such as classification,semi-supervised learning and clustering. A weakness of existing HDcomputing-based ML algorithms is the fact that they have to be binarized forachieving very high energy-efficiency. At the same time, binarized models reachlower classification accuracies. To solve the problem of the trade-off betweenenergy-efficiency and classification accuracy, we propose the QubitHDalgorithm. It stochastically binarizes HD-based algorithms, while maintainingcomparable classification accuracies to their non-binarized counterparts. TheFPGA implementation of QubitHD provides a 65% improvement in terms ofenergy-efficiency, and a 95% improvement in terms of the training time, ascompared to state-of-the-art HD-based ML algorithms. It also outperformsstate-of-the-art low-cost classifiers (like Binarized Neural Networks) in termsof speed and energy-efficiency by an order of magnitude during training andinference.

 

Quick Read (beta)

QubitHD – A Stochastic Acceleration Method
for HD Computing-Based Machine Learning

Samuel Bosch1, Alexander Sanchez de la Cerda1, Mohsen Imani2, Tajana Šimunić Rosing2 and Giovanni De Micheli1
1 École polytechnique fédérale de Lausanne (EPFL) - Integrated Systems Laboratory, Lausanne 1015, Switzerland 2 University of California San Diego - Department of Computer Science and Engineering, La Jolla, CA 92093, USA
Abstract

Machine Learning algorithms based on Brain-inspired Hyperdimensional (HD) computing imitate cognition by exploiting statistical properties of high-dimensional vector spaces. It is a promising solution for achieving high energy-efficiency in different machine learning tasks, such as classification, semi-supervised learning and clustering. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized for achieving very high energy-efficiency. At the same time, binarized models reach lower classification accuracies. To solve the problem of the trade-off between energy-efficiency and classification accuracy, we propose the QubitHD algorithm. It stochastically binarizes HD-based algorithms, while maintaining comparable classification accuracies to their non-binarized counterparts. The FPGA implementation of QubitHD provides a 65% improvement in terms of energy-efficiency, and a 95% improvement in terms of the training time, as compared to state-of-the-art HD-based ML algorithms. It also outperforms state-of-the-art low-cost classifiers (like Binarized Neural Networks) in terms of speed and energy-efficiency by an order of magnitude during training and inference.

Brain-inspired computing, Hyperdimensional computing, Energy-efficiency, FPGA Acceleration, Quantum measurements

I Introduction

With the rise of data science and machine learning, as well as the Internet of Things (IoT), the amount of produced data on a daily basis has increased to a level we are barely able to handle [gubbi2013internet]. As the amount of data that needs to be processed is often significantly larger than small-scale and battery-powered devices can handle, so many of these devices are forced to connect to the internet in order to process the data in the cloud. Deep Neural Networks (DNN) are used for complex classification tasks, such as image classification [deng2009imagenet], sentimental analysis (text analysis) or even entertainment [meltem]. However, the complexity of DNN algorithms makes them impractical for real-world applications, such as classification tasks on battery-powered devices. Engineers often face a trade-off, between energy-efficiency and the achieved classification accuracy. Therefore, we need to create light-weight classifiers, which can perform inference on small-scale operating devices.
Brain-inspired Hyperdimensional (HD) computing [kanerva2009hyperdimensional] has been proposed as a light-weight learning algorithm and methodology. The principles governing HD computing are based on the fact that the brain computes with patterns of neural activities which are not directly associated with numbers [kanerva2009hyperdimensional]. Machine learning algorithms based on Brain-inspired HD computing imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. The first step in HD computing is to map each data point into a high-dimensional space (e.g., 10,000 dimensions). During training, HD computing linearly combines the encoded hypervectors to create a hypervector representing each class. During the inference, classification is done by calculating the cosine similarity between the encoded query hypervector and all class hypervectors. The algorithm then predicts the class with the highest similarity score. In case of multiple classes with high similarity, the algorithm is likewise suited to express the confidence in the correctness of a prediction.

Many publications on Brain-inspired HD computing argue that for most practical applications, HD computing has to be trained and tested using floating point, or at least integer values [morris2019comphd, imani2019sparsehd]. Binarized HD computing models provided low classification accuracies. Often too low for practical applications. A recently published algorithm, called QuantHD [QuantHD], revealed the existence of a method to significantly improving the classification accuracies of binarized and ternarized models. Nevertheless, there still exists a large gap between the classification accuracy of non-binarized and binarized HD computing classifiers. Also, such methods increase the required training time and are unstable as they tend to get stuck in local minima during training. In this paper, we propose a new method which can both reduce this classification accuracy gap by between a third and a half whilst, simultaneously, improving energy efficiency during training by 60% on average. It also makes the training more stable by introducing randomness. We call this technique QubitHD, as it is based on the principle of information being stored in a quantum bit (Qubit) before its measurement. The floating point values represent the quantum state, while the binarized values represent the quantum state after a measurement had been performed.

The main contributions of the paper are the following:

  • We decreased the gap in classification accuracy between binarized and non-binarized state-of-the-art HD computing-based ML algorithms by 38.8% on average.

  • We decrease the convergence time in the range of 30-50%. Introducing randomness in the algorithms prevents it from getting stuck in small, local minima, and incites the algorithm to quickly move towards the optimal value. The reason why the authors of [QuantHD] had problems with slow convergence was precisely this: lack of randomness.

  • We stop the algorithm from getting stuck in local minima during training, by introducing randomness in the convergence process.

  • QubitHD performs similarity check by calculating the Hamming distance between the hypervectors instead of calculating the costly cosine similarity.

  • We implemented the algorithm on GPU and FPGA, which accelerates training and inference. We also evaluated several classification problems, including human activity, face and text recognition. When looking at energy efficiency and speed, the FPGA implementation of QubitHD provides on average a 56× and 8× energy-efficiency improvement and speedup during training, as compared to state-of-the-art HD computing algorithms [ISLPED]. For comparison purposes, the authors of [QuantHD] only achieve 34.1× and 4.1× energy efficiency improvement and speedup during the training against the same state-of-the-art HD computing algorithms. When comparing QubitHD with multi-layer perceptron (MLP) and binarized neural network (BNN) classifiers, we observe that QubitHD can provide 56× and 52× faster computing in training and testing respectively, while providing similar classification accuracies (see Table III).

Fig. 1: QubitHD classification accuracies during different retraining iterations compared to QuantHD. It is evident, that QubitHD consistently outperforms in terms of classification-accuracy. It does so, without utilizing additional hardware resources.

II Hyperdimensional Computing

The applications of brain-inspired HD computing in machine learning are divers. In this publication, we only focus on supervised classification tasks, but a recent publication indicated that HD computing-based ML algorithms can be applied to clustering and semi-supervised learning as well [SemiHD]. The basis of QubitHD is described in Figure 2. The core difference to QuantHD is the binarization step that is discussed in Section III. The non-binarized algorithm with retraining consists of the following steps:

II-A Encoding

Fig. 2: Overview of a simple HD computing-based machine learning algorithm for classification on the left.
The encoding scheme used in this publication is illustrated on the right.

The training dataset is pre-processed by converting all datapoints into the very high-dimensional vectors (hypervectors). We used hypervectors of length D=10,000 in this paper, as it is the standard baseline for all HD computing-based machine learning algorithms. Like explained in [QuantHD], the original data is assumed to have n features: 𝐟={f1,fn}. The objective is encoding each feature that corresponds to each datapoint into the hypervector of dimension D (D=10,000 in this paper). Each feature vector ”memorizes” the value and position of the relevant feature. In order to take into account the position of each feature, we use a set of randomly generated base hypervectors {𝐁i,𝐁2,,𝐁n}, where n is the total number of features in each data point (𝐁i{-1,1}D). Since the base-hypervectors are uniformly generated at random (with equal probability for -1 and 1), they are all mutually orthogonal [kanerva2000random]. The cosine product between hypervectors ranges between cos(H1,H2)[-1,1]. The expected cosine product of independent and randomly generated base-hypervectors is 𝔼[cos(𝐁i,𝐁j)]=0, whereas the variance is V[cos(𝐁i,𝐁j)]=1D0 for Dn (random walk) for ij. Thereby, the hypervectors are almost orthogonal. This is true only when the number of randomly generated base-hypervectors is significantly smaller than the dimension of the entire vectors space D. For comparison of the binarized hypervectors, we use the Hamming distance. Therefore:

𝔼[δ(𝐁i,𝐁j)]=D/2   (forij).

Here δ is the Hamming distance similarity between the two binarized base-hypervectors.

We also distinguish the actual value of each feature with different hypervectors. One way of doing so is the following. We find the minimum and maximum value of the feature across the entire dataset, and generate two distinct (and random) base hypervectors representing those two values. Every feature which has a value in-between the minimum and maximum, will be associated with a hypervector proportionate to the two min and max base-hypervectors. Feature values that are close to the minimum will be highly correlated with the base-hypervector corresponding to the minimum value (cos(𝐋i,𝐋min)1, cos(𝐋i,𝐋max)0). Feature values just in the middle between the minimum and the maximum will be 50% correlated with both base-hypervectors (cos(𝐋i,𝐋min)12, cos(𝐋i,𝐋max)12). The equivalent principle applies to feature values close to the maximum value (cos(𝐋i,𝐋min)0, cos(𝐋i,𝐋max)1).

Finally, we add all of the results for all the features:

𝐇=𝐁1𝐋¯1+𝐁2𝐋¯2++𝐁n𝐋¯n (1)

where denotes the XOR operation.

II-B Initial training

The first training round is performed by summing up all hypervectors pertaining to the same class. That is, we abstract all hypervectors with the same labels. This method is called one-shot learning and is, at the moment, the most widespread way of using HD computing in machine learning [ISLPED, Mitra_HD_1, Mitra_HD_2, rahimi2016hyperdimensional, ISSCC]. We now have one matrix C of size m×D (Cm×D), where m is the number of existing classes and D is the length of the hypervectors.

II-C Retraining

The classification accuracy of the current model during the inference is low [QuantHD]. For this reason, we have to do retraining. As displayed in Figure 3, retraining is done in the following way. We go through the entire dataset of encoded datapoints and test them to ascertain if they are correctly classified by the current model C. For every misclassified datapoint, we have to make additional improvements to the model. Let us assume that the correct label of a datapoint is k, but it was incorrectly classified as l. We now add the erroneously classified hypervector to its corresponding row Ck. (to make them more similar). We also subtract the incorrectly classified hypervector from the row corresponding to the inaccurately predicted class Cl (to make them more distinct). In order to decrease the convergence of time and noise, it is common practice to introduce a learning rate α in this step as illustrated in Figure 3a [imani2019adapthd]. This process is repeated several times.

II-D Inference

During the inference, we predict the class to which the datapoint belongs. This datapoint is encoded as described in II-A, and then compared to all the class hypervectors. The algorithm then predicts the class with the largest cosine similarity.

II-E Binarization:

So far, we described in the algorithm that the trained model has non-binarized elements with floating point values. Many existing HD computing methods [rahimi2017high, imani2017exploring, rahimi2017hyperdimensional2] binarize the class hypervectors to eliminate costly cosine operations used for the associative search (Cm×DCbinarized{-1,1}m×D). Binary hypervectors do not provide sufficiently high classification accuracies on many (if not most) real-world applications. The usual way of binarizing class hypervectors is making all positive values equal to +1 and negative values equal to -1. This method suffers from significant loss of information about the trained model. To the best of our knowledge, [QuantHD] was the first publication demonstrating a method of achieving high classification accuracy while using a binarized (or quantized) HD model. Instead of just ”blindly” binarizing the class hypervectors after every re-training iteration, QuantHD trains the model in a way that is optimized for binarized hypervectors. That is, during every single retraining iteration, they create an additional binarized model. Doing so requires no additional computational power, as the binary representation of numbers in usual computer architectures reserves the first bit for the sign (0 stands for positive, 1 for negative). The QuantHD algorithm retrains on the predictions of the binarized model, while updating the non-binarized model as described in subsection II-C and Figure 3. The binarized model achieves, after several iterations, very high classification accuracies. They are significantly higher than they would be without binary-optimized retraining.

III Stochastic binarization

Motivation: In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of a quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system, solely entailing two possible states. An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. In a classical system, the bit would have to be either in one state or the other (either logical 0 or 1). A quantum system can be in any superposition of those two states. When measuring the z-component of the spin of a particle, which can have any value in the range sz[-1,1], the quantum state is going to collapse in either the sz+1 or the sz-1 state, with a probability directly related to sz before the measurement.

This leaves us with a very interesting property: there exists a ”true” z-component of the state, which is equal to the projection of the quantum state on the z-axis. This gave us the following idea: why don’t we use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words 𝔼[Cbin]=Cnon-bin. The QuantHD algorithm uses the following (very trivial) binarization function:

bin(x)={1,if x0-1,otherwise (2)

We propose using the following method instead:

qbin(x)={1,if x>b1,if \absxb, then with probability 12+x2b-1,if \absxb, otherwise -1,if x<-b (3)

where b is the cutoff value defined as a fixed fraction of the standard deviation σ of the data. It is discussed in greater detail in subsection IV-B. The advantage of doing so is the fact that the expected value of qbin(x) for x[-b,b] is proportional to x:

𝔼[qbin(x)]=(+1)(12+x2b)+(-1)(12-x2b)=xb

IV Proposed QubitHD algorithm

Motivation

The QuantHD algorithm still leaves us with a significant gap between the maximum classification accuracy of the floating-point model and the binarized one. Also, the QuantHD retraining method described in [QuantHD] tends to get ”stuck” in local minima. Further, their algorithm almost doubles the average convergence rate, which hereafter increases the energy consumption during training.
To summarize, here are the main problems with the QuantHD algorithm, which the QubitHD algorithm can either solve or improve:

  1. 1.

    There still exists a significant gap between the binarized model and non-binarized model accuracy

  2. 2.

    The algorithm can sometimes get stuck in local minima, which makes it unrealiable.

  3. 3.

    The convergence time of QuantHD algorithm is almost twice as slow as compared to the other state-of-the-art HD computing algorithms with retraining

IV-A Overview of the QubitHD algorithm

Fig. 3: (a) QubitHD framework overview (including the initial training and the retraining of the non-binarized model based on the binarized model). (b) Efficient inference from model

In this section, we will present the QubitHD algorithm. It enables efficient binarization of the HD model with particularly minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:

  1. 1.

    Encoding: see Subsection II-A and Figure 2b

  2. 2.

    Initial training: see Subsection II-B and Figure 3a

  3. 3.

    Stochastic binarization which binarizes the model by using Equation (3), instead of Equation (2). More details can be found in Subsection (IV-B).

  4. 4.

    Binarized retraining The retraining steps compensate for the classification accuracy loss during the previous step. Binarization through Equation (3) is performed on the model after every single retraining iteration. This ensures a fast convergence towards a consistently high classification accuracy of the ML model.

IV-B Framework of the QubitHD algorithm

In this section, we present the QubitHD algorithm. It enables efficient binarization of the HD model with minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:

1) Encoding: This part is described in detail in Subsection II-A and Figure 2b

2) Initial training: QubitHD trains the class hypervectors by summing all the encoded datapoints corresponding to the same class as seen in Figure 3a

It is evident from Figure 3a that every accumulated hypervector represents one class. As explained in [QuantHD], in an application with k classes, the initially trained HD model contains k non-binarized hypervectors {𝐂1,,𝐂k}, where 𝐂iD (

1
).

3) Stochastic binarization: This part is the main change with respect to the QuantHD algorithm. A given class hypervector is created by summing together random hypervectors 𝐡 of the type 𝐡{-1,1}D. Every element Cij in a class hypervector 𝐂𝐢 (of class i) is the product of a ”random walk”. In other words, its distribution follows a binomial distribution with a probability mass function (pmf):

p(Cij=k)=(nk)pk(1-p)n-k=(nk)12n (4)

where p=12 (as we have equal probabilities for hj=-1 and hj=+1), n is the number of randomly summed hypervectors for class i, and k is a possible value Cij can take. Note that Cij{-n,,0,,n}.

The expected value 𝔼[Cij]=t=1n𝔼[hj(t)]=0, while the standard deviation is

𝔼[Cij2]-𝔼[Cij]2=t=1n𝔼[hj(t)2]+21t,ln𝔼[hj(t)hj(l)]=n

Assuming that the number of hypervectors corresponding to every class in the dataset is large enough, the normal distribution is a good approximation for modeling the binomial distribution:

𝒩(μ=0,σ=n)=12nπe-x22n

In previous publications [QuantHD], the way of binarizing a model C was described by Equation 2. We instead propose using Equation 3 shown in Figure 4. Implementing this change requires almost no additional resources (the random flips have to be performed only once per retraining round), but leads to significant improvements in terms of accuracy, reliability, speed and energy efficiency. The accuracy improvement is due to the fact that the expected value of this stochastically binarized model is equal to the non-binarized model. The reliability and speed improvement are due to the fact that the model quickly ”jumps” out of local minima, as opposed to getting stuck for several iterations. The energy consumption during training depends on the number of retraining iterations, which are significantly reduced.

Fig. 4: Eq. 2 from QuantHD    vs.       Eq. 3 from QubitHD       
Deterministic binarization vs semi-stochastic binarization.

Just as we have demonstrated, the encoded data we are processing is (approximately) normally distributed. In order to be able to use Equation 3 for the binarization process, we have to define a ”cutoff” value b. That is, everything above +b will become +1, and everything below -b will become -1. Only values between -b and +b will be better approximated through Equation 3. In most cases, b has to be smaller than the standard deviation of the data distribution σ. If we would use bσ, our model would become almost completely random, as most of the values are contained in the [-σ,+σ] interval (68% to be precise).

The reason why the qbin(x) binarization works better than the bin(x) lies in the fact that, taking into account the expected value of the binarized model, it is equal to actual values in the non-binarized model, with the exception of values below -b and above +b. Empirically, we also noticed that the randomness of qbin(x) prevents the algorithm from getting stuck in local minima during training, which reduces the convergence time by 50% on average.

What if we use the encoding method that allows the hypervectors of the encoded datapoints to be floating-point values, as opposed to {-1,+1} values only? In this case, (due to randomness) it is to be assumed that we are sampling from an approximately uniform distribution. This assumption is only reasonable, if the encoding scheme is also based on random base hypervectors Bi{-1,+1}. The resulting probability distribution won’t be a discrete binomial distribution, but rather a slightly modified version of the continuous Irwin-Hall distribution [hall1927distribution] given by the probability density function (pdf):

f(x;n)=12(n-1)!k=0x+n2(-)k(nk)(x+n2-k)n-1 (5)

where n is the number of uniform intervals [-1,1] summed together. Just as with the binomial distribution, the Irwin-Hall distribution also converges towards a normal distribution for large n, with mean μ=0 and standard deviation σ=n3. It is possible to show that for a large n, this distribution converges towards the normal distribution. This was to be expected because of the central limit theorem. Therefore, even in the case of floating-point numbers in the encoding, we can use the same QubitHD technique.

V Possible FPGA Implementation

It is known that HD computing-based machine learning algorithms can be implemented in a wide range of different hardware platforms, such as CPUs, GPUs and FPGAs. As most of the training and all of the inference rely on bit-wise operations, it was proposed in [QuantHD] that FPGAs would be a suitable candidate for the efficient hardware acceleration. The same hardware can be used for implementing both, the QuantHD and the QubitHD algorithm. This is also one of the major advantages of QubitHD, as it doesn’t require costly hardware upgrades from previous models.

VI Evaluation

VI-A Experimental Setup

The training and inference of the algorithm were implemented and verified using Verilog and the code was synthesized on the Kintex-7 FPGA KC705 Evaluation Kit. The Vivado XPower tool has been used to estimate the device power. Additionally, for testing purposes, all parts of the QubitHD algorithm have been implemented on CPU. We also implemented the algorithm on an embedded device (Rasberry Pi 3) with an ARM Cortex A54 CPU. For the purposes of making an accurate and fair comparison, we use the following FPGA-implemented algorithms as baselines:

  • The QuantHD algorithm from [QuantHD], on which QubitHD is based

  • Other state-of-the-art HD computing-based machine learning algorithms [ISLPED, imani2019adapthd, morris2019comphd]

  • A multi level perceptron (MLP) [sharma2016high] (see Table III)

  • A binary neural network (BNN) [umuroglu2017finn] (see Table III)

With a view to show the advantage of the QubitHD and the previous [QuantHD] algorithm, we used the datasets summarized in Table I. The datasets range from small datasets like UCIHAR and ISOLET (frequently used in IoT devices, for which QubitHD is specially created), to the larger datasets like face recognition.

TABLE I: Datasets (n: number of features, k: number of classes).
n K Data Size Train Size Test Size Description
ISOLET 617 26 19MB 6,238 1,559 Voice Recognition [Isolet]
UCIHAR 561 12 10MB 6,213 1,554 Activity recognition(Mobile)[anguita2012human]
MNIST 784 10 220MB 60,000 10,000 Handwritten digits [lecun2010mnist]
FACE 608 2 1.3GB 522,441 2,494 Face recognition[kim2017orchard]
EXTRA 225 4 140MB 146,869 16,343 Phone position recognition[vaizman2017recognizing]

VI-B Accuracy

The evaluation of the baseline HD model provides high-classification accuracy when using non-binarized hypervectors for classification. The problem, however, is that retraining and inference with a non-binary class hypervector is very costly, as it requires calculating cosine similarities11 1 The cosine similarity of vectors 𝐚 and 𝐛 is calculated as cos(𝐚,𝐛)=𝐚𝐛||𝐚||||𝐛||, where is the dot product and |||| is the absolute value of the vector.. That is, for k-bit integers or floating point numbers 𝒪(Dk2) basic operations need to be performed through every step. These are costly and impractical on small-scale and battery-powered devices.

Similarly, during inference, the associative search between query and trained model requires the calculation of the costly cosine similarities. To address this issue, many HD computing-based machine learning algorithms binarize their models [ISLPED]. That way, the cosine similarity is replaced by a simple Hamming distance similarity check. The key problem with this approach is that it leads to a significant decrease in classification accuracy, as shown in Table II.

The authors of [QuantHD] already showed the existence of a partial solution to this problem, which involves simultaneously retraining the non-binarized model, while updating the binarized model. We already listed the problems with this model in subsection IV. What especially motivated us to create the stable and more reliable QubitHD algorithm, is the fact that the QuantHD algorithm’s retraining is unstable and unreliable. After extensively testing the QubitHD algorithm, we conclude that it on average closes the gap of classification accuracy by 38.8% as compared to the baseline HD computing-based machine learning algorithms in [ISLPED] using the QuantHD framework (See Table II).

Additionally, we observe that the accuracies of the QubitHD algorithm, using a binary model, are 1.2% and 60% higher than the classification accuracies of the baseline HD computing-based algorithm using non-quantized and binary respectively.

TABLE II: Comparison of QubitHD classification accuracy with the state-of-the-art HD computing.

Baseline HD QuantHD QubitHD Non-Quantized Binary Non-Quantized Binary Binary with randomized flip ISOLET 91.1% 88.1% 95.8% 94.6% 95.3% UCIHAR 93.8% 77.4% 95.9% 93.0% 94.1% MNIST 88.1% 32.70% 91.2% 87.1% 88.3% FACE 95.9% 68.4% 96.2% 94.6% 95.4% Mean 92.23% 65.9% 94.78% 92.33% 93.28

Fig. 5: Convergence of QuantHD vs. QubitHD. Here we can see the maximum accuracies achievable by a given number of iterations. It is clear that QubitHD converges much faster than QuantHD, as a result of the stochastic binarization process. Intuitively, any binarized model cannot outperform the non-binarized model, hence the theoretical limit of the accuracy is fully closing the gap.

VI-C Training Efficiency

The training efficiency of HD-based algorithms is characterised by an initial training of the model and subsequent retraining.

  • Algorithms in this type all consume the same energy during the generation of the initial training model

  • The significant cost is the retraining: compared to the non-binarized model, QuantHD uses less operations when calculating the hypervector similarities (step 4 in Figure 3).

  • No complex cosine similarity has to be computed as calculating the Hamming distance is sufficient to determine whether there was a correct classification or not

  • The improvement of QubitHD lies in the faster convergence to a high classification accuracy, which is 30-50% faster than in QuantHD and also decreases the energy consumption after the initial training proportionally.

  • The QubitHD modification has dual benefit. It makes it possible to save energy and time during training, whilst achieving the same or better classification accuracies during testing depending on whether the goal is rapid convergence or high classification accuracy.

Fig. 6: Energy consumption and execution time of initial training, QuantHD re-training and QubitHD re-training on a FPGA

VI-D Inference Efficiency

Compared to QuantHD, there is no gain or loss in the time execution or energy efficiency, since the models behave identically once they are trained. So we report the same 44 × energy efficiency improvement and 5 × speedup as compared to the non-binarized HD algorithm.

VI-E QubitHD comparison with MLP and BNN

TABLE III: Comparison of QubitHD with MLP and BNN in terms of accuracy, efficiency, and model size. [QuantHD]

MLP/BNN Topologies Accuracy CPU Training (s) FPGA Inference (μs) Model Size MLP BNN QubitHD MLP BNN QubitHD MLP BNN QubitHD MLP BNN QubitHD ISOLET 617-512-256-26 95.8% 96.1% 95.3% 2.08 17.69 0.19 27.39 5.24 0.28 1.81MB 56.7KB 65.0KB UCIHAR 561-512-256-12 97.3% 95.9% 94.1% 1.04 8.32 0.08 21.43 5.18 0.27 1.68MB 52.7KB 30.0KB FACE 608-512-256-2 96.1% 96.1% 95.4% 0.56 4.30 0.03 17.68 5.11 0.24 1.77MB 55.3KB 5.0KB

QubitHD is a classifier intended to run on low powered devices, specifically with the goal of the low energy consumption and fast and efficient execution in mind. As such we set out to compare QubitHD, not only to Quant HD, but also other non-HD light-weight classifiers. In our analysis we compared QubitHD accuracy and efficiency with the state-of-the-art light-weight classifiers, including Multi-Layer Perceptron (MLP) and Binarized Neural Network (BNN). For MLP and BNN, we aimed at using the same metric as employed in [umuroglu2017finn] with the small modification in input and output layers in order to run different applications. The results of this, presented in Table III, indicate that QubitHD, while having similar classification accuracies to very light-weight classifier BNNs and MLPs, drastically reduces CPU usage during training and execution time during the inference. In particular, compared to MLPs QubitHD uses 12 × less CPU during training and is 84 × faster during the inference on FPGAs. In comparison with BNNs, QubitHD uses a factor of 101 × less CPU time during training and is still 20 × faster during the inference.

VII Conclusion

Machine learning algorithms, based on the Brain-inspired Hyperdimensional (HD) computing, imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. They are a promising solution for energy-efficient classification tasks. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized for achieving very high energy-efficiency. At the same time, binarized models reach lower classification accuracies. In order to solve the problem of the trade-off between the energy-efficiency and classification accuracy, we propose the QubitHD algorithm. With QubitHD, it is possible to use binarized HD computing-based ML algorithms, which provide virtually the same classification accuracies as their non-binarized counterparts. The algorithm is inspired by stochastic quantum state measurement techniques. The improvement of QubitHD is a duality and is reflected in the quicker convergence and the higher and more stable classification accuracy achieved, as compared to QuantHD.

Our main contributions are:

  1. 1.

    The FPGA implementation of QubitHD provides on average a 65% improvement in terms of energy-efficiency, and a 95% improvement in terms of training time, as compared to the most recent state-of-the-art HD computing-based machine learning algorithm QuantHD [QuantHD]

  2. 2.

    When compared with state-of-the-art low-cost classifiers like Binarized Neural Networks (BNN) and Multi-Layer Perceptrons (MLP), QubitHD offers a similar classification accuracy whilst reducing training time by 56× and allows for 52× faster inference when testing.

  3. 3.

    QubitHD decreases the classification accuracy gap between state-of-the-art binarized and non-binarized HD models by almost half.

  4. 4.

    QubitHD converges on average 40% faster during training, thus significantly decreasing the energy consumption.

References

Appendix A Appendix A: Quantum bit (Qubit) measurements

Appendix A: Quantum bit (Qubit) measurements:
In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of a quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system, solely entailing two possible states (usually corresponding to two distinct energy levels). An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. Another example could be an atom which can be in the ground state (with low energy) and in the excited state (with a higher energy level). In a classical system, the bit would have to be either in one state or the other (either logical 0 or 1). A quantum system can be in an superposition of those two states. Let us define two possible states of the system as |0=(10) and |1=(01). An arbitrary quantum state |ψ can be in a superposition of these two states: |ψ=α0|0+α1|1=(α0α1), where \absα02+\absα12=1. For the sake of completeness, let us also define the counterpart of |ψ=(α0α1), which is ψ|=(α0*α1*). |ψ is called the ket-notation, while ψ| is called the bra-notation. As a quantum state is directly related to the probability of finding the system in a certain state after measurement, every quantum state has to be normalized. That is, \absψ2=\absα02+\absα12=1.

It might seem as if there were four degrees of freedom in |ψ=α0|0+α1|1, since there are two complex numbers with two degrees of freedom. However, one degree of freedom is removed by the normalization constraint \absα02+\absα12=1, and the other degree of freedom is removed by the fact that the arbitrary overall phase of the quantum state eiα doesn’t matter, as it has no influence on any physical observables in the one-qubit case. Hence, we can represent the state of the quantum bit as a point on a sphere of radius r=1. This is the so called Bloch sphere, displayed in the Figure (7).

Fig. 7: Bloch sphere representation of a quantum bit (qubit) state |ψ. The projection on the z-axis determines the probability of measuring a |0 or a |1 state.

For simplicity, we are going to introduce new coordinates for representing the quantum state:

α0=cosθ2  and  α1=eiϕsinθ2 (6)

Here θ and ϕ are the polar coordinates.

While an ordinary (classical) bit can store only one ”piece of information”, a quantum state is much more powerful, as it can be in any state with θ[0,π] and ϕ[0,2π]. The problem with quantum states is the fact that we cannot measure or observe which state they are in. Whenever we attempt to observe which state a qubit is in, it is going to collapse into a (classical) bit state. In the Bloch representation, whenever we attempt to measure the z-component of the qubit, it will either collapse into the |0+sz+1 state, or into the |1-sz-1 state. The probability of it collapsing into state |0 is p(0)=\abs0||ψ2=\abs0|(α0|0+α1|1)2=\absα02. Analogy, the probability of it collapsing into state |1 is p(1)=\abs1||ψ2=\abs1|(α0|0+α1|1)2=\absα12=1-p(0).

This leaves us with a very interesting property: there exists a ”true” z-component of the state, which is equal to the projection of the quantum state on the z-axis: cos(θ). This gave us the following idea: why don’t we use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words 𝔼[Cbin]=Cnon-bin. This is the reason why we use Eq. 3 for binarization, as opposed to Eq. 2.

It is similar to the property we observe when measuring a quantum state in the z direction. The probability of seeing one of the two possible measurement outcomes (-1 and +1) is proportional to the projection on the z-axis of the quantum state, which we cannot measure directly.