Machine Learning algorithms based on Brain-inspired Hyperdimensional (HD)computing imitate cognition by exploiting statistical properties ofhigh-dimensional vector spaces. It is a promising solution for achieving highenergy-efficiency in different machine learning tasks, such as classification,semi-supervised learning and clustering. A weakness of existing HDcomputing-based ML algorithms is the fact that they have to be binarized forachieving very high energy-efficiency. At the same time, binarized models reachlower classification accuracies. To solve the problem of the trade-off betweenenergy-efficiency and classification accuracy, we propose the QubitHDalgorithm. It stochastically binarizes HD-based algorithms, while maintainingcomparable classification accuracies to their non-binarized counterparts. TheFPGA implementation of QubitHD provides a 65% improvement in terms ofenergy-efficiency, and a 95% improvement in terms of the training time, ascompared to state-of-the-art HD-based ML algorithms. It also outperformsstate-of-the-art low-cost classifiers (like Binarized Neural Networks) in termsof speed and energy-efficiency by an order of magnitude during training andinference.
Quick Read (beta)
QubitHD – A Stochastic Acceleration Method
for HD Computing-Based Machine Learning
Machine Learning algorithms based on Brain-inspired Hyperdimensional (HD) computing imitate cognition by exploiting statistical properties of high-dimensional vector spaces. It is a promising solution for achieving high energy-efficiency in different machine learning tasks, such as classification, semi-supervised learning and clustering. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized for achieving very high energy-efficiency. At the same time, binarized models reach lower classification accuracies. To solve the problem of the trade-off between energy-efficiency and classification accuracy, we propose the QubitHD algorithm. It stochastically binarizes HD-based algorithms, while maintaining comparable classification accuracies to their non-binarized counterparts. The FPGA implementation of QubitHD provides a 65% improvement in terms of energy-efficiency, and a 95% improvement in terms of the training time, as compared to state-of-the-art HD-based ML algorithms. It also outperforms state-of-the-art low-cost classifiers (like Binarized Neural Networks) in terms of speed and energy-efficiency by an order of magnitude during training and inference.
With the rise of data science and machine learning, as well as the Internet of Things (IoT), the amount of produced data on a daily basis has increased to a level we are barely able to handle [gubbi2013internet]. As the amount of data that needs to be processed is often significantly larger than small-scale and battery-powered devices can handle, so many of these devices are forced to connect to the internet in order to process the data in the cloud. Deep Neural Networks (DNN) are used for complex classification tasks, such as image classification [deng2009imagenet], sentimental analysis (text analysis) or even entertainment [meltem]. However, the complexity of DNN algorithms makes them impractical for real-world applications, such as classification tasks on battery-powered devices. Engineers often face a trade-off, between energy-efficiency and the achieved classification accuracy. Therefore, we need to create light-weight classifiers, which can perform inference on small-scale operating devices.
Brain-inspired Hyperdimensional (HD) computing [kanerva2009hyperdimensional] has been proposed as a light-weight learning algorithm and methodology. The principles governing HD computing are based on the fact that the brain computes with patterns of neural activities which are not directly associated with numbers [kanerva2009hyperdimensional]. Machine learning algorithms based on Brain-inspired HD computing imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. The first step in HD computing is to map each data point into a high-dimensional space (e.g., dimensions). During training, HD computing linearly combines the encoded hypervectors to create a hypervector representing each class. During the inference, classification is done by calculating the cosine similarity between the encoded query hypervector and all class hypervectors. The algorithm then predicts the class with the highest similarity score. In case of multiple classes with high similarity, the algorithm is likewise suited to express the confidence in the correctness of a prediction.
Many publications on Brain-inspired HD computing argue that for most practical applications, HD computing has to be trained and tested using floating point, or at least integer values [morris2019comphd, imani2019sparsehd]. Binarized HD computing models provided low classification accuracies. Often too low for practical applications. A recently published algorithm, called QuantHD [QuantHD], revealed the existence of a method to significantly improving the classification accuracies of binarized and ternarized models. Nevertheless, there still exists a large gap between the classification accuracy of non-binarized and binarized HD computing classifiers. Also, such methods increase the required training time and are unstable as they tend to get stuck in local minima during training. In this paper, we propose a new method which can both reduce this classification accuracy gap by between a third and a half whilst, simultaneously, improving energy efficiency during training by 60% on average. It also makes the training more stable by introducing randomness. We call this technique QubitHD, as it is based on the principle of information being stored in a quantum bit (Qubit) before its measurement. The floating point values represent the quantum state, while the binarized values represent the quantum state after a measurement had been performed.
The main contributions of the paper are the following:
We decreased the gap in classification accuracy between binarized and non-binarized state-of-the-art HD computing-based ML algorithms by 38.8% on average.
We decrease the convergence time in the range of 30-50%. Introducing randomness in the algorithms prevents it from getting stuck in small, local minima, and incites the algorithm to quickly move towards the optimal value. The reason why the authors of [QuantHD] had problems with slow convergence was precisely this: lack of randomness.
We stop the algorithm from getting stuck in local minima during training, by introducing randomness in the convergence process.
QubitHD performs similarity check by calculating the Hamming distance between the hypervectors instead of calculating the costly cosine similarity.
We implemented the algorithm on GPU and FPGA, which accelerates training and inference. We also evaluated several classification problems, including human activity, face and text recognition. When looking at energy efficiency and speed, the FPGA implementation of QubitHD provides on average a 56 and 8 energy-efficiency improvement and speedup during training, as compared to state-of-the-art HD computing algorithms [ISLPED]. For comparison purposes, the authors of [QuantHD] only achieve 34.1 and 4.1 energy efficiency improvement and speedup during the training against the same state-of-the-art HD computing algorithms. When comparing QubitHD with multi-layer perceptron (MLP) and binarized neural network (BNN) classifiers, we observe that QubitHD can provide 56 and 52 faster computing in training and testing respectively, while providing similar classification accuracies (see Table III).
II Hyperdimensional Computing
The applications of brain-inspired HD computing in machine learning are divers. In this publication, we only focus on supervised classification tasks, but a recent publication indicated that HD computing-based ML algorithms can be applied to clustering and semi-supervised learning as well [SemiHD]. The basis of QubitHD is described in Figure 2. The core difference to QuantHD is the binarization step that is discussed in Section III. The non-binarized algorithm with retraining consists of the following steps:
The training dataset is pre-processed by converting all datapoints into the very high-dimensional vectors (hypervectors). We used hypervectors of length in this paper, as it is the standard baseline for all HD computing-based machine learning algorithms. Like explained in [QuantHD], the original data is assumed to have features: . The objective is encoding each feature that corresponds to each datapoint into the hypervector of dimension ( in this paper). Each feature vector ”memorizes” the value and position of the relevant feature. In order to take into account the position of each feature, we use a set of randomly generated base hypervectors , where is the total number of features in each data point (). Since the base-hypervectors are uniformly generated at random (with equal probability for and ), they are all mutually orthogonal [kanerva2000random]. The cosine product between hypervectors ranges between . The expected cosine product of independent and randomly generated base-hypervectors is , whereas the variance is for (random walk) for . Thereby, the hypervectors are almost orthogonal. This is true only when the number of randomly generated base-hypervectors is significantly smaller than the dimension of the entire vectors space . For comparison of the binarized hypervectors, we use the Hamming distance. Therefore:
Here is the Hamming distance similarity between the two binarized base-hypervectors.
We also distinguish the actual value of each feature with different hypervectors. One way of doing so is the following. We find the minimum and maximum value of the feature across the entire dataset, and generate two distinct (and random) base hypervectors representing those two values. Every feature which has a value in-between the minimum and maximum, will be associated with a hypervector proportionate to the two and base-hypervectors. Feature values that are close to the minimum will be highly correlated with the base-hypervector corresponding to the minimum value (, ). Feature values just in the middle between the minimum and the maximum will be 50% correlated with both base-hypervectors (, ). The equivalent principle applies to feature values close to the maximum value (, ).
Finally, we add all of the results for all the features:
where denotes the XOR operation.
II-B Initial training
The first training round is performed by summing up all hypervectors pertaining to the same class. That is, we abstract all hypervectors with the same labels. This method is called one-shot learning and is, at the moment, the most widespread way of using HD computing in machine learning [ISLPED, Mitra_HD_1, Mitra_HD_2, rahimi2016hyperdimensional, ISSCC]. We now have one matrix of size (), where is the number of existing classes and is the length of the hypervectors.
The classification accuracy of the current model during the inference is low [QuantHD]. For this reason, we have to do retraining. As displayed in Figure 3, retraining is done in the following way. We go through the entire dataset of encoded datapoints and test them to ascertain if they are correctly classified by the current model . For every misclassified datapoint, we have to make additional improvements to the model. Let us assume that the correct label of a datapoint is , but it was incorrectly classified as . We now add the erroneously classified hypervector to its corresponding row . (to make them more similar). We also subtract the incorrectly classified hypervector from the row corresponding to the inaccurately predicted class (to make them more distinct). In order to decrease the convergence of time and noise, it is common practice to introduce a learning rate in this step as illustrated in Figure 3a [imani2019adapthd]. This process is repeated several times.
During the inference, we predict the class to which the datapoint belongs. This datapoint is encoded as described in II-A, and then compared to all the class hypervectors. The algorithm then predicts the class with the largest cosine similarity.
So far, we described in the algorithm that the trained model has non-binarized elements with floating point values. Many existing HD computing methods [rahimi2017high, imani2017exploring, rahimi2017hyperdimensional2] binarize the class hypervectors to eliminate costly cosine operations used for the associative search (). Binary hypervectors do not provide sufficiently high classification accuracies on many (if not most) real-world applications. The usual way of binarizing class hypervectors is making all positive values equal to and negative values equal to . This method suffers from significant loss of information about the trained model. To the best of our knowledge, [QuantHD] was the first publication demonstrating a method of achieving high classification accuracy while using a binarized (or quantized) HD model. Instead of just ”blindly” binarizing the class hypervectors after every re-training iteration, QuantHD trains the model in a way that is optimized for binarized hypervectors. That is, during every single retraining iteration, they create an additional binarized model. Doing so requires no additional computational power, as the binary representation of numbers in usual computer architectures reserves the first bit for the sign ( stands for positive, for negative). The QuantHD algorithm retrains on the predictions of the binarized model, while updating the non-binarized model as described in subsection II-C and Figure 3. The binarized model achieves, after several iterations, very high classification accuracies. They are significantly higher than they would be without binary-optimized retraining.
III Stochastic binarization
Motivation: In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of a quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system, solely entailing two possible states. An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. In a classical system, the bit would have to be either in one state or the other (either logical or ). A quantum system can be in any superposition of those two states. When measuring the -component of the spin of a particle, which can have any value in the range , the quantum state is going to collapse in either the or the state, with a probability directly related to before the measurement.
This leaves us with a very interesting property: there exists a ”true” z-component of the state, which is equal to the projection of the quantum state on the z-axis. This gave us the following idea: why don’t we use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words . The QuantHD algorithm uses the following (very trivial) binarization function:
We propose using the following method instead:
where is the cutoff value defined as a fixed fraction of the standard deviation of the data. It is discussed in greater detail in subsection IV-B. The advantage of doing so is the fact that the expected value of for is proportional to x:
IV Proposed QubitHD algorithm
The QuantHD algorithm still leaves us with a significant gap between the maximum classification accuracy of the floating-point model and the binarized one. Also, the QuantHD retraining method described in [QuantHD] tends to get ”stuck” in local minima. Further, their algorithm almost doubles the average convergence rate, which hereafter increases the energy consumption during training.
To summarize, here are the main problems with the QuantHD algorithm, which the QubitHD algorithm can either solve or improve:
There still exists a significant gap between the binarized model and non-binarized model accuracy
The algorithm can sometimes get stuck in local minima, which makes it unrealiable.
The convergence time of QuantHD algorithm is almost twice as slow as compared to the other state-of-the-art HD computing algorithms with retraining
IV-A Overview of the QubitHD algorithm
In this section, we will present the QubitHD algorithm. It enables efficient binarization of the HD model with particularly minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:
Binarized retraining The retraining steps compensate for the classification accuracy loss during the previous step. Binarization through Equation (3) is performed on the model after every single retraining iteration. This ensures a fast convergence towards a consistently high classification accuracy of the ML model.
IV-B Framework of the QubitHD algorithm
In this section, we present the QubitHD algorithm. It enables efficient binarization of the HD model with minor impact on the classification accuracy. The algorithm is based on QuantHD and consists of four main steps:
2) Initial training: QubitHD trains the class hypervectors by summing all the encoded datapoints corresponding to the same class as seen in Figure 3a
It is evident from Figure 3a that every accumulated hypervector represents one class. As explained in [QuantHD], in an application with classes, the initially trained HD model contains non-binarized hypervectors , where ().
3) Stochastic binarization: This part is the main change with respect to the QuantHD algorithm. A given class hypervector is created by summing together random hypervectors of the type . Every element in a class hypervector (of class i) is the product of a ”random walk”. In other words, its distribution follows a binomial distribution with a probability mass function (pmf):
where (as we have equal probabilities for and ), n is the number of randomly summed hypervectors for class i, and k is a possible value can take. Note that .
The expected value , while the standard deviation is
Assuming that the number of hypervectors corresponding to every class in the dataset is large enough, the normal distribution is a good approximation for modeling the binomial distribution:
In previous publications [QuantHD], the way of binarizing a model was described by Equation 2. We instead propose using Equation 3 shown in Figure 4. Implementing this change requires almost no additional resources (the random flips have to be performed only once per retraining round), but leads to significant improvements in terms of accuracy, reliability, speed and energy efficiency. The accuracy improvement is due to the fact that the expected value of this stochastically binarized model is equal to the non-binarized model. The reliability and speed improvement are due to the fact that the model quickly ”jumps” out of local minima, as opposed to getting stuck for several iterations. The energy consumption during training depends on the number of retraining iterations, which are significantly reduced.
Just as we have demonstrated, the encoded data we are processing is (approximately) normally distributed. In order to be able to use Equation 3 for the binarization process, we have to define a ”cutoff” value . That is, everything above will become , and everything below will become . Only values between and will be better approximated through Equation 3. In most cases, has to be smaller than the standard deviation of the data distribution . If we would use , our model would become almost completely random, as most of the values are contained in the interval (68% to be precise).
The reason why the binarization works better than the lies in the fact that, taking into account the expected value of the binarized model, it is equal to actual values in the non-binarized model, with the exception of values below and above . Empirically, we also noticed that the randomness of prevents the algorithm from getting stuck in local minima during training, which reduces the convergence time by 50% on average.
What if we use the encoding method that allows the hypervectors of the encoded datapoints to be floating-point values, as opposed to values only? In this case, (due to randomness) it is to be assumed that we are sampling from an approximately uniform distribution. This assumption is only reasonable, if the encoding scheme is also based on random base hypervectors . The resulting probability distribution won’t be a discrete binomial distribution, but rather a slightly modified version of the continuous Irwin-Hall distribution [hall1927distribution] given by the probability density function (pdf):
where n is the number of uniform intervals summed together. Just as with the binomial distribution, the Irwin-Hall distribution also converges towards a normal distribution for large , with mean and standard deviation . It is possible to show that for a large , this distribution converges towards the normal distribution. This was to be expected because of the central limit theorem. Therefore, even in the case of floating-point numbers in the encoding, we can use the same QubitHD technique.
V Possible FPGA Implementation
It is known that HD computing-based machine learning algorithms can be implemented in a wide range of different hardware platforms, such as CPUs, GPUs and FPGAs. As most of the training and all of the inference rely on bit-wise operations, it was proposed in [QuantHD] that FPGAs would be a suitable candidate for the efficient hardware acceleration. The same hardware can be used for implementing both, the QuantHD and the QubitHD algorithm. This is also one of the major advantages of QubitHD, as it doesn’t require costly hardware upgrades from previous models.
VI-A Experimental Setup
The training and inference of the algorithm were implemented and verified using Verilog and the code was synthesized on the Kintex-7 FPGA KC705 Evaluation Kit. The Vivado XPower tool has been used to estimate the device power. Additionally, for testing purposes, all parts of the QubitHD algorithm have been implemented on CPU. We also implemented the algorithm on an embedded device (Rasberry Pi 3) with an ARM Cortex A54 CPU. For the purposes of making an accurate and fair comparison, we use the following FPGA-implemented algorithms as baselines:
The QuantHD algorithm from [QuantHD], on which QubitHD is based
Other state-of-the-art HD computing-based machine learning algorithms [ISLPED, imani2019adapthd, morris2019comphd]
A multi level perceptron (MLP) [sharma2016high] (see Table III)
A binary neural network (BNN) [umuroglu2017finn] (see Table III)
With a view to show the advantage of the QubitHD and the previous [QuantHD] algorithm, we used the datasets summarized in Table I. The datasets range from small datasets like UCIHAR and ISOLET (frequently used in IoT devices, for which QubitHD is specially created), to the larger datasets like face recognition.
|Data Size||Train Size||Test Size||Description|
|ISOLET||617||26||19MB||6,238||1,559||Voice Recognition [Isolet]|
|MNIST||784||10||220MB||60,000||10,000||Handwritten digits [lecun2010mnist]|
|EXTRA||225||4||140MB||146,869||16,343||Phone position recognition[vaizman2017recognizing]|
The evaluation of the baseline HD model provides high-classification accuracy when using non-binarized hypervectors for classification. The problem, however, is that retraining and inference with a non-binary class hypervector is very costly, as it requires calculating cosine similarities11 1 The cosine similarity of vectors and is calculated as , where is the dot product and is the absolute value of the vector.. That is, for -bit integers or floating point numbers basic operations need to be performed through every step. These are costly and impractical on small-scale and battery-powered devices.
Similarly, during inference, the associative search between query and trained model requires the calculation of the costly cosine similarities. To address this issue, many HD computing-based machine learning algorithms binarize their models [ISLPED]. That way, the cosine similarity is replaced by a simple Hamming distance similarity check. The key problem with this approach is that it leads to a significant decrease in classification accuracy, as shown in Table II.
The authors of [QuantHD] already showed the existence of a partial solution to this problem, which involves simultaneously retraining the non-binarized model, while updating the binarized model. We already listed the problems with this model in subsection IV. What especially motivated us to create the stable and more reliable QubitHD algorithm, is the fact that the QuantHD algorithm’s retraining is unstable and unreliable. After extensively testing the QubitHD algorithm, we conclude that it on average closes the gap of classification accuracy by 38.8% as compared to the baseline HD computing-based machine learning algorithms in [ISLPED] using the QuantHD framework (See Table II).
Additionally, we observe that the accuracies of the QubitHD algorithm, using a binary model, are 1.2% and 60% higher than the classification accuracies of the baseline HD computing-based algorithm using non-quantized and binary respectively.
VI-C Training Efficiency
The training efficiency of HD-based algorithms is characterised by an initial training of the model and subsequent retraining.
Algorithms in this type all consume the same energy during the generation of the initial training model
The significant cost is the retraining: compared to the non-binarized model, QuantHD uses less operations when calculating the hypervector similarities (step 4 in Figure 3).
No complex cosine similarity has to be computed as calculating the Hamming distance is sufficient to determine whether there was a correct classification or not
The improvement of QubitHD lies in the faster convergence to a high classification accuracy, which is 30-50% faster than in QuantHD and also decreases the energy consumption after the initial training proportionally.
The QubitHD modification has dual benefit. It makes it possible to save energy and time during training, whilst achieving the same or better classification accuracies during testing depending on whether the goal is rapid convergence or high classification accuracy.
VI-D Inference Efficiency
Compared to QuantHD, there is no gain or loss in the time execution or energy efficiency, since the models behave identically once they are trained. So we report the same 44 energy efficiency improvement and 5 speedup as compared to the non-binarized HD algorithm.
VI-E QubitHD comparison with MLP and BNN
QubitHD is a classifier intended to run on low powered devices, specifically with the goal of the low energy consumption and fast and efficient execution in mind. As such we set out to compare QubitHD, not only to Quant HD, but also other non-HD light-weight classifiers. In our analysis we compared QubitHD accuracy and efficiency with the state-of-the-art light-weight classifiers, including Multi-Layer Perceptron (MLP) and Binarized Neural Network (BNN). For MLP and BNN, we aimed at using the same metric as employed in [umuroglu2017finn] with the small modification in input and output layers in order to run different applications. The results of this, presented in Table III, indicate that QubitHD, while having similar classification accuracies to very light-weight classifier BNNs and MLPs, drastically reduces CPU usage during training and execution time during the inference. In particular, compared to MLPs QubitHD uses 12 less CPU during training and is 84 faster during the inference on FPGAs. In comparison with BNNs, QubitHD uses a factor of 101 less CPU time during training and is still 20 faster during the inference.
Machine learning algorithms, based on the Brain-inspired Hyperdimensional (HD) computing, imitate cognition by exploiting statistical properties of very high-dimensional vector spaces. They are a promising solution for energy-efficient classification tasks. A weakness of existing HD computing-based ML algorithms is the fact that they have to be binarized for achieving very high energy-efficiency. At the same time, binarized models reach lower classification accuracies. In order to solve the problem of the trade-off between the energy-efficiency and classification accuracy, we propose the QubitHD algorithm. With QubitHD, it is possible to use binarized HD computing-based ML algorithms, which provide virtually the same classification accuracies as their non-binarized counterparts. The algorithm is inspired by stochastic quantum state measurement techniques. The improvement of QubitHD is a duality and is reflected in the quicker convergence and the higher and more stable classification accuracy achieved, as compared to QuantHD.
Our main contributions are:
The FPGA implementation of QubitHD provides on average a 65% improvement in terms of energy-efficiency, and a 95% improvement in terms of training time, as compared to the most recent state-of-the-art HD computing-based machine learning algorithm QuantHD [QuantHD]
When compared with state-of-the-art low-cost classifiers like Binarized Neural Networks (BNN) and Multi-Layer Perceptrons (MLP), QubitHD offers a similar classification accuracy whilst reducing training time by and allows for faster inference when testing.
QubitHD decreases the classification accuracy gap between state-of-the-art binarized and non-binarized HD models by almost half.
QubitHD converges on average faster during training, thus significantly decreasing the energy consumption.
Appendix A Appendix A: Quantum bit (Qubit) measurements
Appendix A: Quantum bit (Qubit) measurements:
In the field of quantum information theory, a quantum bit (or qubit) is the fundamental unit of a quantum information. A qubit is the quantum equivalent of the classical binary bit. It is a two-level quantum-mechanical system, solely entailing two possible states (usually corresponding to two distinct energy levels). An example of this is the spin of a particle in a magnetic field, in which the two states can be taken as spin up and spin down. Another example could be an atom which can be in the ground state (with low energy) and in the excited state (with a higher energy level). In a classical system, the bit would have to be either in one state or the other (either logical or ). A quantum system can be in an superposition of those two states. Let us define two possible states of the system as and . An arbitrary quantum state can be in a superposition of these two states: , where . For the sake of completeness, let us also define the counterpart of , which is . is called the ket-notation, while is called the bra-notation. As a quantum state is directly related to the probability of finding the system in a certain state after measurement, every quantum state has to be normalized. That is, .
It might seem as if there were four degrees of freedom in , since there are two complex numbers with two degrees of freedom. However, one degree of freedom is removed by the normalization constraint , and the other degree of freedom is removed by the fact that the arbitrary overall phase of the quantum state doesn’t matter, as it has no influence on any physical observables in the one-qubit case. Hence, we can represent the state of the quantum bit as a point on a sphere of radius . This is the so called Bloch sphere, displayed in the Figure (7).
For simplicity, we are going to introduce new coordinates for representing the quantum state:
Here and are the polar coordinates.
While an ordinary (classical) bit can store only one ”piece of information”, a quantum state is much more powerful, as it can be in any state with and . The problem with quantum states is the fact that we cannot measure or observe which state they are in. Whenever we attempt to observe which state a qubit is in, it is going to collapse into a (classical) bit state. In the Bloch representation, whenever we attempt to measure the z-component of the qubit, it will either collapse into the state, or into the state. The probability of it collapsing into state is . Analogy, the probability of it collapsing into state is .
This leaves us with a very interesting property: there exists a ”true” z-component of the state, which is equal to the projection of the quantum state on the z-axis: . This gave us the following idea: why don’t we use the quantum measurement technique for binarizing the HD model? Doing so would allow us to have a binarized model whose expected value would be equal to the non-binarized model. In other words . This is the reason why we use Eq. 3 for binarization, as opposed to Eq. 2.
It is similar to the property we observe when measuring a quantum state in the direction. The probability of seeing one of the two possible measurement outcomes ( and ) is proportional to the projection on the z-axis of the quantum state, which we cannot measure directly.