UNet FixedPoint Quantization for Medical Image Segmentation
Abstract
Model quantization is leveraged to reduce the memory consumption and the computation time of deep neural networks. This is achieved by representing weights and activations with a lower bit resolution than their high precision floating point counterparts. The suitable level of quantization is directly related to the model performance. Lowering the quantization precision (e.g. to 2 bits) reduces the amount of memory required to store model parameters and the amount of logic required to implement computational blocks, which contributes to reducing the power consumption of the entire system. These benefits typically come at the cost of reduced accuracy. The main challenge is to quantize a network as much as possible while maintaining its accuracy. In this work, we present a quantization method for the UNet architecture, a popular model in medical image segmentation. We then apply our quantization algorithm to three datasets: (1) the Spinal Cord Gray Matter Segmentation (GM), (2) the ISBI challenge for segmentation of neuronal structures in Electron Microscopy (EM), and (3) the public National Institute of Health (NIH) dataset for pancreas segmentation in abdominal CT scans. The reported results demonstrate that with only 4 bits for weights and 6 bits for activations, we obtain an 8-fold reduction in memory requirements while losing only $2.21\%$, $0.57\%$ and $2.09\%$ dice overlap score for EM, GM and NIH datasets respectively. Our fixed point quantization provides a flexible trade-off between accuracy and memory requirement which is not provided by previous quantization methods for UNet such as TernaryNet. Our code will be released at https://github.com/hossein1387/UNetFixedPointQuantizationforMedicalImageSegmentation
Keywords:
UNet, Quantization, Deep Learning
1 Introduction
Image segmentation, the task of specifying the class of each pixel in an image, is one of the active research areas in the medical imaging domain. In particular, image segmentation for biomedical imaging allows identifying different tissues, biomedical structures, and organs from images to help medical doctors diagnose diseases. However, manual image segmentation is a laborious task. Deep learning methods have been used to automate the process and alleviate the burden of segmenting images manually.
The rise of Deep Learning has enabled patients to have direct access to personal health analysis [10.1093/bib/bbx044]. Health monitoring apps on smart phones are now capable of monitoring medical risk factors. Medical health centers and hospitals are equipped with pretrained models used in medical CADs to analyse MRI images [thalers.menkovskiv.2019]. However, developing a high precision model often comes with various costs, such as a higher computational burden and a large model size. The latter requires many parameters to be stored in floating point precision, which demands substantial hardware resources to store and process images at test time. In medical domains, images typically have high resolution and can also be volumetric (the data has a depth in addition to width and height). Quantizing the neural networks can reduce the feedforward computation time and, most importantly, the memory burden at inference. After quantization, a high precision (floating point) model is approximated with a lower bit resolution model. The goal is to leverage the advantages of the quantization techniques while maintaining the accuracy of the full precision floating point models. Quantized models can then be deployed on devices with limited memory such as cellphones, or facilitate processing higher resolution images or bigger volumes of 3D data with the same memory budget. Such methods can reduce the memory required to store model parameters by up to 32x. In addition, the amount of hardware resources (the number of logic gates) required to perform low precision computing is much smaller than for a full precision model [DBLP:journals/corr/HubaraCSEB16].
In this paper, we propose a fixed point quantization of UNet [ronneberger2015u], a popular segmentation architecture in the medical imaging domain. We provide comprehensive quantization results on the Spinal Cord Gray Matter Segmentation Challenge [gm_orig], the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks [10.1371/journal.pbio.1000502], and the public National Institute of Health (NIH) dataset for pancreas segmentation in abdominal CT scans [deeporgan.2015]. In summary, this work makes the following contributions:

• We report the first fixed point quantization results on the UNet architecture for the medical image segmentation task and show that the current quantization methods available for UNet are not efficient for the common hardware in the industry.

• We quantify the impact of quantizing the weights and activations on the performance of the UNet model on three different medical imaging datasets.

• We report results comparable to a full precision segmentation model by using only 6 bits for activation and 4 bits for weights, effectively reducing the weights size by a factor of $8\times $ and the activation size by a factor of $5\times $.
2 Related Works
2.1 Image Segmentation
Image segmentation is one of the central problems in medical imaging [pham2000current], commonly used to detect regions of interest such as tumors. Deep learning approaches have obtained state-of-the-art results in medical image segmentation [litjens2017survey, shen2017deep]. One of the most widely used architectures for image segmentation is UNet [ronneberger2015u], along with equivalent architectures proposed around the same time: ReCombinator Networks [honari2016recombinator], SegNet [badrinarayanan2015segnet], and DeconvNet [noh2015learning], all proposed to maintain pixel level information that is usually lost due to pooling layers. These models use an encoder-decoder architecture with skip connections, where the information in the encoder path is reintroduced by skip connections in the decoder path. This architecture has proved to be quite successful for many applications that require full image reconstruction while changing the modality of the data, as in image-to-image translation [isola2017image], semantic segmentation [ronneberger2015u, badrinarayanan2015segnet, noh2015learning] or landmark localization [honari2016recombinator, newell2016stacked]. Since all the aforementioned models propose the same architecture, for simplicity we refer to them as UNet models. UNet type models have been very popular in the medical imaging domain and have also been applied to the 3 dimensional (3D) segmentation task [cciccek20163d]. One problem with UNet is its high memory usage due to full image reconstruction. All encoded features must be kept in memory and then used while reconstructing the final output. This approach can be quite demanding, especially for high resolution or 3D images. Quantization of weights and activations can reduce the memory required by this model, making it possible to process images with a higher resolution or with a bigger 3D volume at test time.
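To give a rough sense of this memory pressure, the sketch below estimates how much storage the encoder feature maps need at different activation bit widths. The three shapes are copied from the layer listing in S.2 for a single 200x200 input. Which tensors are actually retained for the skip connections, and the helper names `feature_map_bytes` and `skip_memory_mib`, are assumptions made for illustration only.

```python
# Hypothetical helper: storage in bytes for one feature map at a given bit width.
def feature_map_bytes(channels, height, width, bits):
    return channels * height * width * bits / 8

# Encoder outputs assumed to be kept for the skip connections
# (shapes copied from the S.2 layer listing, batch size 1).
SKIP_SHAPES = [(64, 200, 200), (128, 100, 100), (256, 50, 50)]

def skip_memory_mib(bits):
    """Total storage of the assumed skip tensors, in MiB."""
    total = sum(feature_map_bytes(c, h, w, bits) for c, h, w in SKIP_SHAPES)
    return total / 2 ** 20

for bits in (32, 8, 6, 4):
    print(f"{bits:2d}-bit activations: {skip_memory_mib(bits):6.2f} MiB")
```

Under these assumptions the ratio between 32-bit and 6-bit storage is 32/6, or roughly 5.3, independent of the exact set of tensors that is kept.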
2.2 Quantization for Medical Imaging Segmentation
There are two approaches to quantizing a neural network, namely deterministic quantization and stochastic quantization [DBLP:journals/corr/HubaraCSEB16]. Although DNN quantization has been thoroughly studied [DBLP:journals/corr/HubaraCSEB16, DBLP:journals/corr/ZhouYGXC17, NIPS20155647], little effort has been made to develop quantization methods for medical image segmentation. In the following, we review recent works in this field.
Quantization in Fully Convolutional Networks: Quantization has been applied to Fully Convolutional Networks (FCN) in biomedical image segmentation [xu2018quantization]. First, a quantization module is added to the suggestive annotation in FCN. In suggestive annotation, instead of using the original dataset, a representative training dataset is used, which in turn increases the accuracy. Next, FCN segmentations are quantized using Incremental Quantization (INQ). Authors report that suggestive annotation with INQ using 7 bits results in accuracy close to or better than those obtained with a full precision model. In FCN, features of different resolutions are upsampled back to the image resolution and merged together right before the final output predictions. This approach is suboptimal compared to the UNet which upsamples features only to one higher resolution, allowing the model to process them before they are passed to higher resolution layers. This gradual resolution increase in reconstruction acts as a conditional computation, where the features of higher resolution are computed using the lower resolution features. As reported in [honari2016recombinator], this process of conditional computation results in faster convergence time and increased accuracy in the UNet type architectures compared to the FCN type architectures. Considering the aforementioned advantages of UNet, in this paper we pursue the quantization of this model.
UNet Quantization: In [DBLP:journals/corr/abs180109449], the authors propose the first quantization for UNet. They introduce 1) a parameterized ternary hyperbolic tangent to be used as the activation function, and 2) a ternary convolutional method that calculates matrix multiplication very efficiently in the Hamming space. They report a 15-fold decrease in the memory requirement as well as a 10x speedup at inference compared to the full precision model. Although this method shows a significant performance boost, in Section 4 we demonstrate that it is not an efficient method for the currently available CPUs and GPUs.
3 Method
We propose fixed point quantization for UNet. We start with a full precision (32 bit floating point) model as our baseline. We then use the following fixed point quantization function to quantize the parameters (weights and activations) in the inference path:
$quantize(x,n)=(round(clamp(x,n)\ll n))\gg n$  (1) 
where the $round$ function projects its input to the nearest integer, and $\ll $ and $\gg $ are the shift left and shift right operators, respectively. In our simulation, shift left and right are implemented by multiplication and division by powers of 2. The $clamp$ function is defined as:
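$clamp(x,n)=\begin{cases}{2}^{n}-1 & \text{when } x\ge {2}^{n}-1\\ x & \text{when } 0\le x<{2}^{n}-1\\ 0 & \text{when } x<0\end{cases}$  (2) 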
Equation (1) quantizes an input $x\in \mathbb{R}$ to the closest value that can be represented by $n$ bits. To map any given number $x$ to its fixed point value we first split the number into its fractional and integer parts using:
${x}_{f}=abs(x)-floor(abs(x)),\quad {x}_{i}=floor(abs(x))$  (3) 
and then use the following equation to convert $x$ to its fixed point representation using the specified number of bits for the integer ($ibits$) and fractional ($fbits$) parts:
$to\mathrm{\_}fixed\mathrm{\_}point(x,ibits,fbits)=sign(x)*quantize({x}_{i},ibits)+sign(x)*quantize({x}_{f},fbits)$  (4) 
Equation (4) is a fixed point quantization function that maps a floating point number $x$ to the closest fixed point value with $ibits$ integer and $fbits$ fractional bits. Throughout this paper, we use the ${Q}^{p}i.f$ notation to denote a fixed point quantization of parameter $p$ that uses $i$ bits to represent the integer part and $f$ bits to represent the fractional part. In our experiments, we did not benefit from incremental quantization (INQ) as explained in [DBLP:journals/corr/ZhouYGXC17]. Although this method could work for higher precision models, for instance when using fixed point ${Q}^{w}8.8$ (quantizing weights with 8 integer bits and 8 fractional bits), for extreme quantization as in ${Q}^{w}0.4$, learning from scratch gave us the best accuracy with the shortest training time. As shown in Figure S1, in the full precision case the weights of all UNet layers lie in $[-1,1]$, hence no integer bits are required for the weight quantization.
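The quantization pipeline of this section can be sketched in a few lines of plain Python. This is an illustrative reading of the equations, not the released implementation. It assumes that $clamp$ saturates to the unsigned range $[0, 2^{n}-1]$, that rounding is round-half-up, and, as in the simulation described above, it replaces the shifts by multiplication and division by $2^{n}$.

```python
import math

def clamp(x, n):
    """Saturate x to [0, 2**n - 1] (assumed form of the clamp function)."""
    return max(0.0, min(float(x), 2.0 ** n - 1.0))

def quantize(x, n):
    """Clamp, scale by 2**n (shift left), round, scale back (shift right)."""
    scaled = clamp(x, n) * 2 ** n
    return math.floor(scaled + 0.5) / 2 ** n  # round half up (an assumption)

def to_fixed_point(x, ibits, fbits):
    """Quantize integer and fractional parts separately, then restore the sign."""
    sign = -1.0 if x < 0 else 1.0
    x_i = math.floor(abs(x))   # integer part
    x_f = abs(x) - x_i         # fractional part
    return sign * quantize(x_i, ibits) + sign * quantize(x_f, fbits)

# Q0.4 weights: no integer bits, 4 fractional bits (a grid with step 1/16).
print(to_fixed_point(0.3, ibits=0, fbits=4))    # 0.3125
print(to_fixed_point(-0.3, ibits=0, fbits=4))   # -0.3125
# Q6.0 activations: 6 integer bits, no fractional bits (saturates at 63).
print(to_fixed_point(100.2, ibits=6, fbits=0))  # 63.0
```

With `ibits=0` the integer part is always clamped to zero, which is why this setting only suits values whose magnitude stays below 1.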
3.1 Training
For numerical stability and to verify that gradients can propagate during training, we show that our quantization is differentiable almost everywhere. Starting from Equation (2), the derivative is:
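$\frac{\partial clamp(x,n)}{\partial x}=\begin{cases}1 & \text{when } 0\le x<{2}^{n}-1\\ 0 & \text{otherwise}\end{cases}$  (5) 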
which is differentiable except at the thresholds. To let gradients flow everywhere during training, a straight-through estimator (STE), introduced in [STECoursera], is used, which passes gradients over the thresholds and also over the $round$ function in Equation (1).
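The effect of the straight-through estimator can be illustrated without any deep learning framework. In the toy sketch below, a single latent full precision weight is trained so that its quantized value approaches a target. The forward pass uses the rounded weight, while the backward pass treats the derivative of the rounding step as 1. The function names, the squared-error loss and the hyperparameters are all hypothetical choices for this illustration.

```python
import math

def quantize_frac(w, fbits):
    """Forward pass: round w to the nearest multiple of 2**-fbits."""
    scale = 2 ** fbits
    return math.floor(w * scale + 0.5) / scale

def train_with_ste(target, fbits=4, lr=0.1, steps=200):
    """Fit one quantized weight to `target` under a squared-error loss."""
    w = 0.0  # latent full precision weight updated by gradient descent
    for _ in range(steps):
        w_q = quantize_frac(w, fbits)      # the forward pass sees the quantized value
        grad_w_q = 2.0 * (w_q - target)    # d(loss)/d(w_q) for loss = (w_q - target)**2
        grad_w = grad_w_q                  # STE: d(w_q)/d(w) is taken to be 1
        w -= lr * grad_w
    return w, quantize_frac(w, fbits)

w, w_q = train_with_ste(target=0.3)
print(f"latent weight {w:.4f}, quantized weight {w_q}")
```

Without the estimator, the true derivative of the rounding step is zero almost everywhere, so the latent weight would never move away from its initial value. With it, the quantized weight settles on a grid point next to the target.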
3.2 Observations on UNet Quantization
3.2.1 Dropout
Dropout [srivastava2014dropout] is a regularization technique to prevent overfitting of DNNs. Although it is used in the original implementation of UNet, we found that applying it together with quantization reduces the accuracy considerably. Hence, in our implementation we removed dropout from all layers. This is because quantization already acts as a strong regularizer, as reported in [DBLP:journals/corr/HubaraCSEB16], so further regularization with dropout is not required. As shown in Figure S2, dropout reduces the accuracy for each quantized precision, with the gap being even larger for lower precision quantizations.
3.2.2 Full Precision Layers
It is well accepted to keep the first and the last layers in full precision, when applying quantization [DBLP:journals/corr/HubaraCSEB16, Tang2017HowTT]. However, we found that in the segmentation task, keeping the last layer in full precision has much more impact than keeping the first layer in full precision.
3.2.3 Batch Normalization
Batch normalization is a technique that improves the training speed and accuracy of DNNs. We used the PyTorch implementation of batchnorm. In training, we place the quantization block after the batchnorm block in each layer (S.2 lists all the layers in our UNet implementation), such that batchnorm is first applied using floating point calculations and then the quantized value is sent to the next layer (hence the batchnorm block is not quantized during training). However, at inference, PyTorch folds the batchnorm parameters into the weights, effectively including batchnorm parameters in the quantized model as part of the quantized weights.
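The folding step mentioned above rests on a standard algebraic identity, sketched here in plain Python for one output channel at a time. The function names and the list-based weight layout are hypothetical and chosen only for this illustration. The sketch does not reproduce any particular PyTorch routine, and the folded weights would still have to be quantized afterwards.

```python
import math

def fold_batchnorm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batchnorm statistics into weights and biases.

    weights: one list of kernel values per output channel.
    The remaining arguments hold one value per output channel.
    """
    folded_w, folded_b = [], []
    for w_c, b_c, g, bt, mu, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_c])
        folded_b.append((b_c - mu) * scale + bt)
    return folded_w, folded_b

def conv_channel(x, w_c, b_c):
    """Dot product of a flattened input patch with one channel's kernel, plus bias."""
    return sum(xi * wi for xi, wi in zip(x, w_c)) + b_c

def batchnorm(y, g, bt, mu, v, eps=1e-5):
    """Inference-time batchnorm of a single value with fixed statistics."""
    return g * (y - mu) / math.sqrt(v + eps) + bt

# Check on one channel that convolution followed by batchnorm equals the folded convolution.
x = [1.0, 2.0, -1.0]
w, b = [[0.5, -0.25, 0.1]], [0.1]
gamma, beta, mean, var = [2.0], [0.3], [0.2], [4.0]
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
reference = batchnorm(conv_channel(x, w[0], b[0]), gamma[0], beta[0], mean[0], var[0])
folded = conv_channel(x, fw[0], fb[0])
print(reference, folded)
```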
4 Results and Discussion
We implemented the UNet model and our fixed-point quantizer in PyTorch. We trained our model for 200 epochs with a batch size of 4. We applied our fixed point quantization along with TernaryNet [DBLP:journals/corr/abs180109449] and Binary [NIPS20155647] quantization on three different datasets: GM [gm_orig], EM [10.1371/journal.pbio.1000502], and NIH [deeporgan.2015]. For the GM and EM datasets, we used an initial learning rate of $1e-3$ and for NIH we used an initial learning rate of $0.0025$. For all datasets we used Glorot weight initialization and a cosine annealing scheduler to reduce the learning rate during training. See our repository for the model and training details.
The NIH pancreas [deeporgan.2015] dataset is composed of 82 3D abdominal CT scans and their corresponding pancreas segmentation images. Unfortunately, we did not have access to the preprocessed dataset described in [DBLP:journals/corr/abs180109449]. Instead, we extracted 512x512 2D slices from the original dataset and applied region of interest cropping to obtain 7059 2D images of size 176x112, which are split into training and testing sets (80% and 20%, respectively). For the GM and EM datasets, we used the provided datasets as described in [gm_orig] and [10.1371/journal.pbio.1000502] respectively. For both the EM and GM datasets, we did not use any region of interest cropping and we used images of size 200x200.
The segmentation task for the GM and NIH pancreas datasets is class-imbalanced. As suggested in [gm_orig], instead of weighted cross-entropy, we used a surrogate loss for the dice similarity coefficient. This loss is referred to as the dice loss and is formulated as ${\mathcal{L}}_{dice}=-\frac{2{\sum}_{n=1}^{N}{p}_{n}{r}_{n}+\epsilon}{{\sum}_{n=1}^{N}{p}_{n}+{\sum}_{n=1}^{N}{r}_{n}+\epsilon}$, where ${p}_{n}\in [0,1]$ and ${r}_{n}\in \{0,1\}$ are the prediction and ground truth pixels respectively (with $0$ indicating not belonging and $1$ indicating belonging to the class of interest) and $\epsilon$ is a small constant added for numerical stability. For the EM dataset, using a weighted sum of cross-entropy and dice loss produced the best results.
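As a small worked example, the dice loss can be computed for flattened prediction and ground truth lists as follows. The sketch assumes the loss carries a leading minus sign, so that minimizing it maximizes the overlap. The function name and the default value of `eps` are hypothetical.

```python
def dice_loss(pred, truth, eps=1e-7):
    """Negative soft dice coefficient between predictions in [0, 1] and labels in {0, 1}."""
    intersection = sum(p * r for p, r in zip(pred, truth))
    return -(2.0 * intersection + eps) / (sum(pred) + sum(truth) + eps)

perfect = dice_loss([1.0, 0.0, 1.0], [1, 0, 1])    # full overlap, loss close to -1
disjoint = dice_loss([0.0, 1.0, 0.0], [1, 0, 1])   # no overlap, loss close to 0
partial = dice_loss([0.8, 0.1, 0.6], [1, 0, 1])    # soft predictions
print(perfect, disjoint, partial)
```

Because both sums in the denominator grow with the size of the foreground only, the loss is not dominated by the large number of background pixels, which is the reason it is preferred for imbalanced segmentation tasks.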
Figure 1 and Table 1 compare different quantization methods on the aforementioned datasets. For the NIH dataset, Figure 1 (top) and Table 1 show that despite using only 1 and 2 bits to represent network parameters, Binary and TernaryNet quantizations produce results that are close to the full precision model. However, for the other datasets, our fixed point ${Q}^{a}6.0$, ${Q}^{w}0.4$ quantization surpasses Binary and TernaryNet quantization. The other important factor here is how efficiently these quantization techniques can be implemented on current CPU and GPU hardware. At the time of writing this paper, there is no commercially available CPU or GPU that can efficiently store and load sub-8-bit parameters of a neural network, which leaves us to use custom bit manipulation functions to make sub-8-bit quantization more efficient. Moreover, in the case of TernaryNet, floating point operations are required to apply the floating point scaling factor after the ternary convolution. Our fixed point quantization uses only integer operations, which require a smaller hardware footprint and use less power than floating point operations. Finally, TernaryNet uses Tanh instead of ReLU for the activations. Using the hyperbolic tangent as an activation function increases training time [NIPS2012_4824] and execution time at inference. To verify this, we evaluated the performance of ReLU and Tanh in a simple neural network with 3 fully connected layers. We used Intel's OpenVino [deannedeuermeyerandreyz.amyr.fritzb.2019] inference engine together with high performance gemm_blas and avx2 instructions. Table 2 illustrates that using ReLU instead of Tanh at training and inference can increase performance by up to 8 times. These results can be extended to UNet, since activation inference time is only a function of the input size. To compensate for the computation time, TernaryNet implements an efficient ternary convolution that gains up to 8 times in performance.
At inference, an efficient Tanh function can be implemented that uses only two comparators to perform Tanh for ternary values. Considering accuracy, when Tanh is used as an activation function, the full precision accuracy is lower compared to ReLU [DBLP:journals/corr/abs180109449]. We observe similar behavior in the results reported in Table 1. Our fixed point quantizer provides a flexible trade-off between accuracy and memory, which makes it a practical solution for current CPUs and GPUs, does not require floating-point operations, and leverages the more efficient ReLU function. As opposed to BNN and TernaryNet quantizations, Table 1 shows that our approach to the quantization of UNet provides consistent results over 3 different datasets.
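A two-comparator activation of the kind mentioned above might look like the sketch below. It is not taken from the TernaryNet implementation: the function name `ternary_tanh` and the threshold of 0.5 are hypothetical, chosen only to show that a ternary output needs nothing more than two comparisons.

```python
def ternary_tanh(x, threshold=0.5):
    """Map x to -1, 0 or +1 with two comparisons (the threshold is an assumed value)."""
    if x > threshold:
        return 1
    if x < -threshold:
        return -1
    return 0

def relu(x):
    """ReLU, which needs a single comparison against zero."""
    return x if x > 0 else 0

for value in (-0.9, -0.2, 0.1, 0.7):
    print(value, ternary_tanh(value), relu(value))
```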
Quantization                    Parameter Size   EM Dataset       GM Dataset       NIH Pancreas
Activation      Weight                           ReLU    Tanh     ReLU    Tanh
                                                 Dice Score
Full Precision                  18.48 MBytes     94.05   93.02    56.32   56.26    75.69
Q8.8            Q8.8            9.23 MBytes      92.02   91.08    56.11   56.01    74.61
Q8.0            Q0.8            4.61 MBytes      92.21   88.42    56.10   53.78    73.05
Q6.0            Q0.4            2.31 MBytes      91.03   90.93    55.85   52.34    73.48
Q4.0            Q0.2            1.15 MBytes      79.80   54.23    51.80   48.23    71.77
BNN [NIPS20155647]              0.56 MBytes      78.53   -        31.44   -        72.56
TernaryNet [DBLP:journals/corr/abs180109449]  1.15 MBytes  -  82.66    -       43.02    73.9
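The parameter sizes in Table 1 can be approximately reproduced from the total parameter count listed in S.2. The sketch below assumes that every parameter is stored with the same number of bits (integer plus fractional) and ignores any layers kept in full precision, so small differences from the tabulated values are to be expected. The helper name `model_size_mib` is hypothetical.

```python
TOTAL_PARAMS = 4_837_249  # "Total params" from the S.2 layer listing

def model_size_mib(num_params, bits_per_weight):
    """Storage needed for the parameters, in MiB."""
    return num_params * bits_per_weight / 8 / 2 ** 20

configs = [("Full precision", 32), ("Q8.8", 16), ("Q0.8", 8), ("Q0.4", 4), ("Q0.2", 2)]
for name, bits in configs:
    print(f"{name:15s} {model_size_mib(TOTAL_PARAMS, bits):6.2f} MiB")
```

Going from 32 bits to 4 bits per weight gives the factor of 8 quoted for ${Q}^{w}0.4$.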
5 Conclusion
In this work, we proposed a fixed point quantization method for the UNet architecture and evaluated it on the medical image segmentation task. We report quantization results on three different semantic segmentation datasets and show that our fixed point quantization produces more accurate and more consistent results across these datasets than other quantization techniques. We also demonstrate that using Tanh as the activation function reduces the baseline accuracy and adds computational complexity in both training and inference. Our proposed fixed point quantization technique provides a trade-off between accuracy and the required memory, does not require floating point computation, and is more suitable for the currently available CPU and GPU hardware.
References
Supplementary Information for
UNet FixedPoint Quantization for Medical Image Segmentation
S.1 Weight Visualization of FullPrecision UNet
S.2 Model Architecture
Layer (type)            Output Shape          Param #
================================================================
Conv2d-1                [64, 200, 200]        640
BatchNorm2d-2           [64, 200, 200]        128
QuantLayer-3            [64, 200, 200]        0
Conv2d-4                [64, 200, 200]        36,928
BatchNorm2d-5           [64, 200, 200]        128
QuantLayer-6            [64, 200, 200]        0
DownConv-7              [64, 200, 200]        0
MaxPool2d-8             [64, 100, 100]        0
Conv2dQuant-9           [128, 100, 100]       73,856
BatchNorm2d-10          [128, 100, 100]       256
QuantLayer-11           [128, 100, 100]       0
Conv2dQuant-12          [128, 100, 100]       147,584
BatchNorm2d-13          [128, 100, 100]       256
QuantLayer-14           [128, 100, 100]       0
DownConv-15             [128, 100, 100]       0
MaxPool2d-16            [128, 50, 50]         0
Conv2dQuant-17          [256, 50, 50]         295,168
BatchNorm2d-18          [256, 50, 50]         512
QuantLayer-19           [256, 50, 50]         0
Conv2dQuant-20          [256, 50, 50]         590,080
BatchNorm2d-21          [256, 50, 50]         512
QuantLayer-22           [256, 50, 50]         0
DownConv-23             [256, 50, 50]         0
MaxPool2d-24            [256, 25, 25]         0
Conv2dQuant-25          [256, 25, 25]         590,080
BatchNorm2d-26          [256, 25, 25]         512
QuantLayer-27           [256, 25, 25]         0
Conv2dQuant-28          [256, 25, 25]         590,080
BatchNorm2d-29          [256, 25, 25]         512
QuantLayer-30           [256, 25, 25]         0
DownConv-31             [256, 25, 25]         0
Upsample-32             [256, 50, 50]         0
Conv2dQuant-33          [256, 50, 50]         1,179,904
BatchNorm2d-34          [256, 50, 50]         512
QuantLayer-35           [256, 50, 50]         0
Conv2dQuant-36          [256, 50, 50]         590,080
BatchNorm2d-37          [256, 50, 50]         512
QuantLayer-38           [256, 50, 50]         0
DownConv-39             [256, 50, 50]         0
UpConv-40               [256, 50, 50]         0
Upsample-41             [256, 100, 100]       0
Conv2dQuant-42          [128, 100, 100]       442,496
BatchNorm2d-43          [128, 100, 100]       256
QuantLayer-44           [128, 100, 100]       0
Conv2dQuant-45          [128, 100, 100]       147,584
BatchNorm2d-46          [128, 100, 100]       256
QuantLayer-47           [128, 100, 100]       0
DownConv-48             [128, 100, 100]       0
UpConv-49               [128, 100, 100]       0
Upsample-50             [128, 200, 200]       0
Conv2dQuant-51          [64, 200, 200]        110,656
BatchNorm2d-52          [64, 200, 200]        128
QuantLayer-53           [64, 200, 200]        0
Conv2dQuant-54          [64, 200, 200]        36,928
BatchNorm2d-55          [64, 200, 200]        128
QuantLayer-56           [64, 200, 200]        0
DownConv-57             [64, 200, 200]        0
UpConv-58               [64, 200, 200]        0
Conv2d-59               [1, 200, 200]         577
================================================================
Total params: 4,837,249
Trainable params: 4,837,249
Non-trainable params: 0
Input size (MB): 0.15
Forward/backward pass size (MB): 593.57
Params size (MB): 18.45
Estimated Total Size (MB): 612.17