Left Ventricle Quantification Using Direct Regression with Segmentation Regularization and Ensembles of Pretrained 2D and 3D CNNs

  • 2019-08-12 14:46:00
  • Nils Gessert, Alexander Schlaefer
  • 1


Cardiac left ventricle (LV) quantification provides a tool for diagnosingcardiac diseases. Automatic calculation of all relevant LV indices from cardiacMR images is an intricate task due to large variations among patients anddeformation during the cardiac cycle. Typical methods are based on segmentationof the myocardium or direct regression from MR images. To consider cardiacmotion and deformation, recurrent neural networks and spatio-temporalconvolutional neural networks (CNNs) have been proposed. We study an approachcombining state-of-the-art models and emphasizing transfer learning to accountfor the small dataset provided for the LVQuan19 challenge. We compare 2Dspatial and 3D spatio-temporal CNNs for LV indices regression and cardiac phaseclassification. To incorporate segmentation information, we propose anarchitecture-independent segmentation-based regularization. To improve therobustness further, we employ a search scheme that identifies the optimalensemble from a set of architecture variants. Evaluating on the LVQuan19Challenge training dataset with 5-fold cross-validation, we achieve meanabsolute errors of 111 +- 76mm^2, 1.84 +- 0.9mm and 1.22 +- 0.6mm for area,dimension and regional wall thickness regression, respectively. The error ratefor cardiac phase classification is 6.7%.


Quick Read (beta)

Left Ventricle Quantification Using Direct Regression with Segmentation Regularization and Ensembles of Pretrained 2D and 3D CNNs

Nils Gessert 11Institute of Medical Technology, Hamburg University of Technology, Hamburg, Germany

[email protected]
   Alexander Schlaefer 11Institute of Medical Technology, Hamburg University of Technology, Hamburg, Germany

[email protected]

Cardiac left ventricle (LV) quantification provides a tool for diagnosing cardiac diseases. Automatic calculation of all relevant LV indices from cardiac MR images is an intricate task due to large variations among patients and deformation during the cardiac cycle. Typical methods are based on segmentation of the myocardium or direct regression from MR images. To consider cardiac motion and deformation, recurrent neural networks and spatio-temporal convolutional neural networks (CNNs) have been proposed. We study an approach combining state-of-the-art models and emphasizing transfer learning to account for the small dataset provided for the LVQuan19 challenge. We compare 2D spatial and 3D spatio-temporal CNNs for LV indices regression and cardiac phase classification. To incorporate segmentation information, we propose an architecture-independent segmentation-based regularization. To improve the robustness further, we employ a search scheme that identifies the optimal ensemble from a set of architecture variants. Evaluating on the LVQuan19 Challenge training dataset with 5-fold cross-validation, we achieve mean absolute errors of 111±76mm2, 1.84±0.9mm and 1.22±0.6mm for area, dimension and regional wall thickness regression, respectively. The error rate for cardiac phase classification is 6.7%.

Left Ventricle Quantification, Transfer Learning, Regression, Regularization

1 Introduction

Left ventricle (LV) quantification from cardiac magnetic resonance imaging (MRI) data is often employed for assessment of cardiac function and for diagnosing diseases [7]. The relevant LV indices include the myocardium and cavity area, three LV cavity dimensions, six regional wall thickness (RWT) parameters and the cardiac phase (systole and diastole). In practice, LV indices are usually obtained by manual segmentation of the myocardium which is time-consuming and associated with a high intra- and inter-observer variability [12].

In recent years, a lot of work has gone into automatic LV indices estimation which is challenging due to high variability of cardiac structure between patients and deformation during the cardiac cycle. To overcome these problems, deep learning methods have been employed as they have shown success for a variety of image-based learning problems. One approach is to segment the myocardium with a convolutional neural network (CNN) and calculate relevant metrics afterwards [1, 10]. Alternatively, LV indices can be regressed directly from the images [17, 16, 8]. Other methods have combined segmentation and regression, e.g., by regressing indices from a segmentation with an end-to-end model [13] or by adding a regression path to a segmentation model [15].

Besides incorporating segmentation and direct estimation, handling temporal dependencies is important as well. Often, temporal dependencies are modeled using recurrent neural networks which have also been employed for LV quantification [16]. Another approach is to utilize spatio-temporal 3D CNNs to capture the relation between temporal slices [6].

In this work, we describe a new approach for LV quantification in the context of the LVQuan19 Challenge. In contrast to a majority of previous work, the associated dataset is significantly smaller (56 patients) and the MRI images are hardly preprocessed with a high spatial resolution but without any region of interest (ROI) cropping. Thus, an algorithm needs to deal with small dataset size and make use of the high image resolution while focusing on the relevant region in the image.

To address these challenges, we take previous approaches into account while putting a strong focus on using pretrained CNNs. Using transfer learning from ImageNet has been successful for a lot of medical imaging modalities, particularly when data is as scarce as in our case [11, 9]. Models trained for the common problem of image classification can be adapted for regression by replacing the output layer. Thus, we perform direct LV indices regression using various pretrained CNNs. For temporal processing, we employ spatio-temporal 3D CNNs. We enable effective training of these parameter-intensive 3D CNNs with an initialization strategy where we assign pretrained 2D weights to 3D kernels. We address high spatial image resolution paired with uncertain ROIs by using a multi-crop evaluation strategy that covers the entire image. To incorporate segmentation information into predefined models, we propose an architecture-independent regularization by adding a decoder to the model. Finally, we integrate all our models with a new ensembling approach where we automatically select the best performing ensemble for each regression and classification task.

2 Methods

2.1 Dataset and Preprocessing

The LVQuan19 training dataset consists of short-axis MRI data from 56 patients. For each patient 20 slices representing one cardiac cycle are provided. The resolution of the MRI slices is either 256×256 or 512×512 with a pixel spacing between 0.6836mmpixel and 1.7188mmpixel. For each slice, segmentation masks of the myocardium and cavity area, all 11 LV indices and the cardiac phase are provided. The LVQuan19 Challenge goal is the estimation of all 11 indices and the cardiac phase, the segmentation masks can be optionally used.

First, we resize all images to have a pixel spacing of 1mmpixel. Second, we take a center crop of size 300×300 which is the smallest image size of all resized slices. We clip intensities between the 1st and 99th percentile. Afterwards, we perform image normalization by subtracting the mean and dividing by the standard deviation for each slice. Then, the intensities are scaled between 0 and 1. We convert the target indices to mm. Then, we scale all regression targets to a range of 0-1 for training. For evaluation, the targets are scaled back to their original range. We split the dataset into 5 cross-validation (CV) folds.

2.2 Models

Figure 1: Overview of our approach. We use both 2D CNNs with 2D slices (top) and 3D CNNs with temporally stacked slices (bottom). Both are initialized with pretrained weights from ImageNet. The initial and Conv2D/Conv3D blocks have a different structure based on the respective architecture.

2D CNN Approaches. The key idea of our approach is to use pretrained architectures for indices regression. Thus, we consider a pool of pretrained architectures including Densenet (DN) [5], Resnet (RN) [3], Resnext (RX) [14] and Squeeze-and-Excitation Networks (SE) [4]. The overall approach is shown in Figure 1.

Each CNN was pretrained on ImageNet for classification of natural images into 1000 classes. We replace the model’s output layer to match the number of outputs for our problem. We consider regression only with 11 outputs, classification (cardiac phase) with 2 softmaxed outputs or both simultaneously with 13 outputs. The pretrained model expects a 3-channel input image which we handle with two approaches. Either, we use a single slice, copied to all three channels or we include the two neighboring slices.

3D CNN Approaches. We also consider spatio-temporal 3D CNN approaches. We hypothesize that temporal context might improve indices regression. Furthermore, cardiac phase classification requires temporal context and including more slices might be helpful. The 3D CNN input is of size H×W×NS where H and W are the spatial slice dimensions and NS[1,20] is the number of selected slices. We also employ pretrained models to tackle small dataset size. Thus, we reuse the same pool of 2D CNNs and extend them to 3D by replacing all 2D operations by their 3D counterpart. The 3D convolutional kernels are initialized by copying the 2D kernels pretrained on ImageNet of size hc×wc several times into the new kernel of size hc×wc×dc. To ensure consistent value ranges, the new kernels’ weights are multiplied by 1/dc. Throughout the entire network, we do not change the slice dimension, i.e., we produce NS predictions for NS input slices in a single forward pass. For this purpose, we replace the linear output layer by a convolutional layer with kernel size 1 which is able to handle arbitrarily-sized inputs. Due to increase in memory requirements, we only consider 3D variants of the smaller CNNs in our pool which includes Densenet121, Densenet161, and SE-Resnet101.

Segmentation Regularization. In addition, we propose a segmentation-based regularization (SR). Here, we add an additional decoder to the architecture, before its global average pooling (GAP) layer. The decoder upsamples the spatial dimension of the feature maps in several steps until the original input image size is reached. At each step, we apply a convolution with kernel size 1 which halves the current feature map dimension, followed by nearest neighbor upsampling with a factor of 2. Then, a standard Resnet block is applied. In total, there are 5 upsampling stages. At the output, we predict a softmaxed probability map which is used to calculate a cross-entropy loss with the ground-truth masks. During training, the loss is propagated through the entire network, forcing the core architecture to learn features for segmentation and indices regression simultaneously. We do not use the predicted segmentation explicitly for indices calculation, it only serves as a regularizer for the network. We employ this regularization strategy both for 2D and 3D CNNs.

Data Augmentation. Due to the small dataset size, we employ extensive data augmentation. We use random rotation with θ[0°,360°] which we found more effective than simple 90° rotations. Furthermore, we employ random scaling by resizing the slices by a factor of sc[0.8,1.2] with appropriate cropping and zero padding, if required. The targets are scaled accordingly (quadratic for areas). We chose a small batch size of b=8 to induce more variation during training. We did not see any improvement for dropout, L1 or L2 regularization.

Model Input Strategy. The pretrained models’ standard input size is 224×224 while our preprocessed images are of size 300×300. Therefore, we crop patches from random image locations during training. This should have a regularizing effect as the CNN gets more robust towards different relative LV locations in the images. For 2D CNNs, we randomly select a cropped slice from b different patients to construct a batch of size b. For 3D CNNs, we randomly select a sequence of NS slices from b patients for each batch. For evaluation, we follow a multi-crop evaluation approach to cover the entire image [2]. We crop from 4 predefined locations in each slice and average the result. For 2D CNNs, this is repeated for every slice for each patient. For 3D CNNs, we crop sequences of NS slices from every possible location ns[1,20]. Then, we average over all overlapping regions to obtain a final prediction for each slice. To handle the start and end of the sequence, we use cyclic repetition.

Training. In total, our loss function consists of the mean squared error (MSE) for regression, the cross-entropy (CE) loss for phase classification and the CE loss for segmentation. The two CE losses are not present in every model. If they are included, phase classification is weighted by λP=0.05 and the segmentation loss is weighted by λS=0.1. The final loss is the sum of all individual loss functions. For optimization we employ Adam with a learning rate of lr=1×10-4. We train each CV model for 150 epochs where a single epoch consists of 10 random crops from each patient in the training set. Overall, we train multiple models with different configurations si. The configuration options include the network architecture (e.g. Densenet121), the input dimension (2D or 3D), NS{3,5,7,10} for 3D CNNs, weight initialization (random or ImageNet), segmentation-based regularization (on or off) and prediction targets (areas, dimensions, RWTs, phase).

Ensembling. Instead of deciding for a single model we search for the optimal ensemble. Let S={s1,,sn} be the set of all configurations we consider, and let V={v1,,vm} be the set of all CV splits. Then we obtain predictions y^ij=f(si,vj) for all combinations of i and j after training configuration si on Vvj. We obtain the predictions per configuration through concatenation as y^i=jy^ij. Subsequently we perform an exhaustive search to identify the set S*S such that y^*=1|S*|kS*y^k minimizes the error between y^* and ground-truth y. For the challenge test set we obtain predictions with all models in the optimal subset S* which are subsequently averaged into a final prediction. We search for the optimal subset for each task (areas, dimensions, RWTs, phase) individually to maximize performance. To keep the search time bounded, S only includes our top 20 configurations, ranked by individual CV performance.

Our proposed strategy potentially leads to implicit overfitting of the subset choice to the CV sets. Therefore, results reported for the CV sets might overestimate the performance gain of our ensembling strategy. We overcome this problem by introducing an additional test split for ensemble evaluation only. Here, we split each CV fold into two folds. We use the first portion of these new sets to find the optimal subset with our strategy. Then, we use the second portion of sets to evaluate the strategy.

Table 1: All results for different configurations. We consider the mean absolute error (MAE) with standard deviation in mm (mm2 for areas) and Pearson correlation coefficient (PCC) for regression and the error rate (ER) for classification. Configurations include no pretraining (nopre), segmentation regularization (SR), joint indices regression and phase classification (Joint) and phase classification only (Class.). For ensembling, we consider taking the average over all models (Average) and our new strategy (Optimal). Results marked with a star (*) are evaluated on a different test split, see Section 2.2. We use models based on Densenet (DN), Resnet (RN) and Resnext (RX).
Configuration Areas Dimensions RWTs Phase
DN121 2D nopre 199±129 0.935 2.98±1.8 0.945 1.55±0.9 0.770 -
DN121 2D 139±74 0.972 2.38±1.3 0.967 1.33±0.7 0.835 -
DN121 2D SR 𝟏𝟑𝟑±𝟕𝟔 0.974 2.08±1.2 0.975 1.30±0.7 0.847 -
DN121 2D Joint 161±89 0.960 2.54±1.3 0.963 1.39±0.7 0.823 9.0
DN121 2D Class. - - - - - - 8.4
DN121 3D nopre 180±165 0.940 2.77±2.4 0.943 1.48±0.8 0.798 -
DN121 3D 133±82 0.971 2.26±1.3 0.968 1.32±0.7 0.838 -
DN121 3D SR 𝟏𝟐𝟔±𝟕𝟏 0.975 2.14±1.2 0.972 1.30±0.8 0.844 -
DN121 3D Joint 146±74 0.966 2.33±1.2 0.969 1.43±0.8 0.812 7.9
DN121 3D Class. - - - - - - 7.5
DN169 2D SR 122±72 0.976 1.99±1.2 0.976 1.27±0.7 0.853 -
DN161 2D SR 127±81 0.971 2.00±1.1 0.978 1.30±0.7 0.843 -
SE-RN101 2D SR 126±70 0.974 2.01±1.1 0.977 1.32±0.8 0.844 -
SE-RN152 2D SR 124±81 0.974 2.07±1.2 0.975 1.29±0.7 0.849 -
RX101-64d 2D SR 𝟏𝟏𝟖±𝟕𝟐 0.976 1.99±1.0 0.978 1.25±0.7 0.859 -
SE-RX101 2D SR 135±80 0.971 2.23±1.2 0.973 1.33±0.8 0.839 -
SENet154 2D SR 129±81 0.973 2.12±1.3 0.973 1.31±0.8 0.846 -
Ensemble Average 118±76 0.977 1.96±1.1 0.978 1.26±0.6 0.856 7.7
Ensemble Optimal 𝟏𝟏𝟏±𝟕𝟔 0.979 1.84±0.9 0.980 1.22±0.6 0.864 6.7
Ensemble Average* 117±75 0.978 1.96±1.0 0.978 1.23±0.6 0.860 8.3
Ensemble Optimal* 111±75 0.979 1.85±1.0 0.980 1.19±0.6 0.871 7.0
Ensemble Testset 371 0.925 3.02 0.957 2.53 0.826 11.5
DMTRL [16] 180±118 0.945 2.51±1.6 0.925 1.39±0.7 0.768 8.2

3 Results

All results are shown in Table 1. For Densenet121 (DN121) with 2D slice inputs, the pretrained model with segmentation regularization performs best. The difference in the median of the absolute errors of DN121 2D and DN121 2D SR is significant for areas and dimensions (Wilcoxon signed-rank test, α = 5% significance level). Combining regression and classification leads to a lower performance than training separate models. For DN121 in its 3D version (NS=5), the overall performance improves slightly with respect to its 2D counterparts. Again, the difference in the median of the absolute errors of the 2D and 3D model is significant for areas and dimensions.

Considering different architectures, improved performance can be observed for larger models. The best performing model is RX101-64d. With respect to ensembling, taking the average over all our models does not perform better than the best single model. Using our optimal subset strategy, performance is improved. When evaluating on a different test split (*), the performce difference is still large between averaging and our strategy. The difference in the median of the absolute errors for averging and our ensembling strategy is statically significant for all three indices. Our final ensembles mostly contain the models RX101-64d 2D SR, DN 2D SR variants, and DN121 3D SR. On the test set of the LVQuan19 Challenge, the performance of the ensemble is substantially lower. For reference, we include the results from DMTRL [16]. Note that these results are not directly comparable, as a different number of patients and different image resolutions were used.

4 Discussion and Conclusion

We address LV quantification from cardiac MRI images with a focus on utilizing pretrained models. We consider a variety of deep learning models that have been successful for classification of 2D images. We adopt these models by replacing the output layer and performing direct LV indices regression from 2D MRI slices. We find a substantial increase in performance with pretrained weights, e.g., the MAE for area estimation improves from 199mm2 to 134mm2. This is likely tied to the new and small dataset which contains only roughly a third of the number of patients compared to most previous studies [16, 13]. At the same time, the MRI images have a higher resolution which is closer to the standard input resolution of models trained on ImageNet. Both likely lead to a substantial advantage of utilizing pretrained models.

To incorporate previous segmentation-based approaches, we propose a segmentation regularization by adding an architecture-independent decoder close to the model output. The additional segmentation loss forces the model core to learn both features for direct indices regression and myocardium segmentation. Our results indicate that including the segmentation is advantageous as we observe a statisticially significant increase in performance both for areas and dimensions. This matches insights from previous work where using both a segmentation mask and LV indices lead to improved results [15, 13].

Furthermore, we address temporal dependencies by extending the existing 2D CNN models to 3D. To enable 3D CNN usage with very limited data, we use an initialization strategy where we copy pretrained 2D weights to 3D kernels. Again, we find a substantial increase in performance by relying on pretraining, see Table 1. There is a significant improvement for dimension and area regression for 3D CNNs over 2D CNNs with an MAE of 139mm2 compared to 133mm2 for the areas. For the cardiac phase it is notable that the 2D approach with neighboring slices performs reasonably well and only slightly worse than 3D CNNs. Overall, enabled by our initialization strategy, spatio-temporal 3D CNNs improve performance over spatial 2D CNNs for LV quantification.

Next, we consider different baseline architectures for our approach. Using larger architectures with more layers and/or more feature maps tends to improve performance over the DN121 baseline. In particular, it is notable that the highest performance increase among different models is substantially larger than the performance increase of moving from 2D to 3D. RX101-64d 2D improves the MAE by 16mm2 over DN121 compared to a 6mm2 decrease caused by using 3D convolutions. In the best case, one would extend the best 2D model to 3D for performance maximization, however, this is limited by GPU memory and not feasible for the larger, higher performing 2D models. Summarized, using high-performing 2D architectures can be very beneficial for LV quantification.

Last, we combine all our models with a new ensembling technique where the best performing models were automatically selected based on cross-validation performance. The method improves performance over simply averaging predictions across all models. Also, we used separate test splits to ensure that the optimal subset selection does not implicitly overfit to the CV sets. Even for this evaluation scenario our ensembling method performs better than averaging with statistically significant performance differences. Interestingly, the selection method included both 2D and 3D CNNs, which indicates that both spatial and spatio-temporal information is important for LV indices regression. On the LVQuan19 test set, our method performs substantially worse than in our CV experiments. This indicates that the test set is very challenging and potentially differs from the training set. Similar observations were made for the last year’s challenge [15]. Thus, generalizable LV quantification remains a challenging task. Future work could incorporate our approach into other frameworks, e.g., by considering multi-task relationship learning [16] or recurrent models for temporal processing.


  • [1] Avendi, M., Kheradvar, A., Jafarkhani, H. (2016) A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med Image Anal 30, 108–119
  • [2] Gessert, N., Sentker, T., Madesta, F., Schmitz, R., Kniep, H., Baltruschat, I., Werner, R., Schlaefer, A. (2019) Skin Lesion Classification Using CNNs with Patch-Based Attention and Diagnosis-Guided Loss Weighting. IEEE Trans Biomed Eng
  • [3] He, K., Zhang, X., Ren, S., Sun, J. (2016) Deep residual learning for image recognition. In: CVPR, pp. 770–778
  • [4] Hu, J., Shen, L., Sun, G. (2018) Squeeze-and-excitation networks. In: CVPR, pp. 7132–7141
  • [5] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q. (2017) Densely connected convolutional networks. In: CVPR, pp. 4700–4708
  • [6] Jang, Y., Kim, S., Shim, H., Chang, H.J. (2018) Full Quantification of Left Ventricle Using Deep Multitask Network with Combination of 2D and 3D Convolution on 2D+ t Cine MRI. In: International Workshop on Statistical Atlases and Computational Models of the Heart, pp. 476–483. Springer
  • [7] Karamitsos, T.D., Francis, J.M., Myerson, S., Selvanayagam, J.B., Neubauer, S. (2009) The role of cardiovascular magnetic resonance imaging in heart failure. J Am Coll Cardiol 54(15), 1407–1424
  • [8] Li, J., Hu, Z. (2018) Left Ventricle Full Quantification using Deep Layer Aggregation based Multitask Relationship Learning. In: International Workshop on Statistical Atlases and Computational Models of the Heart, pp. 381–388. Springer
  • [9] N. Gessert and M. Lutz and M. Heyder and S. Latus and D. M. Leistner and Y. S. Abdelwahed and A. Schlaefer (2019) Automatic Plaque Detection in IVOCT Pullbacks Using Convolutional Neural Networks. IEEE Transactions on Medical Imaging 38(2), 426–434
  • [10] Ngo, T.A., Lu, Z., Carneiro, G. (2017) Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Med Image Anal 35, 159–171
  • [11] Shin, H.C., Roth, H.R., Gao, M., Lu, L., Xu, Z., Nogues, I., Yao, J., Mollura, D., Summers, R.M. (2016) Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imag 35(5), 1285–1298
  • [12] Suinesiaputra, A., Bluemke, D.A., Cowan, B.R., Friedrich, M.G., Kramer, C.M., Kwong, R., Plein, S., Schulz-Menger, J., Westenberg, J.J., Young, A.A., et al. (2015) Quantification of LV function and mass by cardiovascular magnetic resonance: multi-center variability and consensus contours. J Cardiovasc Magn Reson 17(1), 63
  • [13] Wang, W., Wanga, Y., Wu, Y., Lin, T., Li, S., Chen, B. (2019) Quantification of Full Left Ventricular metrics via Deep Regression Learning with Contour-guidance. IEEE Access
  • [14] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K. (2017) Aggregated residual transformations for deep neural networks. In: CVPR, pp. 1492–1500
  • [15] Xu, H., Schneider, J.E., Grau, V. (2018) Calculation of Anatomical and Functional Metrics Using Deep Learning in Cardiac MRI: Comparison Between Direct and Segmentation-Based Estimation. In: International Workshop on Statistical Atlases and Computational Models of the Heart, pp. 402–411. Springer
  • [16] Xue, W., Brahm, G., Pandey, S., Leung, S., Li, S. (2018) Full left ventricle quantification via deep multitask relationships learning. Med Image Anal 43, 54–65
  • [17] Xue, W., Lum, A., Mercado, A., Landis, M., Warrington, J., Li, S. (2017) Full quantification of left ventricle via deep multitask learning network respecting intra-and inter-task relatedness. In: MICCAI, pp. 276–284. Springer