Lossy Image Compression with Recurrent Neural Networks: from Human Perceived Visual Quality to Classification Accuracy

  • 2019-10-08 15:43:29
  • Maurice Weber, Cedric Renggli, Helmut Grabner, Ce Zhang

Abstract

Deep neural networks have recently advanced the state-of-the-art in image compression and surpassed many traditional compression algorithms. The training of such networks involves carefully trading off entropy of the latent representation against reconstruction quality. The term quality crucially depends on the observer of the images which, in the vast majority of literature, is assumed to be human. In this paper, we go beyond this notion of quality and look at human visual perception and machine perception simultaneously. To that end, we propose a family of loss functions that allows to optimize deep image compression depending on the observer and to interpolate between human perceived visual quality and classification accuracy. Our experiments show that our proposed training objectives result in compression systems that, when trained with machine friendly loss, preserve accuracy much better than the traditional codecs BPG, WebP and JPEG, without requiring fine-tuning of inference algorithms on decoded images and independent of the classifier architecture. At the same time, when using the human friendly loss, we achieve competitive performance in terms of MS-SSIM.

 


Maurice Weber
Department of Computer Science, ETH Zürich, Switzerland
Cedric Renggli
Helmut Grabner
ZHAW School of Engineering, Switzerland
Ce Zhang

1 Introduction

(a) Image panels: Original, BPG, RNN-H, RNN-C.

(b) MS-SSIM vs. classification accuracy. Bitrate 0.12 bpp.

| Method | Human Perception (MS-SSIM) | Classification (Val. Accuracy) |
|---|---|---|
| BPG compression [5] | 0.894 | 0.644 |
| Ours, Human optimized | 0.906 | 0.619 |
| Ours, Classification optimized | 0.794 | 0.724 |
| Uncompressed | - | 0.803 |

Figure 1: Images in (a) are compressed at 0.125 bpp. Accuracy values in table (b) are evaluated on ImageNet-1K with off-the-shelf Inception-ResNet-V2, MS-SSIM on Kodak. Our method successfully trades off human perception against classification accuracy.

Image compression algorithms aim at finding representations of images that use as little storage – measured in bits – as possible. In contrast to lossless image compression, where the goal is to achieve a high compression rate while requiring perfect reconstruction, lossy image compression enables higher compression rates by allowing for a loss in reconstruction quality. Recently, image compression based on deep neural networks (DNNs) has achieved remarkable results in both lossless [33] and lossy image compression [2, 4, 32, 35, 41, 43], outperforming many traditional codecs. One distinct advantage of such methods is their adaptability both to target domains, such as medical images, and to training objectives. This allows a flexible interpretation of the term quality and ultimately leads to more efficient representations for these tasks. Defining a notion of reconstruction quality is thus a key challenge which crucially depends on the observer of the compressed images. Previous research in lossy image compression expressed quality largely in terms of human visual perception and optimized for the human visual system (HVS), using distortion measures such as multiscale structural similarity [48] (MS-SSIM) or mean squared error (MSE) as training objectives. However, due to recent advances in computer vision systems, an increasing number of images are observed solely by machines and bypass humans. Consequently, a natural question that arises is whether or not there exists a relation between quality perceived by humans and machines, and if so, how can we trade off quality between different types of observers? In other words, is a compression system optimized for the human observer also optimal for machines? Or can we leverage certain properties attributed to modern computer vision systems and optimize explicitly for machines?
We investigate these questions by specifically looking at classification of natural images as one of the most well studied tasks in computer vision. The training of modern classifiers is typically a costly and time-consuming undertaking and, at the same time, parameters of the best performing classifiers are often made publicly available. With that in mind, we are interested in a compression system that generalizes well in the following sense. On the one hand, we want to compress images such that no further fine-tuning of classifiers on compressed images is required and such that the representations are agnostic to classifier architectures. This enables the use of off-the-shelf classifiers for inference on compressed images. On the other hand, the compression system should also generalize well to other visual tasks such as fine-grained categorization of natural images.

Our proposed method for classification oriented compression relies on a feature reconstruction loss using deep features extracted from the hidden layers of a convolutional neural network trained for image classification. This type of loss function has previously been used as training objective for super-resolution [8, 21, 27], style-transfer [15, 21] and texture synthesis [14], and showed remarkable results compared to other objectives such as MSE. In order to optimize for human visual perception, we make use of MS-SSIM as a measure of quality perceived by humans, since it has been reported to correlate better with human perception than MSE. In order to investigate the trade-off between MS-SSIM and classification accuracy, we look at the convex combination of the two objectives.

In summary, the contributions of our work are threefold:

  • We propose a training objective for deep image compression algorithms that is optimized for subsequent classification by off-the-shelf CNNs, consistently outperforming human optimized compression on three standard image classification datasets in terms of preserved classification accuracy.

  • By looking at the convex combination between human and machine friendly loss, we present a simple way to trade off compression quality in terms of human perception against image classification. Since we only rely on the training objective, our method can be integrated into any learned lossy image compression system.

  • We show how improved compression quality for the human observer comes at the cost of degraded classification accuracy, and vice versa.

2 Related work

Deep image compression

Image compression using DNNs has recently become an active area of research. The most popular types of architectures used for image compression are based on autoencoders [2, 4, 32, 41] and recurrent neural networks [22, 42, 43] (RNNs). Typically, the networks are trained in an end-to-end manner to minimize a pixel-wise notion of distortion such as MSE, MS-SSIM or L1-distances between original and decoded image. Other works make use of adversarial training [35, 37, 45] and Wasserstein distances [45].

Compression for computer vision

Image compression in combination with other computer vision tasks has been studied before in a number of recent works. Liu et al. [29] propose an image compression framework based on JPEG that is favorable to DNN classifiers. Also starting from an engineered codec, Liu et al. [30] propose a 3D image compression framework based on JPEG2000 which is tailored to segmentation of 3D medical images. Both works differ from ours in that we look at learned image compression, rather than modifying an engineered one. A few examples exist in the literature where a classifier is learned from features extracted from the encoded representations. Gueguen et al. [17] train a modified ResNet-50 directly on the blockwise discrete cosine transform coefficients from the middle of the JPEG codec. Torfason et al. [44] make use of the compressive autoencoder proposed in [41] and train neural networks for classification and segmentation on the latent (quantized) representations and on the decoded images. These works are orthogonal to ours in that we do not allow training on compressed versions of images. Rather, we train the compression algorithm such that it maintains information relevant for subsequent classification, keeping the classifiers fixed. We furthermore focus on agnosticity to architectures of inference algorithms. Finally, since compression artifacts typically compromise the performance of classifiers, Dodge and Karam [13] study the effect of JPEG compression artifacts on image classification with neural networks.

Feature reconstruction loss

Feature reconstruction objectives make use of deep features extracted from convolutional neural networks. Recent advances in generative modelling have shown that high quality images can be generated using this type of loss function. Gatys et al. [14, 15] apply the idea to style transfer and texture synthesis, while Johnson et al. [21] and Bruna et al. [8] achieve remarkable results in super resolution [21, 8] and style transfer [21]. Ledig et al. [27] further develop the idea and enhance the CNN feature loss with adversarial training to achieve state-of-the-art results in single image super resolution. In the image compression domain, steps in this direction have also been made. Agustsson et al. [3], Santurkar et al. [37] and Liu et al. [28] enhance pixel-wise distortion and adversarial training with a feature reconstruction loss. Furthermore, Chinen et al. [9] propose a similarity metric based on deep features extracted from VGG-16 trained for image classification. These works have in common that their focus is on the human observer, while we exploit properties of feature reconstruction loss in the context of compression geared towards subsequent image classification.

3 Method

In this section we outline our approach to compressing images for human visual perception, classification accuracy and the interpolation between the two. Throughout this paper we adopt the compression architecture proposed by Toderici et al. [43], based on recurrent neural networks. We emphasize that we only focus on the objective functions to account for different types of observers.

3.1 Compression framework

For both human and classification oriented compression, let $\mathcal{X}\subset\mathbb{R}^d$ denote a set of training images. Furthermore, let $\mathcal{Z}$ be a set of discrete quantization levels and $d:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ be a notion of distortion between images. Our goal is to find a compression system consisting of an encoder $E:\mathbb{R}^d\to\mathbb{R}^m$ that maps input images $\mathbf{x}$ to their latent representation $\mathbf{z}=E(\mathbf{x})$, a quantizer $q:\mathbb{R}^m\to\mathcal{Z}^m$ that discretizes $\mathbf{z}$ to $\hat{\mathbf{z}}=q(\mathbf{z})$, and a decoder $D:\mathcal{Z}^m\to\mathbb{R}^d$ that maps the quantized representation back to image space, $\hat{\mathbf{x}}=D(\hat{\mathbf{z}})$. On the one hand, we want the decoded image $\hat{\mathbf{x}}$ to be as similar to its original version as possible, i.e. the distortion $d(\mathbf{x},\hat{\mathbf{x}})$ should be small. On the other hand, we want to represent $\hat{\mathbf{z}}$ efficiently using a small number of bits. Following [4, 32], we use the entropy as a model of compressibility and wish to minimize the rate-distortion trade-off over the training set $\mathcal{X}$

$$\sum_{\mathbf{x}\in\mathcal{X}} d(\mathbf{x},\hat{\mathbf{x}}) + \beta H(\hat{\mathbf{z}}) \qquad (1)$$

where $H$ denotes the entropy, and the scalar $\beta\geq 0$ controls the trade-off between compression rate and distortion. Note that, if $L:=|\mathcal{Z}|<\infty$, then

$$H(\hat{\mathbf{z}}) \leq m\log(L), \qquad (2)$$

so the entropy term is bounded and we may set $\beta=0$ in (1). A proof of inequality (2) can be found in Cover and Thomas ([12] Theorem 2.6.4). Since we use the compression architecture proposed in [43], we note that $\mathcal{Z}=\{-1,+1\}$. We will thus set $\beta=0$ in our experiments.
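The bound in (2) can be illustrated on a toy latent vector. The sketch below is only an illustration under stated assumptions: `binarize` is a hypothetical deterministic sign quantizer standing in for the quantizer of [43], the entropy is a simple plug-in estimate that treats the symbols as i.i.d., and logarithms are taken to base 2 so that the result is in bits.

```python
import math
from collections import Counter


def binarize(z):
    """Hypothetical stand-in quantizer: map each latent value to {-1, +1}."""
    return [1 if v >= 0 else -1 for v in z]


def empirical_entropy_bits(symbols):
    """Plug-in entropy estimate in bits per symbol (assumes i.i.d. symbols)."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


z = [0.7, -0.2, 0.1, -1.3, 0.4, 0.9, -0.5, 0.05]  # toy latent vector, m = 8
z_hat = binarize(z)
m, num_levels = len(z_hat), 2                      # L = |Z| = 2

upper_bound_bits = m * math.log2(num_levels)        # right-hand side of (2)
estimated_bits = m * empirical_entropy_bits(z_hat)  # plug-in estimate of H

print(z_hat)
print(estimated_bits, "<=", upper_bound_bits)
```

With binary levels the bound equals m bits, which is why the code length is fixed by the number of latent entries and the rate term can be dropped from the objective.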

3.2 Optimizing for human visual perception

In order to optimize the compression system for the human observer, we choose a measure of distortion that approximately models human visual perception. The multiscale structural similarity index (MS-SSIM) [48] is based on the assumption that the human eye is adapted for extracting structural information from images and incorporates image details at multiple resolutions. It is furthermore reported to correlate better with human visual perception than MSE. We thus follow [35, 32, 22] and minimize directly

$$d_H(\mathbf{x},\hat{\mathbf{x}}) = 1 - \text{MS-SSIM}(\mathbf{x},\hat{\mathbf{x}}). \qquad (3)$$

We refer to compression optimized with $d_H$ as RNN-H.

3.3 Optimizing for classification

Suppose we are given a CNN classifier $f$ trained on a set of images and labels $(\mathcal{X},\mathcal{Y})$ and corresponding training and validation splits $(\mathcal{X}_{\text{train}},\mathcal{Y}_{\text{train}})$ and $(\mathcal{X}_{\text{val}},\mathcal{Y}_{\text{val}})$. When we optimize compression for classification, we are interested in finding an encoder, quantizer and decoder such that the accuracy evaluated on the decoded validation set $D(q(E(\mathcal{X}_{\text{val}})))$ is maintained as well as possible, without further fine-tuning the classifier on decoded images. Formally, we wish to maximize

$$\left|\left\{\mathbf{x}\in\mathcal{X}_{\text{val}} \;\middle|\; \arg\max_y f(y\mid\mathbf{x}) = \arg\max_y f(y\mid\hat{\mathbf{x}})\right\}\right|. \qquad (4)$$

We are thus not interested in matching decoded and original images on a pixel-wise basis, but rather in preserving features which are relevant for subsequent classification. Image classification is a task which is typically invariant to translations and local deformations (see e.g. [6, 7, 31]), which motivates the use of an objective function with similar properties. For example, using a pixel-wise distortion, such as MSE, which is not invariant to such deformations would be a suboptimal choice. Furthermore, if loss functions have unique minimizers – which is the case for MSE – then, for a fixed encoder, the decoder is biased towards generating the mean $D(\hat{\mathbf{z}})=\mathbb{E}[\mathbf{x}\mid q(E(\mathbf{x}))=\hat{\mathbf{z}}]$. In other words, high frequency information such as textures will tend to get lost in the compression process. While this is less problematic for the HVS, which is more susceptible to low frequency changes, CNN classifiers are sensitive to any change in frequency, as Liu et al. [29] argue.
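The bias towards the conditional mean can be seen on a toy example. The sketch below is an illustration only, not taken from the paper: it assumes two hypothetical one-dimensional "textures" that share the same code, and compares the average MSE obtained by decoding to their mean with the average MSE obtained by decoding to one of the textures.

```python
def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def mean_signal(signals):
    """Element-wise mean, i.e. the conditional mean for a shared code."""
    n = len(signals)
    return [sum(values) / n for values in zip(*signals)]


def average_mse(candidate, signals):
    """Average MSE of one fixed reconstruction against all signals."""
    return sum(mse(candidate, s) for s in signals) / len(signals)


# Two high-frequency toy textures assumed to map to the same quantized code.
textures = [[1.0, -1.0, 1.0, -1.0], [-1.0, 1.0, -1.0, 1.0]]

blurred = mean_signal(textures)                # flat signal, texture is gone
mse_of_mean = average_mse(blurred, textures)
mse_of_sharp = average_mse(textures[0], textures)

print(blurred, mse_of_mean, mse_of_sharp)
```

The flat mean has the lower average MSE even though it contains none of the texture, which is the behaviour the paragraph above describes.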

Features learned by CNNs provide a promising alternative and have been used successfully for image super resolution [8, 21, 27], style transfer [15, 21] and texture synthesis [14]. CNNs trained for object recognition learn a collection of filters that extract a hierarchy of information from image data at different levels of abstraction [26]. They incorporate two crucial aspects in their architecture. Firstly, thanks to rectification and pooling units, CNNs provide stability to small geometric deformations. This is a desirable property for our purpose, since we do not want to put too much emphasis on such deformations. Secondly, they provide features with smaller variance, assuming the input is a locally stationary process – a property inherent to natural images. Bruna and Mallat [7] provide a proof of these properties, in the case where the filters are given by multi-scale wavelets. A further desirable property is that, if features are chosen appropriately, the corresponding reconstruction loss does not have a unique minimizer, thereby alleviating the aforementioned problem of generating the mean and losing relevant high frequency information. These considerations make distortion measures based on CNN features promising candidates for our purpose.

In order to define a distortion measure that incorporates these properties, we fix a CNN classifier $f_L$ trained on a dataset $(\mathcal{X}'',\mathcal{Y}'')$. Denote by $\phi_i$ the responses of the $i$-th convolutional layer after activation and let $\ell$ be a set of such layers. Note that $\ell$ is not required to include all layers. We then define the distortion measure associated with the loss network $f_L$ and layers $\ell$ to be MSE in feature space

$$d_{C,\ell}(\mathbf{x},\hat{\mathbf{x}}) = \sum_{i\in\ell} \beta_i \left\lVert \phi_i(\mathbf{x}) - \phi_i(\hat{\mathbf{x}}) \right\rVert_2^2, \qquad (5)$$

where $\beta_i:=(H_i\times W_i\times C_i)^{-1}$ and $H_i, W_i, C_i$ represent the dimensions (height, width and number of channels) of the corresponding layer. Note that we do not restrict the loss network to be trained on the same dataset as the compression system or the classifier $f$; however, we do require that $\mathcal{X}''\cap\mathcal{X}_{\text{val}}=\emptyset$. Furthermore, the classifier $f$ might have a different underlying architecture than the loss network $f_L$. This formulation allows us to investigate the generalizability of the compression system to new datasets and CNN architectures. Throughout this paper we refer to compression optimized with $d_{C,\ell}$ as RNN-C.
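Equation (5) can be written down in a few lines once the layer activations are available. The sketch below assumes the activations have already been extracted from the loss network and are passed in as plain Python lists; the function name `feature_distortion` and the `(shape, values)` representation are hypothetical choices made for illustration, not the paper's implementation.

```python
def feature_distortion(feats_x, feats_x_hat):
    """Feature-space MSE as in (5).

    Each argument holds one entry per selected layer. An entry is a tuple
    (shape, values) with shape = (H, W, C) and values the flattened
    activations of that layer (H * W * C floats).
    """
    total = 0.0
    for (shape, a), (shape_hat, b) in zip(feats_x, feats_x_hat):
        h, w, c = shape
        if shape != shape_hat or len(a) != h * w * c or len(b) != h * w * c:
            raise ValueError("activations of a layer must have matching shapes")
        beta = 1.0 / (h * w * c)  # layer weight beta_i from (5)
        total += beta * sum((u - v) ** 2 for u, v in zip(a, b))
    return total


# Toy activations for two layers of an original and a decoded image.
feats_original = [((1, 2, 2), [1.0, 2.0, 3.0, 4.0]), ((1, 1, 2), [0.0, 1.0])]
feats_decoded = [((1, 2, 2), [1.0, 2.0, 3.0, 2.0]), ((1, 1, 2), [1.0, 1.0])]

distortion = feature_distortion(feats_original, feats_decoded)
print(distortion)
```

Because each layer is normalized by its own size, a large early layer and a small deep layer contribute on a comparable scale.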

3.4 From human visual perception to classification

In a scenario where images are consumed by both humans and classifiers, we would like to be able to trade off reconstruction quality between the two observers. In other words, we want to have a compressed representation of an image that contains features relevant for classification and looks visually pleasing for the human observer. At the same time, this enables us to investigate the relation between human visual perception and classification accuracy. For that purpose, we consider the convex combination between distortions $d_H$ and $d_{C,\ell}$

$$d_{\alpha,\ell}(\mathbf{x},\hat{\mathbf{x}}) = (1-\alpha)\,\lambda_H\, d_H(\mathbf{x},\hat{\mathbf{x}}) + \alpha\, d_{C,\ell}(\mathbf{x},\hat{\mathbf{x}}) \qquad (6)$$

and control the trade-off with the parameter $\alpha\in[0,1]$. The parameter $\lambda_H$ is a scaling parameter which keeps the two losses on the same order of magnitude and is set to 5’000.
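The interpolation in (6) reduces to a one-line combination once the two distortion values are known. The sketch below assumes that an MS-SSIM value and a feature distortion value have been computed elsewhere; the function name `combined_distortion` and the example numbers are hypothetical.

```python
def combined_distortion(ms_ssim_value, d_c, alpha, lambda_h=5000.0):
    """Convex combination (6) of the human and classification distortions.

    ms_ssim_value: MS-SSIM between original and decoded image, in [0, 1].
    d_c:           feature reconstruction distortion from (5).
    alpha:         trade-off parameter; 0 gives RNN-H, 1 gives RNN-C.
    lambda_h:      scaling of the human term (5000 in the text above).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d_h = 1.0 - ms_ssim_value  # human-oriented distortion from (3)
    return (1.0 - alpha) * lambda_h * d_h + alpha * d_c


# Hypothetical values for a single image.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, combined_distortion(ms_ssim_value=0.9, d_c=300.0, alpha=alpha))
```

At the endpoints the loss reduces to the scaled MS-SSIM term alone or to the feature term alone, so a single training routine can cover every setting of the trade-off parameter.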

4 Experiments

Figure 2: Sample image from the Stanford Dogs dataset. Optimizing for the HVS results in smoother and blurrier images. Classification optimized compression, on the other hand, results in sharp images but suffers from checkerboard-like artifacts stemming from the CNN based loss function. The images are best viewed on a colour screen and magnified.

In this section we experimentally validate our approach to trading off compression quality between classification accuracy and human visual perception, making use of the proposed family of loss functions. All models are implemented in Python using the TensorFlow [1] library.

4.1 Experimental setup

Image Compression

We use the RNN compression architecture proposed by Toderici et al. [43] with GRUs and the additive reconstruction framework. Our implementation differs from the original version in two aspects. Firstly, during training, we feed as input the full resolution images, rather than 32×32 image patches. Secondly, instead of optimizing the L1-distance in image space, we use our proposed family of loss functions (6) as training objective. Furthermore, we do not use the lossless entropy coding scheme proposed in their original work. While this would likely result in reduced bitrates, and thereby further improve our results, we omit this in order to reduce complexity and focus exclusively on the distortion during training. If not stated otherwise, we train the networks for 8 unrolling steps, resulting in bpp values in the range [0.125, …, 1.0]. As training data $\mathcal{X}$, we use the training split of the ILSVRC-2012 [36] dataset, commonly known as ImageNet-1K. We preprocess the images by resizing them with bilinear interpolation such that the smallest side equals 256 pixels and the aspect ratio is preserved. During training, we take random crops of size 224×224 and randomly flip them horizontally. During validation we use the central crop of size 224×224. We follow [32] and normalize with a mean and variance obtained from a subset of the training set. We train all our networks using the Adam optimizer [24] for three epochs with the learning rate set to 4e-4 and minibatches of size four. All models are trained on eight Nvidia Titan X GPUs with 12GB RAM.

Measures of distortion

We train the compression networks using the loss function proposed in equation (6) and use VGG-16 trained on the ILSVRC-2012 training split as our loss network $f_L$. The weights of the loss network are frozen and left unchanged while training the compression system. We experiment with different values for the parameter $\alpha$, starting the training each time from scratch. Namely, in order to optimize for human visual perception, we set $\alpha=0$, while for classification oriented compression, we set $\alpha=1$. To investigate the trade-off between human vision and classification, we train with $\alpha\in\{\tfrac14,\tfrac12,\tfrac34\}$, also starting training from scratch each time. Since it is not clear which layers in (5) yield optimal results in terms of classification accuracy, we perform a set of experiments with different layers of the loss network $f_L$ in order to find an optimal combination. The results of these experiments are presented in detail in section 4.2.

Traditional codecs

We compare learned image compression with the proposed family of loss functions to the traditional compression algorithms JPEG [47], WebP [16] and BPG [5], the last of which achieves state-of-the-art performance in HVS oriented compression. Following [32, 35], BPG is used in the non-default 4:4:4 chroma format.

Classification

In order to evaluate our method and investigate the trade-off between human visual perception and classification, we evaluate a collection of CNN architectures on datasets compressed with different algorithms and at different bitrates. Note that all classifiers are trained on the uncompressed respective training datasets, without further fine-tuning on decoded data. The evaluation procedure is as follows. Since, in general, the images do not have the same resolution, we resize them such that the smaller side equals $S_{\text{comp}}$ and the aspect ratio is preserved. We then take the central crop of size $S_{\text{comp}}\times S_{\text{comp}}$, yielding square images. After this step, given a compression algorithm, we encode the images for a predefined grid of quality parameters and compute the bpp values for each image and quality parameter. For each quality level, we subsequently take the average over the entire validation set, yielding the final bpp values. Finally, we decode and take the central crop of size $S_{\text{inf}}\times S_{\text{inf}}$ of the decoded image, which is then fed to the classifier. This results in a set of (bpp, accuracy) points for each classifier and compression method. For CNNs that expect inputs of size $S_{\text{inf}}=299$ we set $S_{\text{comp}}=336$, and for those with $S_{\text{inf}}=224$ we set $S_{\text{comp}}=256$. The bpp values can thus slightly differ between classifiers for the same dataset, due to the different input sizes.
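The geometric part of this evaluation procedure and the bpp computation can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, rounding of the resized longer side is a choice made here, and the encoded sizes in bytes are made-up numbers standing in for the output of a real codec.

```python
def resized_shape(height, width, small_side):
    """Shape after resizing so the smaller side equals small_side."""
    if height <= width:
        return small_side, round(width * small_side / height)
    return round(height * small_side / width), small_side


def central_crop_box(height, width, size):
    """(top, left, bottom, right) of the central size x size crop."""
    top, left = (height - size) // 2, (width - size) // 2
    return top, left, top + size, left + size


def bits_per_pixel(num_bytes, height, width):
    """Bits per pixel of an encoded image of the given spatial size."""
    return 8.0 * num_bytes / (height * width)


s_comp, s_inf = 256, 224                          # setting for 224x224 classifiers
h, w = resized_shape(375, 500, s_comp)            # example input of 375x500 pixels
compression_crop = central_crop_box(h, w, s_comp)
inference_crop = central_crop_box(s_comp, s_comp, s_inf)

# Made-up encoded sizes (bytes) of three images at one quality level.
encoded_sizes = [4096, 2048, 6144]
mean_bpp = sum(bits_per_pixel(n, s_comp, s_comp) for n in encoded_sizes) / len(encoded_sizes)

print((h, w), compression_crop, inference_crop, mean_bpp)
```

Since the bpp is computed on the $S_{\text{comp}}\times S_{\text{comp}}$ crop, classifiers with different input sizes see slightly different bitrates for the same quality setting, as noted above.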

ImageNet

The ILSVRC-2012 [36] dataset consists of natural images from 1’000 different classes and contains 1’281’167 training images and 50’000 samples for validation. We use DenseNet-121 [20], Inception-ResNet-V2 [39], Inception-V3 [40], MobileNet-V1 [19], ResNet-50 [18], Xception [10] and VGG-16 [38] for inference and use the weights provided by the Keras Library [11].

Stanford Dogs

The Stanford Dogs [23] dataset consists of images of 120 distinct breeds of dogs. This dataset has been built with images from the ImageNet database and is specifically designed for fine-grained visual categorization. The dataset contains a total of 12’000 training and 8’580 validation images. We use Inception-V3, Inception-ResNet-V2, MobileNet-V1, ResNet-50 and VGG-16 to classify images on this dataset. In order to obtain the classifiers, we use ImageNet pre-trained networks and fine-tune all layers on the original uncompressed training split.

CUB-200-2011

The CUB-200-2011 [46] dataset contains images of 200 different species of birds. Similar to the Stanford Dogs dataset, it has been built for the task of fine-grained visual categorization. The dataset contains a total of 5’994 training and 5’794 validation images. We use the same CNN architectures as in the case of Stanford Dogs, and apply the analogous method to obtain the classifiers.

4.2 Results

Area under the Accuracy Curve

| $\ell$ | VGG-16 (loss network) | ResNet-50 | Inception-V3 | Inception-ResNet-V2 | Xception | MobileNet-V1 | DenseNet-121 |
|---|---|---|---|---|---|---|---|
| $\ell_0$ | 0.927 | 0.734 | 0.906 | 0.959 | 0.897 | 0.653 | 0.770 |
| $\{\phi_{1.1}\}$ | 0.713 | 0.717 | 0.820 | 0.866 | 0.837 | 0.647 | 0.728 |
| $\{\phi_{5.1}\}$ | 0.963 | 0.775 | 0.837 | 0.934 | 0.875 | 0.707 | 0.831 |
| $\{\phi_{1.1},\phi_{5.1}\}$ | 0.945 | 0.928 | 0.950 | 0.967 | 0.947 | 0.887 | 0.919 |

Table 1: AUAC ratios on the ILSVRC-2012 validation set. A value of one means perfect reconstruction, while zero indicates loss of all structure relevant to classification. While choosing a deep layer for reconstruction performs best for the loss network (VGG-16), including an early layer improves generalization to other CNN architectures.

In this section we present our main findings. We start by choosing an optimal subset of layers of the loss network, in our formulation of the feature reconstruction loss in (5). We then investigate the trade-off between human visual perception and image classification using RNN compression trained with an increasingly more classification friendly loss function. We then look at compression in terms of classification accuracy in more detail, followed by our results on human visual perception.

Choosing the right features

We perform a series of experiments with different choices of layers $\ell$ used for the reconstruction loss in (5). In order to compare the loss functions quantitatively in terms of classification accuracy, we compute the ratio between the area under the accuracy curve (AUAC) for lossy compression and for the original accuracy between 0.125 and 0.5 bpp. That is, for each subset $\ell$, we compute $r_\ell:=(A_\ell-A_0)/(A^*-A_0)$, where $A_\ell$ corresponds to the area under the accuracy curve on compressed data, $A_0$ corresponds to random classification and $A^*$ to the original accuracy. (We compute $A^*$ over the interval $[0.125, 0.5]$ as $a^*\cdot(0.5-0.125)$, where $a^*$ denotes the original validation accuracy.) The value $r_\ell=1.0$ thus means no loss in accuracy, while $r_\ell=0.0$ indicates loss of all information relevant to classification. As mentioned before, we choose VGG-16 as our loss network $f_L$. Denote by $\phi_{i.j}$ the responses of the $j$-th convolutional layer after activation in the $i$-th block. In order to find a suitable choice of layers $\ell$, we train RNN compression for three epochs on the ILSVRC-2012 training set for the following choices of $\ell$. We use the entire set of convolutional layers, denoted by $\ell_0$, and compare against choosing $\ell_1=\{\phi_{1.1}\}$, $\ell_2=\{\phi_{5.1}\}$ and $\ell_3=\{\phi_{1.1},\phi_{5.1}\}$. Table 1 summarizes our experiments. We see that choosing only a deep layer results in the best reconstruction quality for the loss network. However, this choice is not optimal for other architectures, indicating that other layers are needed for the compression system to generalize to new classification architectures. Using only the first layer also seems to be suboptimal, indicating that deeper layers store information that is crucial for classification. We see from the table that choosing the combination of early and deep layers, $\ell_3$, results in the best compression quality in terms of preserved accuracy and yields a compression system that generalizes well across architectures. Finally, this evaluation indicates that we are able to improve over choosing the entire set of convolutions $\ell_0$ by dropping the middle layers. From now on, we set $\ell:=\{\phi_{1.1},\phi_{5.1}\}$ in (5). We refer the reader to the supplementary material for detailed plots.
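The AUAC ratio can be computed from a handful of (bpp, accuracy) points. The sketch below is an illustration under assumptions: the area under the accuracy curve is approximated with the trapezoidal rule, the area for random classification is taken as the chance accuracy times the interval length (by analogy with how the area for the original accuracy is defined above), and all numbers are made up.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise linear curve through the points (xs, ys)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def auac_ratio(bpps, accuracies, original_acc, chance_acc):
    """Normalized area under the accuracy curve, r = (A - A0) / (A* - A0).

    bpps must be sorted and span the interval of interest, e.g. [0.125, 0.5].
    """
    interval = bpps[-1] - bpps[0]
    area = trapezoid_area(bpps, accuracies)  # A: accuracy on compressed data
    area_original = original_acc * interval  # A*: original accuracy
    area_chance = chance_acc * interval      # A0: random classification (assumed)
    return (area - area_chance) / (area_original - area_chance)


# Made-up accuracy curve for one classifier, with 1000 classes.
bpps = [0.125, 0.25, 0.375, 0.5]
accuracies = [0.60, 0.70, 0.74, 0.76]
ratio = auac_ratio(bpps, accuracies, original_acc=0.80, chance_acc=0.001)
print(round(ratio, 3))
```

A curve that matches the original accuracy at every bitrate yields a ratio of one, and a curve at chance level yields zero, in line with the interpretation of the values in Table 1.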

From human visual perception to classification

(a) Inception-ResNet-V2, 0.25 bpp
(b) DenseNet-121, 0.25 bpp
(c) Inception-ResNet-V2, 1.0 bpp
(d) DenseNet-121, 1.0 bpp
Figure 3: MS-SSIM evaluated on Kodak, validation accuracy evaluated on ImageNet-1k with off-the-shelf CNNs. Using our proposed loss functions we are able to trade off classification accuracy against MS-SSIM, where the difference is especially pronounced for low bitrates.

In order to investigate the relation between compression quality perceived by humans in terms of MS-SSIM, and by CNN classifiers, we train the compression networks with loss functions that interpolate between human-friendly and classification-friendly loss, i.e. for values of $\alpha$ in $\{0,\tfrac14,\tfrac12,\tfrac34,1\}$. This trade-off can be seen qualitatively in figure 2. Optimizing for MS-SSIM results in images that appear smoother and blurrier, especially for the lower bitrate. Classification optimized compression, on the other hand, results in sharper images but suffers from checkerboard-like artifacts. This type of degradation is a known issue for feature visualization and super resolution (see e.g. [34]) and – in our case – stems from the convolution based loss function which incurs artifacts in gradients. We notice that these differences are more pronounced in the low bitrate regime. In order to quantitatively investigate the trade-off, we visualize the relation in figure 3. On the left axis we plot MS-SSIM evaluated on Kodak, while on the right axis we show the preserved validation accuracy on ImageNet-1K for DenseNet-121 and Inception-ResNet-V2. The figures indicate that, in general, we can indeed trade off accuracy against MS-SSIM by optimizing compression with our proposed family of loss functions. Interestingly, we observe that by increasing the trade-off parameter $\alpha$ from 0 to 0.25, we substantially increase accuracy, while the reduction in MS-SSIM is relatively small. The same holds for the other direction. This relation is especially noticeable for DenseNet-121 in the low bitrate regime. Furthermore, we observe that the effect of the choice of trade-off parameter is much more pronounced in the low bitrate regime. Finally, we notice that the traditional codecs are more aligned with RNN-H compression, indicating that they are indeed optimized for the human observer. We refer the reader to the supplementary material for more visual examples.

ImageNet classification

Figure 4 shows the accuracy curves for the Inception-ResNet-V2 architecture on the ILSVRC-2012 validation set compressed at different bitrates. Table 2 shows the classification accuracies for a wider collection of CNN architectures. We see that RNN-C compression outperforms both the traditional codecs BPG, WebP and JPEG as well as RNN compression trained with MS-SSIM, across all architectures and bitrates considered. In the case of the loss network VGG-16 this is to be expected, since we explicitly train the compression network to produce images whose VGG-features – which are fed to the fully connected layers for classification – match the ones from their original counterparts. Interestingly however, we see that the RNN-C compression system generalizes well to architectures different from the loss network and maintains the accuracy remarkably well, also at very low bitrates. Figure 7 shows a sample image where our method performs worse than BPG compression in terms of classification. We provide the detailed accuracy curves for all classifiers in the supplementary material.

ILSCVRC-2012 Top-1 Validation Accuracy 224×224 input 299×299 input bpp DenseNet-121 MobileNet-V1 ResNet-50 VGG-1633 3 Loss network used to train RNN-C compression. bpp Inception-V3 Xception Low Bitrates (0.125 bpp) RNN-C 0.125 0.5914 0.4903 0.6092 0.6009 0.125 0.6691 0.6815 RNN-H 0.125 0.4109 0.3221 0.4309 0.3797 0.125 0.5550 0.5722 BPG 0.157 0.4661 0.3421 0.4711 0.4448 0.132 0.5750 0.6050 JPEG 0.136 0.0480 0.0493 0.0426 0.0320 0.113 0.2675 0.2166 Medium Bitrates (0.5 bpp) RNN-C 0.500 0.7200 0.6190 0.7206 0.6964 0.500 0.7646 0.7741 RNN-H 0.500 0.6483 0.5426 0.6530 0.6234 0.500 0.7288 0.7400 BPG 0.555 0.6612 0.5541 0.6645 0.6420 0.581 0.7377 0.7523 WebP 0.517 0.6084 0.4991 0.6075 0.6008 0.485 0.7072 0.7255 JPEG 0.474 0.5798 0.4638 0.5995 0.5914 0.515 0.7174 0.7297 High Bitrates (1.0 bpp) RNN-C 1.000 0.7303 0.6347 0.7316 0.7037 1.000 0.7730 0.7840 RNN-H 1.000 0.6998 0.5984 0.7018 0.6732 1.000 0.7631 0.7728 BPG 1.048 0.7085 0.6151 0.7168 0.6841 1.066 0.7618 0.7756 WebP 0.997 0.6930 0.6050 0.7039 0.6829 1.055 0.7589 0.7699 JPEG 1.087 0.6808 0.5865 0.6963 0.6918 0.962 0.7517 0.7622 Original - 0.7453 0.6590 0.7465 0.7088 - 0.7786 0.7907
Table 2: Validation accuracy on ILSVRC-2012 compressed at different bitrates. RNN-C outperforms BPG, JPEG, WebP and RNN-H across bitrates and CNN architectures. RNN compression is trained from scratch on ILSVRC-2012 training data. The classifiers are trained on the original (i.e. not further compressed) ILSVRC-2012 training split, and not finetuned on compressed images.
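To make the gaps in Table 2 easier to read, the short sketch below expresses each method's accuracy as a fraction of the accuracy on uncompressed images. It uses the ResNet-50 column of the low-bitrate block; the helper name `retained_accuracy` is ours and is not part of the original work. Note that the bitrates in this block are only approximately matched (0.125 bpp for the RNN variants, 0.157 bpp for BPG, 0.136 bpp for JPEG), so the comparison is indicative rather than exact.

```python
# Top-1 accuracies copied from Table 2 (ResNet-50 column, low-bitrate block).
ORIGINAL_ACC = 0.7465
LOW_BITRATE_ACC = {
    "RNN-C": 0.6092,
    "RNN-H": 0.4309,
    "BPG": 0.4711,
    "JPEG": 0.0426,
}


def retained_accuracy(compressed_acc, original_acc):
    """Fraction of the uncompressed accuracy that survives compression."""
    if original_acc <= 0:
        raise ValueError("original accuracy must be positive")
    return compressed_acc / original_acc


# Relative accuracy per method, printed from best to worst.
retained = {m: retained_accuracy(a, ORIGINAL_ACC) for m, a in LOW_BITRATE_ACC.items()}
for method, frac in sorted(retained.items(), key=lambda kv: -kv[1]):
    print(f"{method:6s} retains {100 * frac:5.1f}% of the original top-1 accuracy")
```

Under this reading, RNN-C keeps roughly four fifths of the ResNet-50 accuracy at about 0.125 bpp, while BPG keeps a little under two thirds at a slightly higher bitrate.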
Figure 4: Inception-ResNet-V2 Top-1 Validation Accuracy on ILSVRC-2012 compressed at different bitrates. RNN-C outperforms BPG, JPEG, WebP and RNN-H across bitrates. The classifier is trained on the uncompressed ILSVRC-2012 training split.
(a) Inception-ResNet-V2 on Stanford Dogs
(b) Inception-ResNet-V2 on CUB-200-2011
Figure 5: Validation accuracy for fine-grained image classification. RNN-C outperforms BPG, JPEG, WebP and RNN-H across the majority of bitrates on both datasets. RNN compression is trained from scratch on ILSVRC-2012 training data.

Fine-grained visual categorization

In order to explore the generalization properties of our compression system to new tasks, we evaluate our method on two well-known datasets for fine-grained visual categorization, namely Stanford Dogs and the challenging CUB-200-2011. Note that the compression system is trained on the ILSVRC-2012 training split, i.e. we look at the case where the two image sets in (4), the one used to train compression and the one used for classification, are not equal. Figures 5(a) and 5(b) indicate that our classification-oriented compression method outperforms both the traditional codecs and RNN-H compression in terms of preserved classification accuracy with Inception-ResNet-V2 on both datasets. As with ImageNet-1K classification, we see that the difference is especially pronounced in the regime below 0.5 bpp. We provide evaluation plots for the other CNN architectures in the supplementary material.

Human visual perception

(a) Kodak
(b) Stanford Dogs
(c) CUB-200-2011
(d) ILSVRC-2012
Figure 6: Performance of compression systems evaluated on different datasets, measured in terms of MS-SSIM. In each figure, RNN compression was trained on the ILSVRC-2012 training set.
(a) Original: "Croquet Ball" (63.7%)
(b) BPG, 0.086 bpp: "Croquet Ball" (72.5%)
(c) RNN-C, 0.125 bpp: "Abacus" (30.6%)
Figure 7: Sample from ILSVRC-2012 where our method is worse than BPG. Predictions with Inception-ResNet-V2.
(a) Original
(b) BPG, MS-SSIM: 0.9871
(c) RNN-H, MS-SSIM: 0.9728
Figure 8: Sample from ILSVRC-2012 compressed at 0.125 bpp where our method is worse than BPG. BPG manages to reconstruct fine details, while our method results in overly smooth images.

We evaluate the compression quality perceived by the human observer on the validation splits of the ILSVRC-2012, Stanford Dogs and CUB-200-2011 datasets. We additionally evaluate on the widely used Kodak Photo CD dataset [25]. The procedure to compare the compression methods is as follows. On ILSVRC-2012, Stanford Dogs and CUB-200-2011 we resize each image with bilinear interpolation such that the smallest side equals 256 pixels and the aspect ratio is preserved. We then take the central crop of size 256×256. Since the Kodak images are all of equal resolution, we skip this resizing step and keep the original resolution. Subsequently, we compress the images using a predefined grid of quality parameters and compute their bpp values, which are then averaged over the entire validation set. Finally, we compute the MS-SSIM scores between the decoded and original (resized) images. This yields a set of (bpp, MS-SSIM) points for each compression method.

Figure 6 shows that human-oriented RNN-H compression clearly outperforms classification-optimized RNN-C compression across datasets and bitrates. Comparing our method against the traditional codecs, we see that RNN-H slightly outperforms BPG on Kodak and Stanford Dogs and performs comparably to BPG on CUB-200-2011. On ILSVRC-2012, however, BPG outperforms RNN-H with respect to MS-SSIM, especially at low bitrates (< 0.5 bpp). On all datasets considered, RNN-H clearly outperforms WebP and JPEG, while RNN-C performs comparably to JPEG. Recall that, while we evaluate on different datasets, the compression system is trained on the ILSVRC-2012 training set. Figure 8 shows a sample where RNN-H performs worse than BPG in terms of MS-SSIM.
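The geometry and bitrate bookkeeping of the evaluation procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the code used for the experiments: the function names are hypothetical, rounding to the nearest pixel is our choice, and the actual bilinear resampling and MS-SSIM computation (which need an image library) are left out.

```python
def resize_dims(width, height, target=256):
    """Size after scaling so the smaller side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    # max() guards against floating-point rounding pushing a side below `target`.
    return max(target, round(width * scale)), max(target, round(height * scale))


def central_crop_box(width, height, crop=256):
    """(left, top, right, bottom) of the central crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def bits_per_pixel(num_bytes, width, height):
    """Bitrate of one compressed image in bits per pixel (bpp)."""
    return 8 * num_bytes / (width * height)


def mean(values):
    """Arithmetic mean, used to average bpp over a validation set."""
    values = list(values)
    return sum(values) / len(values)


# Example: a 500x375 image is resized to 341x256 and then cropped to 256x256.
new_w, new_h = resize_dims(500, 375)
box = central_crop_box(new_w, new_h)

# Hypothetical compressed file sizes (in bytes) for three 256x256 crops.
avg_bpp = mean(bits_per_pixel(n, 256, 256) for n in (1024, 4096, 8192))
```

Pairing each averaged bpp value with the MS-SSIM score obtained at the same quality setting then gives one (bpp, MS-SSIM) point per setting and codec.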

5 Conclusion

In this paper we investigate the trade-off in learned image compression with RNNs [43] between human visual perception and image classification. To that end, we propose a family of loss functions that on the one hand enables us to optimize compression for the human observer by directly maximizing MS-SSIM. On the other hand, for a different value of the trade-off parameter, we are able to train compression towards subsequent image classification. Our experiments have shown that when using the human friendly loss, RNN compression achieves performance comparable to the state-of-the-art traditional codec BPG [5] and consistently outperforms both JPEG and WebP on four different datasets (namely Kodak, ILSVRC-2012, Stanford Dogs and CUB-200-2011). Our classification friendly loss, which is based on deep features extracted from VGG-16 trained for object recognition, induces a compression system which consistently and by a large margin outperforms all traditional codecs in terms of preserved classification accuracy. Our experiments furthermore indicate that our method is agnostic to the CNN architecture used for classification and does not require that the classifiers are finetuned on compressed image data, enabling the deployment of off-the-shelf publicly available classifiers. This suggests that we can indeed leverage properties attributed to CNN classifiers for our purpose and thereby explicitly optimize image compression for classification. Finally, we find that by moving the loss function only marginally towards classification, we can substantially increase the preserved accuracy while incurring only a small reduction in MS-SSIM, and vice versa. This improves compression in an environment where images are consumed by both humans and classifiers and allows a user to trade off reconstruction quality depending on the observer.

Highlighting the limitations of our method, we first notice that introducing a feature reconstruction loss results in a need for labelled image data, in contrast to the traditionally unsupervised nature of learned image compression. Furthermore, we emphasize that we use MS-SSIM as a model for image similarity perceived by humans. While being a widely adopted measure of distortion in the compression literature, it is only an approximation to a true model of similarity perceived by the human visual system.

Future work could investigate more computer vision tasks in light of machine-oriented compression, and thereby explore the generalization properties of compression systems trained with our proposed loss functions. A further interesting line of future research could involve using a loss network in the feature reconstruction loss that was trained in a self-supervised or unsupervised manner, eliminating the need for labelled data.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems 30, pages 1141–1151. Curran Associates, Inc., 2017.
  • [3] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool. Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958, 2018.
  • [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In International Conference on Learning Representations (ICLR), 2017.
  • [5] F. Bellard. Bpg image format, 2014. https://bellard.org/bpg/.
  • [6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.
  • [7] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
  • [8] J. Bruna, P. Sprechmann, and Y. LeCun. Super-resolution with deep convolutional sufficient statistics. In International Conference on Learning Representations (ICLR), May 2016.
  • [9] T. Chinen, J. Ballé, C. Gu, S. J. Hwang, S. Ioffe, N. Johnston, T. Leung, D. Minnen, S. O’Malley, C. Rosenberg, and G. Toderici. Towards a semantic perceptual image metric. In 2018 25th IEEE International Conference on Image Processing (ICIP), October 2018.
  • [10] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [11] F. Chollet et al. Keras. https://keras.io, 2015.
  • [12] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
  • [13] S. Dodge and L. Karam. Understanding how image quality affects deep neural networks. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6, 2016.
  • [14] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 262–270. Curran Associates, Inc., 2015.
  • [15] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [16] Google. Webp image format. https://developers.google.com/speed/webp/, 2015. Accessed: 2019-03-17.
  • [17] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski. Faster neural networks straight from jpeg. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3933–3944. Curran Associates, Inc., 2018.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
  • [20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [21] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711, 2016.
  • [22] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [23] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, June 2011.
  • [24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), May 2015.
  • [25] E. Kodak. Kodak lossless true color image suite (PhotoCD PCD0992).
  • [26] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256, 2010.
  • [27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [28] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma. Deep image compression via end-to-end learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
  • [29] Z. Liu, T. Liu, W. Wen, L. Jiang, J. Xu, Y. Wang, and G. Quan. Deepn-jpeg: A deep neural network favorable jpeg-based image compression framework. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6, 2018.
  • [30] Z. Liu, X. Xu, T. Liu, Q. Liu, Y. Wang, Y. Shi, W. Wen, M. Huang, H. Yuan, and J. Zhuang. Machine vision guided 3d medical image compression for efficient transmission and accurate segmentation in the clouds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [31] S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374(2065), 2016.
  • [32] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Conditional probability models for deep image compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [33] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Practical full resolution learned lossless image compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [34] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
  • [35] O. Rippel and L. Bourdev. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2922–2930, Aug. 2017.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [37] S. Santurkar, D. Budden, and N. Shavit. Generative compression. 2018 Picture Coding Symposium (PCS), pages 258–262, 2018.
  • [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), May 2015.
  • [39] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 4278–4284. AAAI Press, 2017.
  • [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [41] L. Theis, W. Shi, A. Cunningham, and F. Huszár. Lossy image compression with compressive autoencoders. In International Conference on Learning Representations (ICLR), 2017.
  • [42] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations (ICLR), 2016.
  • [43] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5435–5443, July 2017.
  • [44] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Towards image understanding from deep compression without decoding. In International Conference on Learning Representations (ICLR), Apr. 2018.
  • [45] M. Tschannen, E. Agustsson, and M. Lucic. Deep generative models for distribution-preserving lossy compression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 5929–5940. Curran Associates, Inc., 2018.
  • [46] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [47] G. K. Wallace. The jpeg still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
  • [48] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, volume 2, pages 1398–1402, 2003.

Appendix A Supplementary material

A.1 Visual examples

Original
RNN-H (Ours w/ α=0), 0.125 bpp
BPG, 0.119 bpp
RNN-C (Ours w/ α=1), 0.125 bpp
Figure 9: Our method compared to BPG on the 13th Kodak image.
Original
RNN-H (Ours w/ α=0), 0.125 bpp
BPG, 0.131 bpp
RNN-C (Ours w/ α=1), 0.125 bpp
Figure 10: Our method compared to BPG on a sample image from the ILSVRC-2012 validation set.

A.2 Choosing deep CNN features for reconstruction

(a) VGG-16
(b) ResNet-50
(c) Inception-V3
(d) Xception
(e) MobileNet-V1
(f) Inception-ResNet-V2
(g) DenseNet-121
Figure 11: Validation accuracy on ILSVRC-2012. RNN-C compression is trained on the ImageNet training split using different choices of VGG-16 layers. We see that the subset {ϕ1.1, ϕ5.1} exhibits the best generalization properties in terms of agnosticity to CNN architecture.

A.3 ImageNet-1K classification

(a) DenseNet-121
(b) Inception-V3
(c) MobileNet-V1
(d) ResNet-50
(e) VGG-16
(f) Xception
Figure 12: Validation accuracy on ILSVRC-2012, compressed at different bitrates. RNN-C outperforms BPG, JPEG, WebP and RNN-H across bitrates and CNN architectures. RNN compression is trained on ILSVRC-2012 training data. The classifiers are trained on the original (uncompressed) ILSVRC-2012 training split, and not finetuned on compressed images.

A.4 Fine-grained visual categorization

(a) Inception-V3
(b) MobileNet-V1
(c) ResNet-50
(d) VGG-16
Figure 13: Validation accuracy for Stanford Dogs, compressed at different bitrates. RNN-C outperforms BPG, JPEG, WebP and RNN-H across bitrates and CNN architectures. RNN compression is trained on ILSVRC-2012 training data. The classifiers are trained on the original (uncompressed) Stanford Dogs training split, and not finetuned on compressed images.
(a) Inception-V3
(b) MobileNet-V1
(c) ResNet-50
(d) VGG-16
Figure 14: Validation accuracy for CUB-200-2011, compressed at different bitrates. RNN-C outperforms BPG, JPEG, WebP and RNN-H across bitrates and CNN architectures. RNN compression is trained on ILSVRC-2012 training data. The classifiers are trained on the original (uncompressed) CUB-200-2011 training split, and not finetuned on compressed images.