Lossy Image Compression with Recurrent Neural Networks: from Human Perceived Visual Quality to Classification Accuracy
Abstract
Deep neural networks have recently advanced the state of the art in image compression and surpassed many traditional compression algorithms. The training of such networks involves carefully trading off entropy of the latent representation against reconstruction quality. The term quality crucially depends on the observer of the images which, in the vast majority of literature, is assumed to be human. In this paper, we go beyond this notion of quality and look at human visual perception and machine perception simultaneously. To that end, we propose a family of loss functions that allows us to optimize deep image compression depending on the observer and to interpolate between human-perceived visual quality and classification accuracy. Our experiments show that our proposed training objectives result in compression systems that, when trained with the machine-friendly loss, preserve accuracy much better than the traditional codecs BPG, WebP and JPEG, without requiring fine-tuning of inference algorithms on decoded images and independent of the classifier architecture. At the same time, when using the human-friendly loss, we achieve competitive performance in terms of MS-SSIM.
1 Introduction




Image compression algorithms aim at finding representations of images that use as little storage – measured in bits – as possible. In contrast to lossless image compression, where the goal is to achieve a high compression rate while requiring perfect reconstruction, lossy image compression enables higher compression rates by allowing for a loss in reconstruction quality. Recently, image compression based on deep neural networks (DNNs) has achieved remarkable results in both lossless [33] and lossy image compression [2, 4, 32, 35, 41, 43], outperforming many traditional codecs. One distinct advantage of such methods is their adaptability to target domains such as medical images on the one hand and training objectives on the other hand. This allows a flexible interpretation of the term quality and ultimately leads to more efficient representations for these tasks. Defining a notion of reconstruction quality is thus a key challenge which crucially depends on the observer of the compressed images. Previous research in lossy image compression expressed quality largely in terms of human visual perception and optimized for the human visual system (HVS), using distortion measures such as multi-scale structural similarity (MS-SSIM) [48] or mean squared error (MSE) as training objectives. However, due to recent advances in computer vision systems, an increasing number of images are observed solely by machines and bypass humans. Consequently, a natural question is whether there exists a relation between quality as perceived by humans and by machines and, if so, how we can trade off quality between different types of observers. In other words, is a compression system optimized for the human observer also optimal for machines? Or can we leverage certain properties attributed to modern computer vision systems and optimize explicitly for machines?
We investigate these questions by specifically looking at classification of natural images as one of the most well-studied tasks in computer vision. The training of modern classifiers is typically a costly and time-consuming undertaking and, at the same time, parameters of the best-performing classifiers are often made publicly available. With that in mind, we are interested in a compression system that generalizes well in the following sense. On the one hand, we want to compress images such that no further fine-tuning of classifiers on compressed images is required and such that the representations are agnostic to classifier architectures. This enables the use of off-the-shelf classifiers for inference on compressed images. On the other hand, the compression system should also generalize well to other visual tasks such as fine-grained categorization of natural images.
Our proposed method for classification-oriented compression relies on a feature reconstruction loss using deep features extracted from the hidden layers of a convolutional neural network trained for image classification. This type of loss function has previously been used as training objective for super-resolution [8, 21, 27], style transfer [15, 21] and texture synthesis [14], and showed remarkable results compared to other objectives such as MSE. In order to optimize for human visual perception, we make use of MS-SSIM as a measure of quality perceived by humans, since it has been reported to correlate better with human perception than MSE. In order to investigate the tradeoff between MS-SSIM and classification accuracy, we look at the convex combination of the two objectives.
In summary, the contributions of our work are threefold:

• We propose a training objective for deep image compression algorithms that is optimized for subsequent classification by off-the-shelf CNNs, consistently outperforming human-optimized compression on three standard image classification datasets in terms of preserved classification accuracy.
• By looking at the convex combination of the human-friendly and machine-friendly losses, we present a simple way to trade off compression quality in terms of human perception against image classification. Since we only rely on the training objective, our method can be integrated into any learned lossy image compression system.
• We show how improved compression quality for the human observer comes at the cost of degraded classification accuracy, and vice versa.
2 Related work
Deep image compression
Image compression using DNNs has recently become an active area of research. The most popular types of architectures used for image compression are based on autoencoders [2, 4, 32, 41] and recurrent neural networks [22, 42, 43] (RNNs). Typically, the networks are trained in an end-to-end manner to minimize a pixel-wise notion of distortion such as MSE, MS-SSIM or the ${L}_{1}$ distance between original and decoded image. Other works make use of adversarial training [35, 37, 45] and Wasserstein distances [45].
Compression for computer vision
Image compression in combination with other computer vision tasks has been studied before in a number of recent works. Liu et al. [29] propose an image compression framework based on JPEG that is favorable to DNN classifiers. Also starting from an engineered codec, Liu et al. [30] propose a 3D image compression framework based on JPEG2000 which is tailored to segmentation of 3D medical images. Both works differ from ours in that we look at learned image compression, rather than modifying an engineered one. A few examples exist in the literature where a classifier is learned from features extracted from the encoded representations. Gueguen et al. [17] train a modified ResNet50 directly on the block-wise discrete cosine transform coefficients from the middle of the JPEG codec. Torfason et al. [44] make use of the compressive autoencoder proposed in [41] and train neural networks for classification and segmentation on the latent (quantized) representations and on the decoded images. These works are orthogonal to ours in that we do not allow training on compressed versions of images. Rather, we train the compression algorithm such that it maintains information relevant for subsequent classification, keeping the classifiers fixed. We furthermore focus on being agnostic to the architecture of the inference algorithm. Finally, since compression artifacts typically compromise the performance of classifiers, Dodge and Karam [13] study the effect of JPEG compression artifacts on image classification with neural networks.
Feature reconstruction loss
These objectives make use of deep features extracted from convolutional neural networks. Recent advances in generative modelling have shown that high quality images can be generated using this type of loss function. Gatys et al. [14, 15] apply the idea to style transfer and texture synthesis, while Johnson et al. [21] and Bruna et al. [8] achieve remarkable results in super resolution [21, 8] and style transfer [21]. Ledig et al. [27] further develop the idea and enhance the CNN feature loss with adversarial training to achieve state-of-the-art results in single image super resolution. In the image compression domain, steps in this direction have also been made. Agustsson et al. [3], Santurkar et al. [37] and Liu et al. [28] enhance pixel-wise distortion and adversarial training with a feature reconstruction loss. Furthermore, Chinen et al. [9] propose a similarity metric based on deep features extracted from VGG16 trained for image classification. These works have in common that their focus is on the human observer, while we exploit properties of feature reconstruction loss in the context of compression geared towards subsequent image classification.
3 Method
In this section we outline our approach to compressing images for human visual perception, classification accuracy and the interpolation between the two. Throughout this paper we adopt the compression architecture proposed by Toderici et al. [43], based on recurrent neural networks. We emphasize that we only focus on the objective functions to account for different types of observers.
3.1 Compression framework
For both human and classification oriented compression, let $\mathcal{X}\subseteq {\mathbb{R}}^{d}$ denote a set of training images. Furthermore, let $\mathcal{Z}\subseteq \mathbb{Z}$ be a set of discrete quantization levels and $d:{\mathbb{R}}^{d}\times {\mathbb{R}}^{d}\to \mathbb{R}$ be a notion of distortion between images. Our goal is to find a compression system consisting of an encoder $E:{\mathbb{R}}^{d}\to {\mathbb{R}}^{m}$ that maps input images $\mathbf{x}$ to their latent representation $\mathbf{z}=E(\mathbf{x})$, a quantizer $q:{\mathbb{R}}^{m}\to {\mathcal{Z}}^{m}$ that discretizes $\mathbf{z}$ to $\widehat{\mathbf{z}}=q(\mathbf{z})$, and a decoder $D:{\mathcal{Z}}^{m}\to {\mathbb{R}}^{d}$ that maps the quantized representation back to image space, $\widehat{\mathbf{x}}=D(\widehat{\mathbf{z}})$. On the one hand, we want the decoded image $\widehat{\mathbf{x}}$ to be as similar to its original version as possible, i.e. the distortion $d(\mathbf{x},\widehat{\mathbf{x}})$ should be small. On the other hand, we want to represent $\widehat{\mathbf{z}}$ efficiently using a small number of bits. Following [4, 32], we use the entropy as a model of compressibility and wish to minimize the rate-distortion tradeoff over the training set $\mathcal{X}$
$$\sum _{\mathbf{x}\in \mathcal{X}}d(\mathbf{x},\widehat{\mathbf{x}})+\beta H(\widehat{\mathbf{z}})$$  (1) 
where $H$ denotes the entropy, and the scalar $\beta \ge 0$ controls the tradeoff between compression rate and distortion. Note that, if $|\mathcal{Z}|=L<\infty$, then
$$H(\widehat{\mathbf{z}})\le m\cdot \mathrm{log}(L),$$  (2) 
in which case the rate term is bounded and we may set $\beta =0$ in (1). A proof of inequality (2) can be found in Cover and Thomas ([12], Theorem 2.6.4). Since we use the compression architecture proposed in [43], we note that $\mathcal{Z}=\{-1,+1\}$. We will thus set $\beta =0$ in our experiments.
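As a quick numerical illustration of inequality (2), the following sketch estimates the empirical entropy of a handful of binary codes and compares it with the bound $m\cdot \mathrm{log}(L)$, measured in bits. The helper name `empirical_entropy_bits` and the toy codes are assumptions made for this example and are not taken from the original implementation.

```python
import math
from collections import Counter


def empirical_entropy_bits(codes):
    """Empirical Shannon entropy (in bits) of a list of hashable codes."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy quantized representations: m = 3 symbols, each from {-1, +1} (L = 2).
codes = [(-1, 1, 1), (1, 1, -1), (-1, 1, 1), (1, -1, -1)]
m, L = 3, 2

entropy = empirical_entropy_bits(codes)  # entropy of the observed codes
bound = m * math.log2(L)                 # right-hand side of (2), in bits
```

Because the rate can never exceed the bound for a fixed code length, dropping the entropy term from the objective leaves the rate capped rather than unconstrained.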
3.2 Optimizing for human visual perception
In order to optimize the compression system for the human observer, we choose a measure of distortion that approximately models human visual perception. The multi-scale structural similarity index (MS-SSIM) [48] is based on the assumption that the human eye is adapted for extracting structural information from images and incorporates image details at multiple resolutions. It is furthermore reported to correlate better with human visual perception than MSE. We thus follow [35, 32, 22] and minimize directly
$${d}_{H}(\mathbf{x},\widehat{\mathbf{x}})=1-\text{MS-SSIM}(\mathbf{x},\widehat{\mathbf{x}}).$$  (3) 
We refer to compression optimized with ${d}_{H}$ as RNN-H.
3.3 Optimizing for classification
Suppose we are given a CNN classifier $f$ trained on a set of images and labels $({\mathcal{X}}^{\prime},{\mathcal{Y}}^{\prime})$ and corresponding training and validation splits $({\mathcal{X}}_{train}^{\prime},{\mathcal{Y}}_{train}^{\prime})$ and $({\mathcal{X}}_{val}^{\prime},{\mathcal{Y}}_{val}^{\prime})$. When we optimize compression for classification, we are interested in finding an encoder, quantizer and decoder such that the accuracy evaluated on the decoded validation set $D(q(E({\mathcal{X}}_{val}^{\prime})))$ is maintained as well as possible, without further fine-tuning the classifier on decoded images. Formally, we wish to maximize
$$\left|\left\{\mathbf{x}\in {\mathcal{X}}_{val}^{\prime}\;\middle|\;\mathrm{arg}\underset{y}{\mathrm{max}}f(y\mid \mathbf{x})=\mathrm{arg}\underset{y}{\mathrm{max}}f(y\mid \widehat{\mathbf{x}})\right\}\right|.$$  (4) 
We are thus not interested in matching decoded and original images on a pixel-wise basis, but rather in preserving features which are relevant for subsequent classification. Image classification is a task which is typically invariant to translations and local deformations (see e.g. [6, 7, 31]), which motivates the use of an objective function with similar properties. For example, using a pixel-wise distortion, such as MSE, which is not invariant to such deformations would be a suboptimal choice. Furthermore, if loss functions have unique minimizers – which is the case for MSE – then, for a fixed encoder, the decoder is biased towards generating the mean $D(\widehat{\mathbf{z}})=\mathbb{E}[\mathbf{x}\mid q(E(\mathbf{x}))=\widehat{\mathbf{z}}]$. In other words, high-frequency information such as textures will tend to get lost in the compression process. While this is less problematic for the HVS, which is more susceptible to low-frequency changes, CNN classifiers are sensitive to any change in frequency as Liu et al. [29] argue.
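The quantity in (4) is simply the number of validation images whose predicted class is unchanged by compression. A minimal sketch, assuming the classifier outputs are available as plain lists of per-class scores (the function name and the toy scores below are hypothetical):

```python
def count_preserved_predictions(scores_original, scores_decoded):
    """Count images whose arg-max class is the same before and after compression."""

    def argmax(scores):
        return max(range(len(scores)), key=scores.__getitem__)

    return sum(
        argmax(s) == argmax(s_hat)
        for s, s_hat in zip(scores_original, scores_decoded)
    )


# Toy classifier outputs f(y | x) and f(y | x_hat) for three images, three classes.
scores_original = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
scores_decoded = [[0.2, 0.5, 0.3], [0.3, 0.6, 0.1], [0.1, 0.3, 0.6]]

preserved = count_preserved_predictions(scores_original, scores_decoded)
```

Note that the count compares predictions on decoded images with predictions on the originals, so no ground-truth labels enter the computation.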
Features learned by CNNs provide a promising alternative and have been used successfully for image super resolution [8, 21, 27], style transfer [15, 21] and texture synthesis [14]. CNNs trained for object recognition learn a collection of filters that extract a hierarchy of information from image data at different levels of abstraction [26]. They incorporate two crucial aspects in their architecture. Firstly, thanks to rectification and pooling units, CNNs provide stability to small geometric deformations. This is a desirable property for our purpose, since we do not want to put too much emphasis on such deformations. Secondly, they provide features with smaller variance, assuming the input is a locally stationary process – a property inherent to natural images. Bruna and Mallat [7] provide a proof of these properties, in the case where the filters are given by multi-scale wavelets. A further desirable property is that, if features are chosen appropriately, the corresponding reconstruction loss does not have a unique minimizer, thereby alleviating the aforementioned problem of generating the mean and losing relevant high-frequency information. These considerations make distortion measures based on CNN features promising candidates for our purpose.
In order to define a distortion measure that incorporates these properties, we fix a CNN classifier ${f}_{L}$ trained on a dataset $({\mathcal{X}}^{\prime \prime},{\mathcal{Y}}^{\prime \prime})$. Denote by ${\varphi}_{i}$ the responses of the $i$-th convolutional layer after activation and let $\mathcal{I}$ be a set of such layers. Note that $\mathcal{I}$ is not required to include all layers. We then define the distortion measure associated with the loss network ${f}_{L}$ and layers $\mathcal{I}$ to be MSE in feature space
$${d}_{C,\mathcal{I}}(\mathbf{x},\widehat{\mathbf{x}})=\sum _{i\in \mathcal{I}}{\beta}_{i}{\parallel {\varphi}_{i}(\mathbf{x})-{\varphi}_{i}(\widehat{\mathbf{x}})\parallel}_{2}^{2},$$  (5) 
where ${\beta}_{i}:={({H}_{i}\times {W}_{i}\times {C}_{i})}^{-1}$ and ${H}_{i},{W}_{i},{C}_{i}$ denote the dimensions (height, width and number of channels) of the corresponding layer. Note that we do not restrict the loss network to be trained on the same dataset as the compression system or the classifier $f$; however, we do require that ${\mathcal{X}}^{\prime \prime}\cap {\mathcal{X}}_{val}^{\prime}=\mathrm{\varnothing}$. Furthermore, the classifier $f$ might have a different underlying architecture than the loss network ${f}_{L}$. This formulation allows us to investigate the generalizability of the compression system to new datasets and CNN architectures. Throughout this paper we refer to compression optimized with ${d}_{C,\mathcal{I}}$ as RNN-C.
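The feature-space distortion in (5) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature maps are passed in as precomputed arrays of shape (H_i, W_i, C_i), whereas in the paper they are produced by the frozen loss network; the function name and the random toy features are hypothetical.

```python
import numpy as np


def feature_reconstruction_loss(feats_x, feats_x_hat):
    """Eq. (5): per-layer squared L2 feature distance, each scaled by 1 / (H*W*C)."""
    loss = 0.0
    for phi, phi_hat in zip(feats_x, feats_x_hat):
        beta = 1.0 / phi.size  # phi.size == H_i * W_i * C_i
        loss += beta * float(np.sum((phi - phi_hat) ** 2))
    return loss


# Toy stand-ins for the activations of two selected layers (assumed shapes).
rng = np.random.default_rng(0)
feats_x = [rng.normal(size=(8, 8, 4)), rng.normal(size=(2, 2, 16))]
# Pretend the decoded image shifts every activation by 0.1.
feats_x_hat = [f + 0.1 for f in feats_x]

loss = feature_reconstruction_loss(feats_x, feats_x_hat)
```

With this normalization every layer contributes its mean squared feature error, so layers with many activations do not dominate the sum merely because of their size.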
3.4 From human visual perception to classification
In a scenario where images are consumed by both humans and classifiers, we would like to be able to trade off reconstruction quality between the two observers. In other words, we want to have a compressed representation of an image that contains features relevant for classification and looks visually pleasing for the human observer. At the same time, this enables us to investigate the relation between human visual perception and classification accuracy. For that purpose, we consider the convex combination between distortions ${d}_{H}$ and ${d}_{C,\mathcal{I}}$
$${d}_{\alpha ,\mathcal{I}}(\mathbf{x},\widehat{\mathbf{x}})=(1-\alpha )\cdot {\lambda}_{H}\cdot {d}_{H}(\mathbf{x},\widehat{\mathbf{x}})+\alpha \cdot {d}_{C,\mathcal{I}}(\mathbf{x},\widehat{\mathbf{x}})$$  (6) 
and control the tradeoff with the parameter $\alpha \in [0,1]$. The parameter ${\lambda}_{H}$ is a scaling parameter which keeps the two losses at the same order of magnitude and is set to 5’000.
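A minimal sketch of the interpolation in (6), assuming the two distortions have already been evaluated for a given image pair. The function name and the toy distortion values are illustrative; only the default scaling of 5000 comes from the text.

```python
def combined_distortion(d_h, d_c, alpha, lambda_h=5000.0):
    """Eq. (6): convex combination of the human and classification distortions."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * lambda_h * d_h + alpha * d_c


# Toy values: d_H = 1 - MS-SSIM = 0.02 and a feature distortion of 80.
d_h, d_c = 0.02, 80.0
human_only = combined_distortion(d_h, d_c, alpha=0.0)    # pure d_H, scaled
machine_only = combined_distortion(d_h, d_c, alpha=1.0)  # pure d_C
balanced = combined_distortion(d_h, d_c, alpha=0.5)
```

Setting alpha to 0 or 1 recovers the two pure objectives, and intermediate values move linearly between them.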
4 Experiments
In this section we experimentally validate our approach to trading off compression quality between classification accuracy and human visual perception, making use of the proposed family of loss functions. All models are implemented in Python using the TensorFlow [1] library.
4.1 Experimental setup
Image Compression
We use the RNN compression architecture proposed by Toderici et al. [43] with GRUs and the additive reconstruction framework. Our implementation differs from the original version in two aspects. Firstly, during training, we feed as input the full resolution images, rather than 32$\times $32 image patches. Secondly, instead of optimizing the ${L}_{1}$ distance in image space, we use our proposed family of loss functions (6) as training objective. Furthermore, we do not use the lossless entropy coding scheme proposed in their original work. While this would likely result in reduced bitrates, and thereby further improve our results, we omit this in order to reduce complexity and focus exclusively on the distortion during training. If not stated otherwise, we train the networks for 8 unrolling steps, resulting in bpp values in the range [0.125, …, 1.0]. As training data $\mathcal{X}$, we use the training split of the ILSVRC2012 [36] dataset, commonly known as ImageNet-1K. We preprocess the images by resizing them with bilinear interpolation such that the smallest side equals 256 pixels and the aspect ratio is preserved. During training, we take random crops of size 224$\times $224 and randomly flip them horizontally. During validation we use the central crop of size 224$\times $224. We follow [32] and normalize with a mean and variance obtained from a subset of the training set. We train all our networks using the Adam optimizer [24] for three epochs with the learning rate set to 4e-4 and mini-batches of size four. All models are trained on eight Nvidia Titan X GPUs with 12GB RAM.
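The geometry of the validation preprocessing described above (resize the smallest side to 256, then take the central 224$\times $224 crop) can be sketched as follows. The function names, the rounding of the resized dimensions and the floor division for the crop offset are assumptions of this example; the paper does not specify them.

```python
def resize_shape(height, width, smallest_side=256):
    """Target (height, width) so the smallest side matches, keeping the aspect ratio."""
    scale = smallest_side / min(height, width)
    return round(height * scale), round(width * scale)


def central_crop_box(height, width, crop=224):
    """(top, left, bottom, right) of the central crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


# Example: a 375 x 500 image (height x width).
resized = resize_shape(375, 500)
box = central_crop_box(*resized)
```

Only the target shapes and crop coordinates are computed here; the actual pixel resampling uses bilinear interpolation as stated in the text.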
Measures of distortion
We train the compression networks using the loss function proposed in equation (6) and use VGG16 trained on the ILSVRC2012 training split as our loss network ${f}_{L}$. The weights of the loss network are frozen and left unchanged while training the compression system. We experiment with different values for the parameter $\alpha $, starting the training each time from scratch. Namely, in order to optimize for human visual perception, we set $\alpha =0$, while for classification oriented compression, we set $\alpha =1$. To investigate the tradeoff between human vision and classification, we train with $\alpha \in \{\frac{1}{4},\frac{1}{2},\frac{3}{4}\}$, also starting training from scratch each time. Since it is not clear which layers $\mathcal{I}$ in (5) yield optimal results in terms of classification accuracy, we perform a set of experiments with different layers of the loss network ${f}_{L}$ in order to find an optimal combination. The results of these experiments are presented in detail in section 4.2.
Traditional codecs
Classification
In order to evaluate our method and investigate the tradeoff between human visual perception and classification, we evaluate a collection of CNN architectures on datasets compressed with different algorithms and at different bitrates. Note that all classifiers are trained on the respective uncompressed training datasets, without further fine-tuning on decoded data. The evaluation procedure is as follows. Since the images generally do not have the same resolution, we resize them such that the smaller side equals ${S}_{comp}$ and the aspect ratio is preserved. We then take the central crop of size ${S}_{comp}\times {S}_{comp}$, yielding square images. After this step, given a compression algorithm, we encode the images for a predefined grid of quality parameters and compute the bpp values for each image and quality parameter. For each quality level, we subsequently take the average over the entire validation set, yielding the final bpp values. Finally, we decode and take the central crop of size ${S}_{inf}\times {S}_{inf}$ of the decoded image, which is then fed to the classifier. This results in a set of (bpp, accuracy) points for each classifier and compression method. For CNNs that expect inputs of size ${S}_{inf}=299$ we set ${S}_{comp}=336$ and for those with ${S}_{inf}=224$, we set ${S}_{comp}=256$. The bpp values can thus slightly differ between classifiers for the same dataset, due to the different input sizes.
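The bitrate bookkeeping in this procedure can be sketched as below, assuming the encoded size of every image is known in bytes and that all images have been cropped to ${S}_{comp}\times {S}_{comp}$. The function names and the toy file sizes are hypothetical.

```python
def bits_per_pixel(num_bytes, height, width):
    """Bitrate of one encoded image in bits per pixel."""
    return 8 * num_bytes / (height * width)


def mean_bpp(encoded_sizes_bytes, side):
    """Average bpp over a set of square side x side images at one quality level."""
    rates = [bits_per_pixel(n, side, side) for n in encoded_sizes_bytes]
    return sum(rates) / len(rates)


# Two images cropped to 256 x 256 and encoded to 2048 and 4096 bytes.
average_rate = mean_bpp([2048, 4096], side=256)
```

Repeating this for every quality level and pairing each average with the measured accuracy yields the (bpp, accuracy) points described above.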
ImageNet
The ILSVRC2012 [36] dataset consists of natural images from 1’000 different classes and contains 1’281’167 training images and 50’000 samples for validation. We use DenseNet121 [20], InceptionResNetV2 [39], InceptionV3 [40], MobileNetV1 [19], ResNet50 [18], Xception [10] and VGG16 [38] for inference and use the weights provided by the Keras Library [11].
Stanford Dogs
The Stanford Dogs [23] dataset consists of images of 120 distinct breeds of dogs. This dataset has been built with images from the ImageNet database and is specifically designed for fine-grained visual categorization. The dataset contains a total of 12’000 training and 8’580 validation images. We use InceptionV3, InceptionResNetV2, MobileNetV1, ResNet50 and VGG16 to classify images on this dataset. In order to obtain the classifiers, we use ImageNet pretrained networks and fine-tune all layers on the original uncompressed training split.
CUB-200-2011
The CUB-200-2011 [46] dataset contains images of 200 different species of birds. Similar to the Stanford Dogs dataset, it has been built for the task of fine-grained visual categorization. The dataset contains a total of 5’994 training and 5’794 validation images. We use the same CNN architectures as in the case of Stanford Dogs, and apply the analogous method to obtain the classifiers.
4.2 Results
In this section we present our main findings. We start by choosing an optimal subset of layers of the loss network, in our formulation of the feature reconstruction loss in (5). We then investigate the tradeoff between human visual perception and image classification using RNN compression trained with an increasingly more classification friendly loss function. We then look at compression in terms of classification accuracy in more detail, followed by our results on human visual perception.
Choosing the right features
We perform a series of experiments with different choices of layers used for the reconstruction loss in (5). In order to compare the loss functions quantitatively in terms of classification accuracy, we compute the ratio between the area under the accuracy curve (AUAC) for lossy compression and for the original accuracy between 0.125 and 0.5 bpp. That is, for each subset $\mathcal{I}$, we compute $r:=(A-{A}_{0})/({A}^{*}-{A}_{0})$, where $A$ corresponds to the area under the accuracy curve on compressed data, ${A}_{0}$ corresponds to random classification and ${A}^{*}$ to the original accuracy (we compute ${A}^{*}$ over the interval $[0.125,0.5]$ as ${a}^{*}\cdot (0.5-0.125)$, where ${a}^{*}$ denotes the original validation accuracy). The value $r=1.0$ thus means no loss in accuracy, while $r=0.0$ indicates loss of all information relevant to classification. As mentioned before, we choose VGG16 as our loss network ${f}_{L}$. Denote by ${\varphi}_{i.j}$ the responses of the $j$-th convolutional layer after activation in the $i$-th block. In order to find a suitable choice of layers $\mathcal{I}$, we train RNN compression for three epochs on the ILSVRC2012 training set for the following choices of $\mathcal{I}$. We use the entire set of convolutional layers, denoted by ${\mathcal{I}}_{0}$, and compare against choosing ${\mathcal{I}}_{1}=\{{\varphi}_{1.1}\}$, ${\mathcal{I}}_{2}=\{{\varphi}_{5.1}\}$ and ${\mathcal{I}}_{3}=\{{\varphi}_{1.1},{\varphi}_{5.1}\}$. Table 1 summarizes our experiments. We see that choosing only a deep layer results in the best reconstruction quality for the loss network. However, this choice is not optimal for other architectures, indicating that other layers are needed for the compression system to generalize to new classification architectures. Using only the first layer also seems to be suboptimal, indicating that deeper layers store information that is crucial for classification.
We see from the table that choosing the combination of early and deep layers, ${\mathcal{I}}_{3}$, results in the best compression quality in terms of preserved accuracy and yields a compression system that generalizes well across architectures. Finally, this evaluation indicates that we are able to improve over choosing the entire set of convolutions ${\mathcal{I}}_{0}$ by dropping the middle layers. From now on, we set $\mathcal{I}:=\{{\varphi}_{1.1},{\varphi}_{5.1}\}$ in (5). We refer the reader to the supplementary material for detailed plots.
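The preserved-accuracy ratio used to compare the layer choices can be sketched as follows. Two points are assumptions of this example rather than statements from the paper: the area under the accuracy curve is approximated with the trapezoidal rule over the measured (bpp, accuracy) points, and random classification is modelled as chance accuracy 1 / (number of classes). The accuracy values are made up for illustration.

```python
def area_under_curve(bpp, acc):
    """Trapezoidal area under the piecewise-linear (bpp, accuracy) curve."""
    return sum(
        0.5 * (acc[i] + acc[i + 1]) * (bpp[i + 1] - bpp[i])
        for i in range(len(bpp) - 1)
    )


def preserved_accuracy_ratio(bpp, acc, original_acc, num_classes):
    """r = (A - A_0) / (A* - A_0) over the interval [bpp[0], bpp[-1]]."""
    width = bpp[-1] - bpp[0]
    a = area_under_curve(bpp, acc)
    a_0 = width / num_classes     # area under the chance-level accuracy
    a_star = original_acc * width  # area under the original accuracy
    return (a - a_0) / (a_star - a_0)


# Illustrative accuracy curve between 0.125 and 0.5 bpp.
bpp = [0.125, 0.25, 0.375, 0.5]
acc = [0.40, 0.60, 0.70, 0.75]
r = preserved_accuracy_ratio(bpp, acc, original_acc=0.80, num_classes=1000)
```

A curve that matches the original accuracy at every bitrate gives a ratio of 1, and a curve at chance level gives 0.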
From human visual perception to classification


In order to investigate the relation between compression quality perceived by humans in terms of MS-SSIM, and by CNN classifiers, we train the compression networks with loss functions that interpolate between human-friendly and classification-friendly loss, i.e. for values of $\alpha $ in $\{0,\frac{1}{4},\frac{1}{2},\frac{3}{4},1\}$. This tradeoff can be seen qualitatively in figure 2. Optimizing for MS-SSIM results in images that appear smoother and more blurry, especially for the lower bitrate. Classification-optimized compression on the other hand results in sharper images but suffers from checkerboard-like artifacts. This type of degradation is a known issue for feature visualization and super resolution (see e.g. [34]) and – in our case – stems from the convolution-based loss function which incurs artifacts in gradients. We notice that these differences are more pronounced in the low bitrate regime. In order to quantitatively investigate the tradeoff, we visualize the relation in figure 3. On the left axis we plot MS-SSIM evaluated on Kodak, while on the right axis we show the preserved validation accuracy on ImageNet-1K for DenseNet121 and InceptionResNetV2. The figures indicate that generally, we can indeed trade off accuracy against MS-SSIM by optimizing compression with our proposed family of loss functions. Interestingly, we observe that by increasing the tradeoff parameter $\alpha $ from 0 to 0.25, we substantially increase accuracy, while the reduction in MS-SSIM is relatively small. The same holds for the other direction. This relation is especially noticeable for DenseNet121 in the low bitrate regime. Furthermore, we observe that the effect of the choice of tradeoff parameter is much more pronounced in the low bitrate regime. Finally, we notice that the traditional codecs are more aligned with RNN-H compression, indicating that they are indeed optimized for the human observer.
We refer the reader to the supplementary material for more visual examples.
ImageNet classification
Figure 4 shows the accuracy curves for the InceptionResNetV2 architecture on the ILSVRC2012 validation set compressed at different bitrates. Table 2 shows the classification accuracies for a wider collection of CNN architectures. We see that RNN-C compression outperforms both the traditional codecs BPG, WebP and JPEG as well as RNN compression trained with MS-SSIM, across all architectures and bitrates considered. In the case of the loss network VGG16 this is to be expected, since we explicitly train the compression network to produce images whose VGG features – which are fed to the fully connected layers for classification – match the ones from their original counterparts. Interestingly however, we see that the RNN-C compression system generalizes well to architectures different from the loss network and maintains the accuracy remarkably well, also at very low bitrates. Figure 7 shows a sample image where our method performs worse than BPG compression in terms of classification. We provide the detailed accuracy curves for all classifiers in the supplementary material.
Fine-grained visual categorization
In order to explore the generalization properties of our compression system to new tasks, we evaluate our method on two well-known datasets for fine-grained visual categorization, namely Stanford Dogs and the challenging CUB-200-2011. Note that the compression system is trained on the ILSVRC2012 training split, i.e. we look at the case where $\mathcal{X}$ and ${\mathcal{X}}^{\prime}$ in (4) are not equal. Figures 4(a) and 4(b) indicate that our classification-oriented compression method outperforms both the traditional codecs and RNN-H compression in terms of preserved classification accuracy with InceptionResNetV2 on both datasets. Similarly to ImageNet-1K classification, we see that the difference is especially pronounced in the regime below 0.5 bpp. We provide evaluation plots for the other CNN architectures in the supplementary material.
Human visual perception
We evaluate the compression quality perceived by the human observer on the validation splits of the ILSVRC2012, Stanford Dogs and CUB-200-2011 datasets. We additionally evaluate on the widely used Kodak Photo CD dataset [25]. The procedure to compare the compression methods is as follows. On ILSVRC2012, Stanford Dogs and CUB-200-2011 we resize each image with bilinear interpolation such that the smallest side equals 256 pixels and the aspect ratio is preserved. We then take the central crop of size 256$\times $256. Since the Kodak images are all of equal resolution, we skip this first resizing step and keep the original resolution. Subsequently, we compress the images using a predefined grid of quality parameters and compute their bpp values, which are then averaged over the entire validation set. Finally, we compute the MS-SSIM scores between decoded and original (resized) image. This yields a set of (bpp, MS-SSIM) points for each compression method. Figure 6 shows that human-oriented RNN-H compression clearly outperforms classification-optimized RNN-C compression across datasets and bitrates. Comparing our method against the traditional codecs, we see that RNN-H slightly outperforms BPG on Kodak and Stanford Dogs and performs comparably to BPG on CUB-200-2011. On ILSVRC2012 however, BPG outperforms RNN-H with respect to MS-SSIM, especially for very low bitrates ($<$ 0.5 bpp). On all datasets considered, RNN-H clearly outperforms WebP and JPEG, while RNN-C performs comparably to JPEG. Recall that, while we evaluate on different datasets, the compression system is trained on the ILSVRC2012 training set. Figure 8 shows a sample where RNN-H performs worse than BPG in terms of MS-SSIM.
5 Conclusion
In this paper we investigate the trade-off in learned image compression with RNNs [43] between human visual perception and image classification. To that end, we propose a family of loss functions that, on the one hand, enables us to optimize compression for the human observer by directly maximizing MS-SSIM. On the other hand, for a different value of the trade-off parameter, we are able to train compression towards subsequent image classification. Our experiments have shown that when using the human-friendly loss, RNN compression achieves performance comparable to the state-of-the-art traditional codec BPG [5] and consistently outperforms both JPEG and WebP on four different datasets (namely Kodak, ILSVRC-2012, Stanford Dogs and CUB-200-2011). Our classification-friendly loss, which is based on deep features extracted from VGG-16 trained for object recognition, induces a compression system which consistently and by a large margin outperforms all traditional codecs in terms of preserved classification accuracy. Our experiments furthermore indicate that our method is agnostic to the CNN architecture used for classification and does not require that the classifiers are fine-tuned on compressed image data, enabling the deployment of off-the-shelf, publicly available classifiers. This suggests that we can indeed leverage properties attributed to CNN classifiers for our purpose and thereby explicitly optimize image compression for classification. Finally, we find that by moving the loss function only marginally towards classification, we can substantially increase the preserved accuracy while incurring only a small reduction in MS-SSIM, and vice versa. This improves compression in an environment where images are consumed by both humans and classifiers and allows a user to trade off reconstruction quality depending on the observer.
Highlighting the limitations of our method, we first note that introducing a feature reconstruction loss results in a need for labelled image data, in contrast to the traditionally unsupervised nature of learned image compression. Furthermore, we emphasize that we use MS-SSIM as a model for image similarity perceived by humans. While it is a widely adopted measure of distortion in the compression literature, it is only an approximation to a true model of similarity perceived by the human visual system.
Future work could investigate more computer vision tasks in light of machine-oriented compression, and thereby explore the generalization properties of compression systems trained with our proposed loss functions. A further interesting line of future research could involve using a loss network in the feature reconstruction loss that was trained in a self-supervised or unsupervised manner, eliminating the need for labelled data.
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations. In Advances in Neural Information Processing Systems 30, pages 1141–1151. Curran Associates, Inc., 2017.
 [3] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool. Generative adversarial networks for extreme learned image compression. arXiv preprint arXiv:1804.02958, 2018.
 [4] J. Ballé, V. Laparra, and E. P. Simoncelli. End-to-end optimized image compression. In International Conference on Learning Representations (ICLR), 2017.
 [5] F. Bellard. BPG image format, 2014. https://bellard.org/bpg/.
 [6] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, July 2017.
 [7] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
 [8] J. Bruna, P. Sprechmann, and Y. LeCun. Super-resolution with deep convolutional sufficient statistics. In International Conference on Learning Representations (ICLR), May 2016.
 [9] T. Chinen, J. Ballé, C. Gu, S. J. Hwang, S. Ioffe, N. Johnston, T. Leung, D. Minnen, S. O’Malley, C. Rosenberg, and G. Toderici. Towards a semantic perceptual image metric. In 2018 25th IEEE International Conference on Image Processing (ICIP), October 2018.
 [10] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [11] F. Chollet et al. Keras. https://keras.io, 2015.
 [12] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
 [13] S. Dodge and L. Karam. Understanding how image quality affects deep neural networks. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pages 1–6, 2016.
 [14] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 262–270. Curran Associates, Inc., 2015.
 [15] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [16] Google. WebP image format. https://developers.google.com/speed/webp/, 2015. Accessed: 2019-03-17.
 [17] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski. Faster neural networks straight from JPEG. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3933–3944. Curran Associates, Inc., 2018.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [21] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711, 2016.
 [22] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. J. Hwang, J. Shor, and G. Toderici. Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [23] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, June 2011.
 [24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), May 2015.
 [25] E. Kodak. Kodak lossless true color image suite (PhotoCD PCD0992).
 [26] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256, 2010.
 [27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [28] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma. Deep image compression via end-to-end learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
 [29] Z. Liu, T. Liu, W. Wen, L. Jiang, J. Xu, Y. Wang, and G. Quan. DeepN-JPEG: A deep neural network favorable JPEG-based image compression framework. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6, 2018.
 [30] Z. Liu, X. Xu, T. Liu, Q. Liu, Y. Wang, Y. Shi, W. Wen, M. Huang, H. Yuan, and J. Zhuang. Machine vision guided 3d medical image compression for efficient transmission and accurate segmentation in the clouds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [31] S. Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A, 374(2065), 2016.
 [32] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Conditional probability models for deep image compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [33] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Practical full resolution learned lossless image compression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [34] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
 [35] O. Rippel and L. Bourdev. Real-time adaptive image compression. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 2922–2930, Aug. 2017.
 [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [37] S. Santurkar, D. Budden, and N. Shavit. Generative compression. In 2018 Picture Coding Symposium (PCS), pages 258–262, 2018.
 [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), May 2015.
 [39] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 4278–4284. AAAI Press, 2017.
 [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [41] L. Theis, W. Shi, A. Cunningham, and F. Huszár. Lossy image compression with compressive autoencoders. In International Conference on Learning Representations (ICLR), 2017.
 [42] G. Toderici, S. M. O’Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations (ICLR), 2016.
 [43] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell. Full resolution image compression with recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5435–5443, July 2017.
 [44] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool. Towards image understanding from deep compression without decoding. In International Conference on Learning Representations (ICLR), Apr. 2018.
 [45] M. Tschannen, E. Agustsson, and M. Lucic. Deep generative models for distribution-preserving lossy compression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 5929–5940. Curran Associates, Inc., 2018.
 [46] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
 [47] G. K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii–xxxiv, 1992.
 [48] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, volume 2, pages 1398–1402, 2003.