A Discriminative Learned CNN Embedding for Remote Sensing Image Scene Classification

  • 2019-12-02 08:52:39
  • Wen Wang, Lijun Du, Yinxing Gao, Yanzhou Su, Feng Wang, Jian Cheng


A DISCRIMINATIVELY LEARNED CNN EMBEDDING FOR REMOTE SENSING IMAGE SCENE CLASSIFICATION

Wen Wang1, Lijun Du2, Yinxing Gao1, Yanzhou Su1, Feng Wang1, Jian Cheng1
1University of Electronic Science and Technology of China,
School of Information and Communication Engineering
2Leshan Normal University, School of Computer Science
[email protected], [email protected]
Abstract

In this work, a discriminatively learned CNN embedding is proposed for remote sensing image scene classification. Our proposed siamese network simultaneously computes the classification loss function and the metric learning loss function of the two input images. Specifically, for the classification loss, we use the standard cross-entropy loss function to predict the classes of the images. For the metric learning loss, our siamese network learns to map the intra-class and inter-class input pairs to a feature space where intra-class inputs are close and inter-class inputs are separated by a margin. Concretely, for remote sensing image scene classification, we would like to map images from the same scene to feature vectors that are close, and map images from different scenes to feature vectors that are widely separated. Experiments are conducted on three different remote sensing image datasets to evaluate the effectiveness of our proposed approach. The results demonstrate that the proposed method achieves excellent classification performance.

W. Wang, Y. Gao, Y. Su, F. Wang and J. Cheng are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China, 611731. L. Du is with the School of Computer Science, Leshan Normal University. This research has been supported by the National Natural Science Foundation of China under Grant 61671125 and by the Fundamental Research Funds for the Central Universities under Grant NO.2672018ZYGX2018J008. (Corresponding author: [email protected])

Index Terms: Remote sensing image, scene classification, convolutional neural networks (CNNs), siamese network.

I Introduction

With the development of remote sensing technology, large amounts of high-resolution images are becoming increasingly available, and remote sensing image classification has attracted much attention in recent years [15], [6], [17]. Convolutional neural networks (CNNs) have shown great potential for remote sensing image scene classification. However, scene classification remains a challenging problem because of the large variability in scale, orientation, illumination, viewpoint and scene layout across images.

Fig. 1: The structure of our siamese network: a pair of scene images is fed into two deep CNN models for feature embedding. The two resulting feature embeddings are then used to predict the classes of the two input images, respectively, and to measure the distance between them jointly. Finally, the network optimizes the three objectives.

During the past few years, there has been increased interest in developing methods for remote sensing image scene classification. Traditional methods rely on handcrafted features that describe the primary characteristics of a scene image, such as color, texture and shape, including SIFT [7], LBP [1017623], HOG [4] and Gist [9]. Descriptors based on visual dictionaries (e.g., the Bag of Visual Words (BoVW) model) [16] then attracted attention and were widely used for feature encoding, in which the input is a set of handcrafted features and the output is a set of learned features. In recent years, with the development of deep learning techniques, especially convolutional neural networks (CNNs), deep feature learning-based methods have shown high feature representativeness and generalization capability for remote sensing image scene classification [8], [3], [10]. In deep learning-based methods, a robust and discriminative descriptor is usually employed to represent each input image as a feature vector, where different scene classes are expected to be separated as much as possible in the feature space. However, training with a classification objective alone places only a relatively weak constraint on features extracted from the same scene class: dissimilar features of the same class may be mapped to different classes, which leads to low scene classification accuracy.

In this paper, we use a metric learning loss function to learn, from the labeled training images, how to effectively measure the distance between scene samples. Under this metric, the distance between inter-class input pairs is enlarged and the distance between intra-class input pairs is reduced as much as possible, which improves the discriminative ability of the learned embedding for remote sensing image classification. The proposed network is a siamese architecture [5] that predicts scene classes and measures the distance at the same time. Compared to previous models, we take full advantage of the labeled training data in terms of both pair-wise distance similarity and image classes. To summarize, our contributions are two-fold:

(1) We propose to use a metric learning loss function to learn a discriminative CNN embedding through the siamese loss, which learns the class of each scene and penalizes the distance between deep features according to their corresponding labels.

(2) Our proposed method significantly outperforms the state-of-the-art methods on three remote sensing datasets: Brazilian Coffee Scenes [10], UCMerced LandUse [14] and NWPU-RESISC45 [3].

The rest of this paper is organized as follows. Section II introduces the proposed method in detail. Section III presents the experimental results on three publicly available scene datasets. Finally, we conclude our work in Section IV.

II Our approach

The goal of our proposed method is to extract the features from training images and compute the distance between images with a discriminative metric for accurate image classification. Fig.1 shows an overview of the siamese network for remote sensing image scene classification.

II-A Overall Network

Our network is basically a convolutional siamese network, which consists of two sub-networks with shared weights. Given an input pair of scene images, the proposed network simultaneously predicts the classes and the distance measurement of the two input images.

As shown in Fig. 1, the images are processed by a Convolutional Neural Network (CNN). The CNN involves many individual processing steps, so we refer to the complete CNN as a function, $f = C(x, \theta_c)$, that takes an image $x$ as input and produces a vector $f$ as output, where $f$ is the vectorised representation of the CNN's final-layer activation maps and $\theta_c$ denotes the ConvNet parameters to be learned. In this work, we use GoogleNet [12], pre-trained on ImageNet [11], as the base CNN architecture.
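The weight sharing between the two branches can be illustrated with a minimal sketch. It assumes a toy one-layer embedding in place of the pre-trained GoogleNet; the names `make_embedding` and `siamese_forward` are hypothetical and chosen only for illustration.

```python
import random


def make_embedding(in_dim, out_dim, seed=0):
    """Build a toy embedding f = C(x, theta_c) with a single ReLU layer.

    This is only a stand-in for the CNN; theta_c is the weight matrix below.
    """
    rng = random.Random(seed)
    theta_c = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def embed(x):
        # One output per weight row: ReLU(row . x)
        return [max(0.0, sum(w * v for w, v in zip(row, x)))
                for row in theta_c]

    return embed


def siamese_forward(embed, x_i, x_j):
    """Pass both inputs through the same function, i.e. shared weights."""
    return embed(x_i), embed(x_j)


embed = make_embedding(in_dim=4, out_dim=3)
f_i, f_j = siamese_forward(embed, [1.0, 0.0, 2.0, 1.0], [0.5, 1.0, 0.0, 2.0])
```

Because both branches call the same `embed`, identical inputs always yield identical feature vectors, which is the property the shared-weight design relies on.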

II-B Siamese Loss

Our siamese network trains the feature extraction network to optimize both the classification loss to predict the scene classes and the distance learning loss to estimate similarity.

The first is the classification loss function, which classifies remote sensing images into one of $n$ different classes. It is implemented by following the deep CNN with an $n$-way softmax layer, which outputs a probability distribution over the $n$ classes. The network is trained to minimize the softmax (cross-entropy) loss, which is given by

$$z_i = W_i^{T} f + b_i, \tag{1}$$
$$P_c = P(q = c \mid f) = \frac{\exp(z_c)}{\sum_{k=1}^{n} \exp(z_k)}, \tag{2}$$
$$I(f, \theta_i) = -\log P_c, \tag{3}$$

where $W$ is the softmax weight matrix, $b$ is the bias vector, $c$ is the target class, $P_c$ is the predicted probability of the target class, $f$ is the image feature vector, and $\theta_i$ denotes the softmax layer parameters.
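Equations (1) to (3) can be sketched in a few lines of plain Python. This is a minimal illustration, assuming `W[k]` holds the weight vector of class `k`; the function names are hypothetical, and the maximum logit is subtracted before exponentiation purely for numerical stability (it does not change the result).

```python
import math


def logits(W, b, f):
    """Eq. (1): z_k = W_k^T f + b_k for every class k."""
    return [sum(w * x for w, x in zip(W[k], f)) + b[k] for k in range(len(b))]


def softmax(z):
    """Eq. (2): probability distribution over the n classes."""
    z_max = max(z)  # shifting by the maximum avoids overflow in exp
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(W, b, f, c):
    """Eq. (3): negative log-probability of the target class c."""
    return -math.log(softmax(logits(W, b, f))[c])
```

With all-zero weights and biases the prediction is uniform, so the loss for any target class equals log(n), which is a convenient sanity check.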

The second is the distance learning loss function, which encourages features extracted from scene images of the same class to be similar and enlarges the margin between features from different classes in the feature space. Consider a pair of images $(s_i, s_j)$, each of which is processed by the deep CNN feature extraction network to give the image feature vectors $f_i = C(s_i)$ and $f_j = C(s_j)$, where $C(\cdot)$ is the feature extraction function defined by the CNN. The high-level feature from the fine-tuned CNN has shown discriminative ability and is more compact than the activations in the intermediate layers [2]. We therefore directly compare the high-level features $f_i$ and $f_j$ for the distance estimation. The siamese network training objective can be written as a function of the feature vectors $f_i$ and $f_j$ as follows,

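$$V(f_i, f_j, \theta_v) = \begin{cases} \frac{1}{2}\|f_i - f_j\|_2^2, & i = j, \\ \frac{1}{2}\left[\max(m - \|f_i - f_j\|_2, 0)\right]^2, & i \neq j, \end{cases} \tag{4}$$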

where $\|f_i - f_j\|_2$ is the Euclidean distance between the feature vectors. When two images are from the same scene class ($i = j$), the objective $V$ encourages the features $f_i$ and $f_j$ to be close by minimizing the squared L2 distance between the two vectors. For images from different scene classes ($i \neq j$), it encourages the distance to be larger than a margin $m$. $\theta_v$ is the parameter to be learned in the above training objective function.
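As a minimal sketch of Eq. (4), the two cases can be written as a single function. The name `distance_loss` and the boolean `same_class` flag (standing for the $i = j$ case) are illustrative assumptions.

```python
import math


def euclidean(f_i, f_j):
    """L2 distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))


def distance_loss(f_i, f_j, same_class, margin=1.0):
    """Eq. (4): pull intra-class pairs together, push inter-class pairs apart."""
    d = euclidean(f_i, f_j)
    if same_class:
        return 0.5 * d ** 2
    # Inter-class pairs contribute only while they are closer than the margin.
    return 0.5 * max(margin - d, 0.0) ** 2
```

For two features at distance 5, an intra-class pair incurs a loss of 12.5, while an inter-class pair with margin 1 incurs no loss because it is already separated by more than the margin.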

We can now define the overall training objective $L$ for a single pair of images $(s_i, s_j)$, which jointly optimizes the classification loss function and the distance learning loss function to train the CNN for discriminative feature learning. The formulation is given as follows,

$$L(s_i, s_j) = I(C(s_i)) + I(C(s_j)) + \lambda V(C(s_i), C(s_j)), \tag{5}$$

where λ is used for balancing the two loss functions.
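The joint objective in Eq. (5) can be sketched as follows. This is an illustrative example only: it assumes the softmax outputs `p_i` and `p_j` have already been computed, and the default `lam=1.0` is a placeholder rather than a value reported in this paper.

```python
import math


def joint_loss(p_i, c_i, p_j, c_j, f_i, f_j, lam=1.0, margin=1.0):
    """Eq. (5): two cross-entropy terms plus lam times the distance term.

    p_i, p_j are softmax outputs; c_i, c_j are the target class indices;
    f_i, f_j are the feature vectors of the two images.
    """
    cls = -math.log(p_i[c_i]) - math.log(p_j[c_j])
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if c_i == c_j:
        dist = 0.5 * d ** 2                      # intra-class pair
    else:
        dist = 0.5 * max(margin - d, 0.0) ** 2   # inter-class pair
    return cls + lam * dist
```

Setting `lam=0` recovers the plain classification objective of the baseline, which makes the contribution of the distance term easy to isolate.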

II-C Optimization

Our goal is to learn the parameters $\theta_c$ in the feature extraction function $C$, while $\theta_i$ and $\theta_v$ are only parameters introduced to propagate the classification loss and the distance learning loss during the training stage. In the testing stage, only $\theta_c$ is used for feature extraction. The parameters are updated by stochastic gradient descent.

The gradients of $L$ with respect to $f_i$ and $f_j$ are given by

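$$\frac{\partial L}{\partial f_i} = \frac{\partial I(f_i, \theta_i)}{\partial f_i} + \lambda \frac{\partial V(f_i, f_j, \theta_v)}{\partial f_i}, \tag{6}$$
$$\frac{\partial L}{\partial f_j} = \frac{\partial I(f_j, \theta_i)}{\partial f_j} + \lambda \frac{\partial V(f_i, f_j, \theta_v)}{\partial f_j}. \tag{7}$$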

The gradient of $V$ with respect to $f_i$ is given by the following relation, and the gradient of $V$ with respect to $f_j$ is symmetric to that of $f_i$. These derivatives are used to update the parameters of the deep neural network.

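$$\frac{\partial V}{\partial f_i} = \begin{cases} \|f_i - f_j\|_2 \, \dfrac{\partial \|f_i - f_j\|_2}{\partial f_i} = f_i - f_j, & i = j, \\ -\max(m - \|f_i - f_j\|_2, 0) \, \dfrac{f_i - f_j}{\|f_i - f_j\|_2}, & i \neq j. \end{cases} \tag{8}$$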
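The analytic gradient in Eq. (8) can be checked against central finite differences. The sketch below is a sanity check under stated assumptions rather than part of the training code: the function names are hypothetical, and returning a zero gradient when the two features coincide in the inter-class case is a convention chosen here, since the expression is undefined at zero distance.

```python
import math


def distance_loss(f_i, f_j, same_class, margin=1.0):
    """Eq. (4), repeated here so the check is self-contained."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    return 0.5 * d ** 2 if same_class else 0.5 * max(margin - d, 0.0) ** 2


def distance_grad(f_i, f_j, same_class, margin=1.0):
    """Analytic dV/df_i from Eq. (8)."""
    diff = [a - b for a, b in zip(f_i, f_j)]
    if same_class:
        return diff
    d = math.sqrt(sum(v * v for v in diff))
    if d == 0.0 or d >= margin:
        # Flat beyond the margin; d == 0 is handled by convention (assumption).
        return [0.0] * len(diff)
    return [-(margin - d) * v / d for v in diff]


def numerical_grad(f_i, f_j, same_class, margin=1.0, eps=1e-6):
    """Central finite differences of the loss with respect to f_i."""
    grad = []
    for k in range(len(f_i)):
        plus = list(f_i)
        minus = list(f_i)
        plus[k] += eps
        minus[k] -= eps
        delta = (distance_loss(plus, f_j, same_class, margin)
                 - distance_loss(minus, f_j, same_class, margin))
        grad.append(delta / (2 * eps))
    return grad
```

Comparing `distance_grad` with `numerical_grad` on a pair closer than the margin should give matching values up to the finite-difference error.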

The gradient of $I$ with respect to $z_i$ is given by

$$\frac{\partial I}{\partial z_i} = \begin{cases} P_i - 1, & i = c, \\ P_i, & i \neq c. \end{cases} \tag{9}$$
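Equation (9) states that the gradient with respect to the logits is the softmax output minus a one-hot vector for the target class. A short sketch, with hypothetical function names, illustrates this and compares it with a finite-difference estimate.

```python
import math


def softmax(z):
    z_max = max(z)  # shift for numerical stability
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, c):
    """I as a function of the logits z and the target class c."""
    return -math.log(softmax(z)[c])


def cross_entropy_grad(z, c):
    """Eq. (9): P_k - 1 for the target class, P_k otherwise."""
    return [p_k - (1.0 if k == c else 0.0) for k, p_k in enumerate(softmax(z))]


def numerical_grad(z, c, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = []
    for k in range(len(z)):
        plus = list(z)
        minus = list(z)
        plus[k] += eps
        minus[k] -= eps
        grad.append((cross_entropy(plus, c) - cross_entropy(minus, c)) / (2 * eps))
    return grad
```

Since the softmax outputs sum to one, the components of this gradient always sum to zero.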

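Finally, the network updates the parameters $\theta_i$, $\theta_v$ and $\theta_c$ as $\theta_i = \theta_i - \eta \frac{\partial L}{\partial \theta_i}$, $\theta_v = \theta_v - \eta \frac{\partial L}{\partial \theta_v}$ and $\theta_c = \theta_c - \eta \frac{\partial L}{\partial \theta_c}$, where $\eta$ is the learning rate.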

III Experiments

To evaluate the effectiveness of our proposed model, we have chosen three remote sensing datasets with different visual properties. The details of the three datasets and the experimental implementation are presented in the following subsections. Finally, we present and discuss the experimental results.

Fig. 2: Some example images from the three datasets.

III-A Dataset

UCMerced LandUse. The UCMerced LandUse dataset [14] is one of the first publicly available high-resolution remote sensing imagery datasets. This dataset contains 2100 aerial scene images of 256×256 pixels, equally divided into 21 land-use classes. Fig. 2 (c) shows some examples of ground truth images from three classes in this dataset.

Brazilian Coffee Scenes. This dataset [10] includes multi-spectral scenes taken by the SPOT sensor. It contains 2876 images of 64×64 pixels, equally divided into 2 classes (coffee and non-coffee). Fig. 2 (a) shows some examples from this dataset. It exhibits high intra-class variance caused by different crop management techniques.

NWPU-RESISC45. This dataset [3] contains 31500 images, covering 45 scene classes. Each class consists of 700 images of 256×256 pixels. Fig. 2 (b) shows some samples of two classes from this dataset. It is one of the largest datasets in terms of both the number of scene classes and the total number of images.

III-B Implementation details

TABLE I: Classification accuracy achieved on three different datasets.

Dataset             Method                  Accuracy
UCMerced LandUse    GoogleNet-Basel. [8]    95.47%
UCMerced LandUse    Ours (GoogleNet)        96.26%
Brazilian Coffee    GoogleNet-Basel. [8]    92.11%
Brazilian Coffee    Ours (GoogleNet)        93.65%
NWPU-RESISC45       GoogleNet-Basel. [3]    86.02%
NWPU-RESISC45       Ours (GoogleNet)        89.16%

For the UCMerced LandUse dataset, all the images are randomly cropped to 227×227 and mirrored horizontally during training. For the Brazilian Coffee Scenes and NWPU-RESISC45 datasets, we randomly crop images to 64×64 and 227×227, respectively. The mean image computed from all the training images is subtracted from all the images. In addition, we shuffle the dataset so that the images are presented in random order. The margin $m$ in our experiments is set to 1. All experiments were performed on a 64-bit Intel i7-4790 machine with 32 GB of RAM, a GeForce GTX 1080Ti with 11 GB of memory, and Ubuntu 14.04.1 LTS.
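The preprocessing steps can be sketched in plain Python on single-channel images stored as nested lists. This is an illustration under assumptions: the function names are hypothetical, the flip probability of 0.5 is not stated in the text, and the mean image is assumed to be subtracted before cropping.

```python
import random


def subtract_mean(image, mean_image):
    """Subtract the training-set mean image pixel by pixel."""
    return [[p - m for p, m in zip(row, mean_row)]
            for row, mean_row in zip(image, mean_image)]


def random_crop(image, size, rng):
    """Cut a size x size window at a uniformly random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def mirror(image):
    """Flip the image horizontally."""
    return [row[::-1] for row in image]


def augment(image, mean_image, size, rng, flip_prob=0.5):
    """Mean subtraction, random crop, then an optional horizontal flip."""
    crop = random_crop(subtract_mean(image, mean_image), size, rng)
    return mirror(crop) if rng.random() < flip_prob else crop
```

A call such as `augment(image, mean_image, 227, random.Random(0))` would produce one training sample for a 256×256 input; passing an explicit `random.Random` instance keeps the augmentation reproducible.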

III-C Classification Evaluation

Comparison with the CNN baseline. We trained the baseline networks without the distance measurement. The baseline networks were pretrained on ImageNet and fine-tuned to predict the scene classes. As shown in TABLE I, our method achieves classification accuracies of 96.26%, 93.65% and 89.16% on the UCMerced LandUse, Brazilian Coffee and NWPU-RESISC45 datasets, respectively, which are better than the 95.47%, 92.11% and 86.02% obtained by the GoogleNet baseline. The results show that: (1) the distance measurement has a positive effect on classification accuracy for remote sensing images; (2) our proposed method works across different datasets and improves their results; (3) the proposed model helps the network to learn more discriminative features.

Comparison with different values of λ. λ is a tradeoff parameter that balances the contributions of the classification loss and the distance learning loss, and its value affects the classification accuracy. As shown in Fig. 3, the results show that: (1) the distance measurement has a positive effect on classification accuracy; (2) increasing λ initially improves accuracy, while an excessively large λ decreases performance. The reason is that, if λ is too large, too much attention is paid to the distance measurement and the classification prediction is neglected.

Fig. 3: The effects of the tradeoff parameter λ on accuracy.
Fig. 4: Feature embedding visualizations of the baseline (first row) and our proposed embedding (second row) on the test splits of the three scene datasets using t-SNE [13].

Feature embedding. We qualitatively evaluate our learned features to verify whether they are good generic features. We extract the fully-connected layer features from the test splits of the three datasets. These features are then projected to a 2-dimensional space using t-SNE [13]. Fig. 4 shows the feature visualization of the learned embedding for our model and the baseline CNN. Under the single classification signal of the cross-entropy loss, the features are less separated than with our proposed method, which uses both the classification loss and the distance metric learning loss. If only the softmax loss is used, the resulting deeply learned features contain large intra-class variations. In contrast, jointly using the softmax loss and the distance metric learning loss achieves good intra-class compactness and inter-class separability, which is very beneficial to discriminative feature learning.

IV Conclusion

In this paper, we have proposed to combine class prediction and distance measurement to improve the performance of remote sensing image scene classification. Our proposed method takes full advantage of the similarity information in training samples and learns a discriminative embedding. It outperforms the CNN baseline on three different remote sensing datasets, which demonstrates its effectiveness in improving remote sensing image classification performance.

References

  • [1] Chen Chen, Baochang Zhang, Hongjun Su, Wei Li, and Lu Wang. Land-use scene classification using multi-scale completed local binary patterns. Signal Image & Video Processing, 10(4):745–752, 2016.
  • [2] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In International Conference on Neural Information Processing Systems, pages 1988–1996, 2014.
  • [3] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
  • [4] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition (CVPR ’05), volume 1, pages 886–893, San Diego, United States, June 2005. IEEE Computer Society.
  • [5] R. Hadsell, S. Chopra, and Y. Lecun. Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1735–1742, 2006.
  • [6] Fan Hu, Gui Song Xia, Jingwen Hu, and Liangpei Zhang. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11):14680–14707, 2015.
  • [7] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004.
  • [8] Keiller Nogueira, Otávio A. B. Penatti, and Jefersson A. Dos Santos. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognition, 61:539–556, 2017.
  • [9] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
  • [10] Otavio A. B. Penatti, Keiller Nogueira, and Jefersson A. Dos Santos. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 44–51, 2015.
  • [11] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [12] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. pages 1–9, 2014.
  • [13] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-sne. 2008.
  • [14] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In Sigspatial International Conference on Advances in Geographic Information Systems, pages 270–279, 2010.
  • [15] Fan Zhang, Bo Du, and Liangpei Zhang. Saliency-guided unsupervised feature learning for scene classification. IEEE Transactions on Geoscience & Remote Sensing, 53(4):2175–2184, 2014.
  • [16] Qiqi Zhu, Yanfei Zhong, Bei Zhao, Gui Song Xia, and Liangpei Zhang. Bag-of-visual-words scene classifier with local and global features for high spatial resolution remote sensing imagery. IEEE Geoscience & Remote Sensing Letters, 13(6):747–751, 2017.
  • [17] Qin Zou, Lihao Ni, Tong Zhang, and Qian Wang. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience & Remote Sensing Letters, 12(11):2321–2325, 2015.