AESTHETICS ASSESSMENT OF IMAGES CONTAINING FACES

Abstract

Recent research has widely explored the problem of aesthetics assessment of images with generic content. However, few approaches have been specifically designed to predict the aesthetic quality of images containing human faces, which make up a massive portion of photos on the web. This paper introduces a method for aesthetic quality assessment of images with faces. We exploit three different Convolutional Neural Networks to encode information regarding perceptual quality, global image aesthetics, and facial attributes; then, a model is trained to combine these features to explicitly predict the aesthetics of images containing faces. Experimental results show that our approach outperforms existing state-of-the-art methods for both binary (i.e. low/high) and continuous aesthetic score prediction on four different databases.

Simone Bianco, Luigi Celona, Raimondo Schettini

Department of Informatics, Systems and Communication

University of Milano-Bicocca

viale Sarca, 336, 20126, Milano, Italy

©2018 IEEE. Published in the IEEE 2018 International Conference on Image Processing (ICIP 2018), scheduled for 7-10 October 2018 in Athens, Greece. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Index Terms: Image aesthetics, Faces, Convolutional neural networks, Genetic algorithms

1 Introduction

Automatic image aesthetic assessment is a challenging task due to its fuzzy definition and its highly subjective nature. It represents an important criterion for visual content curation and it is useful in many applications such as image retrieval [1, 2], photo enhancement [3], and image cropping [4, 5, 6]. Aesthetic assessment of images with generic content has been addressed in [6, 7, 8]. However, psychology research [9] showed that certain kinds of content are more attractive than others. In fact, professional photographers adopt different photographic techniques and have various aesthetic criteria in mind when taking different types of photos. Therefore, it is reasonable to design features specialized in modeling aesthetic quality for different kinds of photos (e.g. [10]).

In this paper we focus on aesthetic assessment of images containing human faces. The reasons are twofold: this category of photos makes up an important part of the images on social media sites and media content repositories [11, 12], and we have observed that the performance of generic content aesthetic assessment methods [7] drops considerably when dealing with this type of image. It should be clear that, although facial beauty and face aesthetics are two related concepts, the first reflects the attractiveness of the subject’s face, while the second represents the attractiveness of the photo containing the subject’s face (see for example Fig. 1).

Fig. 1: Face aesthetics represents the attractiveness of the photo shot. This takes into account aspects such as: facial expressions, brightness, contrast, etc.

Li et al. [13] evaluated the performance of several categories of features related to aesthetics, such as pose, face locations and photo composition, on their own dataset of photos with faces. Males et al. [14] exploited a support vector machine for aesthetic quality categorization, trained on the combination of global features (e.g. contrast and hue distribution of the whole image) and local features (e.g. sharpness and blown-out highlights of the facial region only). Their experiments were carried out on a set of photos collected from Flickr and manually labeled by five people as being aesthetically appealing or not. Lienhard et al. [15, 16] proposed a new database, called Human Faces Score (HFS), and developed a method based on the selection of low-level features extracted from several regions for both aesthetic quality categorization of portrait images (i.e. low or high) and continuous aesthetic score prediction. Recently, in [17] a composition-based augmentation scheme has been used to train a deep convolutional neural network (DCNN) on a portrait subset of the AVA dataset for binary aesthetic classification.

2 Facial image aesthetic estimation

Fig. 2: Overview of the proposed method.

In this section we describe the proposed method for aesthetic quality assessment of images with faces. The proposed method is depicted in Fig. 2: given a photo, first the largest face is detected, then features are extracted from the whole image and the face region, and finally the trained model is applied for aesthetic quality estimation of the photo.

2.1 Face detection

DLib’s face detector [18] is used to localize the face region. The size of the detected bounding box is then increased by 10% in order to also include a portion of the shoulders.
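The box enlargement step can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the 10% is distributed, so the sketch grows width and height by 10% each, symmetrically around the box center, and clips the result to the image. The function name `expand_box` and its arguments are hypothetical and are not part of DLib.

```python
def expand_box(left, top, right, bottom, img_w, img_h, scale=0.10):
    """Grow a face box by `scale` of its width and height.

    Assumption: growth is symmetric around the box center and the result
    is clipped to the image borders. Coordinates are in pixels.
    """
    dx = (right - left) * scale / 2.0  # half of the extra width
    dy = (bottom - top) * scale / 2.0  # half of the extra height
    new_left = max(0, round(left - dx))
    new_top = max(0, round(top - dy))
    new_right = min(img_w, round(right + dx))
    new_bottom = min(img_h, round(bottom + dy))
    return new_left, new_top, new_right, new_bottom


# A 100x100 box inside a 1000x1000 image grows by 5 pixels on each side.
box = expand_box(100, 100, 200, 200, img_w=1000, img_h=1000)
```

When several faces are detected, the largest box would be selected before this step, as stated in the method overview.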

2.2 Features extraction

Aesthetic quality of photos with generic content as well as the aesthetics of photos with faces depend upon several perceptual properties. Furthermore, face attributes provide fundamental information for the aesthetic evaluation of this specific category of photos. In this paper, we use state-of-the-art CNNs for encoding both perceptual image-related and face properties.

Perceptual features. As highlighted in many previous works, aesthetic quality is strongly influenced by several dimensions such as composition, colorfulness, spatial organization, emphasis, and depth. We consider two pre-trained CNNs, proposed in the authors’ previous works for image quality assessment and generic content aesthetics assessment, in order to encode such information about the whole image (face and background). Specifically, the DeepBIQ model [19] (IQ for short), a CNN trained for blind image quality assessment, is used to encode perceptual quality properties such as noise, exposure, quality, JPEG quality, and sharpness. The DeepIA model [7] (IA for short), a CNN trained for generic content aesthetic assessment, is used to extract features related to global image aesthetics concepts, such as brightness, contrast, color, etc.

Both IA and IQ are 4,096-dimensional feature vectors obtained by considering the activation of the last fully-connected layer immediately before the regression layer.

Facial features. In photos containing faces, observers mainly focus on face regions. Intuitively, face attributes such as facial expressions, the presence of makeup or the presence of accessories are closely related to the aesthetics of this specific category of photos. Therefore, we consider a set of features able to accurately describe the face. To this end, we use the Alignment-Free Facial Attribute Classification Technique (AFFACT) [20] (FA for short), a CNN model trained for the estimation of 40 facial attributes. The 2,048-dimensional vector corresponding to the activations of the fully-connected layer before the classification layer is used as the feature vector.

2.3 Features fusion and learning procedure

Previously extracted features are fused and then exploited for the learning procedure following two different strategies.

The first strategy uses linear concatenation as the fusion technique, followed by a linear support vector machine (SVM) trained to estimate the portrait aesthetic quality. Since the resulting feature vectors have a huge number of features (10,240 when all the features are concatenated), some of which might be redundant, the second strategy also includes a feature selection step. Feature selection refers to the task of identifying relevant features useful to fit accurate models. In this work, we propose a method based on genetic algorithms (GA) to jointly identify a subset of features from the whole feature vector and to optimize a prediction model. The GA is built to solve a mixed integer problem, where some variables are restricted to take only integer values. The real-valued variables are the weights of the linear model which maps features to an aesthetic prediction, while the boolean-valued variables discriminate relevant features from non-relevant ones. A chromosome is then represented as $(i_{0}\ldots i_{j}\ldots i_{N_{f}},\,r_{0}\ldots r_{j}\ldots r_{N_{f}},\,b)$ , where $i_{j}\in\{x\in\mathbb{Z}:0\leq x\leq 1\}$ are binary values controlling feature selection, $r_{j}\in\mathbb{R}$ are the weights, $b\in\mathbb{R}$ is the bias, $x_{j}$ are the features, $j\in[0,N_{f}]$ , and $N_{f}$ is the total number of features. Aesthetic quality is predicted through the following equation:

$$\mathit{pred}=\sum_{j=0}^{N_{f}}x_{j}\,(i_{j}r_{j})+b \qquad (1)$$
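As an illustration of Eq. (1), the following minimal sketch decodes a chromosome and evaluates the masked linear prediction. It assumes that the chromosome is stored as a flat list (selection bits, then weights, then the bias) and that the bias is added once, outside the sum; the function names are hypothetical.

```python
def split_chromosome(chromosome, n_features):
    """Split a flat chromosome into (mask, weights, bias).

    Assumed layout: n_features selection bits, n_features weights, one bias.
    """
    if len(chromosome) != 2 * n_features + 1:
        raise ValueError("unexpected chromosome length")
    mask = chromosome[:n_features]
    weights = chromosome[n_features:2 * n_features]
    bias = chromosome[-1]
    return mask, weights, bias


def predict(features, mask, weights, bias):
    """Eq. (1): sum of x_j * (i_j * r_j) over all features, plus the bias."""
    if not (len(features) == len(mask) == len(weights)):
        raise ValueError("features, mask and weights must have equal length")
    return sum(x * i * r for x, i, r in zip(features, mask, weights)) + bias


# The second feature is switched off by its selection bit.
mask, weights, bias = split_chromosome([1, 0, 1, 0.5, 10.0, -1.0, 0.25], 3)
score = predict([1.0, 2.0, 3.0], mask, weights, bias)
```

A feature whose selection bit is 0 contributes nothing to the prediction regardless of its weight, which is how the GA can reduce the number of features that are actually used.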

For classification, the fitness function to be minimized is the hinge loss, while for regression it is the $\text{Smooth-L}_{1}$ loss (defined in [21]).
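The two losses can be written down in a few lines. The sketch below uses the standard hinge loss with labels in {-1, +1} and the Smooth-L1 definition from [21]; averaging the per-sample losses over the training set to obtain a fitness value is an assumption, since the aggregation is not specified in the text.

```python
def hinge_loss(label, score):
    """Standard hinge loss; `label` is -1 or +1, `score` is the raw prediction."""
    return max(0.0, 1.0 - label * score)


def smooth_l1(pred, target):
    """Smooth-L1 loss as defined in [21]: quadratic near zero, linear elsewhere."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def mean_loss(loss, predictions, targets):
    """Assumed fitness: average per-sample loss (lower is better)."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    return sum(loss(t, p) if loss is hinge_loss else loss(p, t)
               for p, t in zip(predictions, targets)) / len(targets)
```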

3 Experiments

In this section, the evaluation procedure, the considered databases, the experiments and the results are detailed.

3.1 Performance evaluation

For the experiments, the same evaluation procedure adopted in [16] is followed. In more detail, for each experiment 10-fold cross validation is performed by randomly selecting the training and testing images. This procedure is repeated 10 times to avoid sampling bias.
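A minimal sketch of this protocol is given below, assuming that each repetition reshuffles the image indices and partitions them into 10 disjoint folds. The generator name and the seeding scheme are illustrative choices, not details taken from [16].

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # new random partition at every repetition
        folds = [indices[k::n_folds] for k in range(n_folds)]
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield rep, k, train, test


# Example with 250 images, the size of the HFS database.
splits = list(repeated_kfold(250))
```

With 10 folds and 10 repetitions, this yields 100 train/test splits per experiment.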

Classification performance is evaluated in terms of Good Classification Rate (GCR), defined as the ratio between the number of correctly classified images and the number of test images. This is equivalent to classification accuracy.

Regression performance is evaluated in terms of Pearson’s Linear Correlation Coefficient (LCC) between the predicted and the ground-truth aesthetic scores. The average of both GCR and LCC across the 10 rounds is reported.
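Both metrics are simple to compute. The following sketch implements GCR as a percentage and Pearson's LCC directly from their definitions; the function names are illustrative.

```python
import math


def gcr(predicted, ground_truth):
    """Good Classification Rate in percent: correctly classified / total."""
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and aligned")
    correct = sum(p == t for p, t in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)


def pearson_lcc(x, y):
    """Pearson's linear correlation coefficient between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must be aligned and contain at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / (sx * sy)
```

The reported figures would then be the mean of these values over the 10 rounds.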

3.2 Portrait images databases

In this section, the publicly available databases for aesthetic assessment of images with faces are described. The databases consist of images containing people or groups of people, gathered from online photo databases or photo sharing websites (e.g. Flickr, DPChallenge). Given that these photos are collected in real scenarios, they present a wide range of subjects, facial appearance, illumination and imaging conditions.

CUHKPQ. The CUHKPQ [22] is a database manually annotated for image aesthetics categorization (high and low). It consists of 17,673 images organized in seven different categories. In this work, only the images belonging to the “human” category are considered. There are 3,148 such photos, and some sample images are shown in Fig. 3(a).

Human Faces Scores (HFS). The Human Faces Scores (HFS) [15] database contains 250 headshot photos. Specifically, 7 images of each of 20 different people and 110 additional portrait images have been collected. Face images of one subject are given in Fig. 3(b). Each image has been rated by 25 human observers on a scale ranging from 1 to 6 (the highest aesthetic quality).

Face Aesthetics Visual Analysis. The Face Aesthetics Visual Analysis (FAVA) database is a subset of the large-scale AVA dataset [23] containing various images with faces. Each picture is associated with a value between 1 and 10 (the highest quality), corresponding to the average of around 210 collected individual scores. Samples are shown in Fig. 3(c).

Flickr database. The Flickr database has been gathered on Flickr for general aesthetic assessment [1]. It consists of 500 images, each associated with a ground-truth score between 0 and 10, where 10 means high quality. Photos are either portraits or groups of faces. Following [16], only the biggest detected face is considered in each picture. Fig. 3(d) shows samples from the database.

(a) Face images from the CUHKPQ database.

(b) Face images of one subject from the HFS database.

(d) Face images from the Flickr database.

Fig. 3: Examples of face images from the considered databases.

Table 1: GCR (%) of the aesthetic quality categorization for each database by extracting perceptual features from the whole image.

IQ	IA	FA	#features	GA	CUHKPQ GCR (%)	FAVA GCR (%)	Flickr GCR (%)
✓			4,096		93.2	63.6	64.3
	✓		4,096		97.2	67.4	71.6
		✓	2,048		97.0	70.0	66.2
✓		✓	6,144		97.2	70.0	67.6
✓	✓		8,192		97.4	63.0	73.6
	✓	✓	6,144		98.2	71.2	73.6
✓	✓	✓	10,240		98.2	71.2	74.0
✓	✓	✓	8,300	✓	97.5	70.7	73.9

Table 2: LCC of the aesthetic quality prediction for each database by extracting perceptual features from the whole image.

IQ	IA	FA	#features	GA	FAVA LCC	Flickr LCC
✓			4,096		0.38	0.36
	✓		4,096		0.51	0.57
		✓	2,048		0.55	0.48
✓		✓	6,144		0.57	0.51
✓	✓		8,192		0.36	0.56
	✓	✓	6,144		0.62	0.62
✓	✓	✓	10,240		0.61	0.61
✓	✓	✓	10,229	✓	0.62	0.61

3.3 Experimental results

In this section, the experimental setup and results are detailed. Binary aesthetic classification and aesthetic score regression are performed on each of the datasets presented above. For classification, each dataset is split into two equally sized groups (except CUHKPQ, which is already separated by labels), containing the images with the lowest and the highest aesthetic scores, respectively. For the experiments based on feature concatenation and SVM, we employ a linear SVM for binary classification and a linear support vector regressor (SVR) for continuous aesthetic score prediction. We report the performance obtained by considering a single feature vector at a time and then all of their possible combinations. In the experiments involving the GA, all the feature vectors are linearly concatenated. For both classification and regression, the GA is run with a population of 100 individuals, initialized with the parameters (weights and bias) of the SVM previously learned for aesthetic prediction and with perturbed versions of them. The learning parameters are set empirically and differ between classification and regression. More precisely, for classification the number of generations is 200, the crossover probability is 80%, and the elitism is 7%. For regression, the number of generations is 250, the crossover probability is 85%, and the elitism is 10%.
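The GA setup described above can be sketched as follows. Only the population size, the crossover probability and the elitism rate come from the text (the classification values are used as defaults). The Gaussian perturbation with `sigma`, the all-ones initial selection mask, the choice of parents from the better half of the population, the uniform crossover and the bit-flip mutation with `flip_prob` are all assumptions made for illustration. An individual is a hypothetical `(mask, weights, bias)` tuple.

```python
import random


def init_population(svm_weights, svm_bias, size=100, sigma=0.01, seed=0):
    """Seed the population with the SVM solution and perturbed copies of it."""
    rng = random.Random(seed)
    n = len(svm_weights)
    population = [([1] * n, list(svm_weights), svm_bias)]  # unperturbed SVM
    while len(population) < size:
        weights = [w + rng.gauss(0.0, sigma) for w in svm_weights]
        bias = svm_bias + rng.gauss(0.0, sigma)
        population.append(([1] * n, weights, bias))
    return population


def next_generation(population, fitness, crossover_prob=0.80, elitism=0.07,
                    flip_prob=0.01, rng=None):
    """Build one new generation; `fitness` returns a loss (lower is better)."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=fitness)
    n_elite = max(1, round(elitism * len(population)))
    children = list(ranked[:n_elite])  # elites survive unchanged
    parents = ranked[:max(2, len(ranked) // 2)]  # assumed selection scheme
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)
        if rng.random() < crossover_prob:
            # Uniform crossover: each gene comes from either parent.
            pick = [rng.random() < 0.5 for _ in a[0]]
            mask = [x if p else y for p, x, y in zip(pick, a[0], b[0])]
            weights = [x if p else y for p, x, y in zip(pick, a[1], b[1])]
            bias = a[2] if rng.random() < 0.5 else b[2]
        else:
            mask, weights, bias = list(a[0]), list(a[1]), a[2]
        # Assumed mutation: occasionally flip a feature selection bit.
        mask = [1 - m if rng.random() < flip_prob else m for m in mask]
        children.append((mask, weights, bias))
    return children
```

In a full run, `next_generation` would be called 200 times for classification (250 for regression, with `crossover_prob=0.85` and `elitism=0.10`), with `fitness` evaluating the training loss of each individual.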

In order to evaluate how the context (background) influences the aesthetic judgement of images with faces, we perform two sets of experiments. In the first set, perceptual features are extracted from the whole image as previously described, while in the second set these features are extracted considering only the face region.

Experiments considering the whole image. Results for binary aesthetic classification are reported in Table 1. The combination of all the considered features achieves the best results on all the databases, and the GA obtains very close performance while using a smaller set of features. Results for continuous aesthetic score prediction are reported in Table 2. The best correlation is achieved for both FAVA and Flickr by fusing image aesthetics and facial attributes features.

Experiments considering only the face region. Results for binary aesthetic classification are reported in Table 3. The performance on the FAVA dataset is higher than the one obtained by extracting features from the whole image. The reason might be that many images contain only a small portion of background. Results for continuous aesthetic score prediction (in Table 4) confirm that the fusion of all the features is optimal and that the GA-based solution obtains comparable results by using fewer features.

Table 3: GCR (%) of the aesthetic quality categorization for each database by extracting perceptual features from face region.

IQ	IA	FA	#features	GA	CUHKPQ GCR (%)	HFS GCR (%)	FAVA GCR (%)	Flickr GCR (%)
✓			4,096		92.0	72.4	63.3	59.1
	✓		4,096		95.0	73.8	66.5	64.5
		✓	2,048		97.0	71.0	70.0	66.2
✓		✓	6,144		97.0	76.8	70.8	67.2
✓	✓		8,192		95.4	75.1	66.3	65.0
	✓	✓	6,144		97.1	78.0	71.7	65.4
✓	✓	✓	10,240		97.0	79.0	71.8	65.6
✓	✓	✓	8,283	✓	96.1	79.0	71.1	66.5

Table 4: LCC of the aesthetic quality prediction for each database by extracting perceptual features from face region.

IQ	IA	FA	#features	GA	HFS LCC	FAVA LCC	Flickr LCC
✓			4,096		0.59	0.39	0.32
	✓		4,096		0.66	0.50	0.48
		✓	2,048		0.67	0.55	0.48
✓		✓	6,144		0.71	0.56	0.49
✓	✓		8,192		0.68	0.51	0.47
	✓	✓	6,144		0.74	0.62	0.51
✓	✓	✓	10,240		0.74	0.61	0.51
✓	✓	✓	10,087	✓	0.76	0.61	0.51

Table 5 shows the comparison with state-of-the-art methods. We report results for the best solution on all the datasets corresponding to the combination of all the considered features extracted from the whole image. For all datasets, on average we improve GCR by more than 3% with respect to the previous methods for binary aesthetic classification. The improvement in terms of LCC is more than 8% on average.

Table 5: Comparison with state-of-the-art methods for both aesthetic categorization and score prediction for all the considered databases.

Methods	CUHKPQ GCR (%)	HFS GCR (%)	HFS LCC	FAVA GCR (%)	FAVA LCC	Flickr GCR (%)	Flickr LCC
Lienhard [16]	94.8	79.3	0.73	67.1	0.51	69.3	0.49
Kairanbay [17]	-	-	-	65.3	-	-	-
Proposed	98.2	79.0*	0.76*	71.2	0.61	74.0	0.61
*These results are obtained by extracting perceptual features from face region.

4 Conclusions

In this work, we propose a framework for the automatic estimation of the aesthetic quality of images containing faces. This work extends our generic-content aesthetic assessment framework by specializing it for photos containing faces. We use three different CNNs to encode global image aesthetics, perceptual quality and facial attributes. A novel learning procedure based on genetic algorithms is then applied to combine the CNN features and predict image aesthetics. We evaluate the proposed algorithm on both binary and continuous aesthetic score prediction tasks on four benchmark datasets, achieving state-of-the-art performance.

5 Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.

References

[1] Congcong Li, Alexander C Loui, and Tsuhan Chen, “Towards aesthetics: A photo quality assessment and photo selection system,” in International Conference on Multimedia. ACM, 2010, pp. 827–830.
[2] Vassilios Vonikakis, Ramanathan Subramanian, Jonas Arnfred, and Stefan Winkler, “A probabilistic approach to people-centric photo selection and sequencing,” IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2609–2624, 2017.
[3] Subhabrata Bhattacharya, Rahul Sukthankar, and Mubarak Shah, “A framework for photo-quality assessment and enhancement based on visual aesthetics,” in International Conference on Multimedia. ACM, 2010, pp. 271–280.
[4] Gianluigi Ciocca, Claudio Cusano, Francesca Gasparini, and Raimondo Schettini, “Self-adaptive image cropping for small displays,” IEEE Transactions on Consumer Electronics, vol. 53, no. 4, pp. 1622–1627, 2007.
[5] Simone Bianco and Gianluigi Ciocca, “User preferences modeling and learning for pleasing photo collage generation,” ACM Transactions on Multimedia Computing Communications and Applications, vol. 12, no. 1, pp. 1–23, 2015.
[6] Bin Jin, Maria V Ortiz Segovia, and Sabine Süsstrunk, “Image aesthetic predictors based on weighted cnns,” in ICIP. IEEE, 2016, pp. 2291–2295.
[7] Simone Bianco, Luigi Celona, Paolo Napoletano, and Raimondo Schettini, “Predicting image aesthetics with deep learning,” in ACIVS. Springer, 2016, pp. 117–125.
[8] Yueying Kao, Ran He, and Kaiqi Huang, “Deep aesthetic quality assessment with semantic information,” IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1482–1495, 2017.
[9] Michael Freeman et al., The Photographer’s Eye: Composition and Design for Better Digital Photos, CRC Press, 2007.
[10] Wei Luo, Xiaogang Wang, and Xiaoou Tang, “Content-based photo quality assessment,” in ICCV. IEEE, 2011, pp. 2206–2213.
[11] Simone Bianco and Raimondo Schettini, “Adaptive color constancy using faces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1505–1518, 2014.
[12] Saeideh Bakhshi, David A Shamma, and Eric Gilbert, “Faces engage us: Photos with faces attract more likes and comments on instagram,” in Conference on Human Factors in Computing Systems. ACM, 2014, pp. 965–974.
[13] Congcong Li, Andrew Gallagher, Alexander C Loui, and Tsuhan Chen, “Aesthetic quality assessment of consumer photos with faces,” in ICIP. IEEE, 2010, pp. 3221–3224.
[14] Matija Males, Adam Hedi, and Mislav Grgic, “Aesthetic quality assessment of headshots,” in International Symposium ELMAR. IEEE, 2013, pp. 89–92.
[15] Arnaud Lienhard, Marion Reinhard, Alice Caplier, and Patricia Ladret, “Photo rating of facial pictures based on image segmentation,” in VISAPP. IEEE, 2014, vol. 2, pp. 329–336.
[16] Arnaud Lienhard, Patricia Ladret, and Alice Caplier, “How to predict the global instantaneous feeling induced by a facial picture?,” Signal Processing: Image Communication, vol. 39, pp. 473–486, 2015.
[17] Magzhan Kairanbay, John See, and Lai-Kuan Wong, “Aesthetic evaluation of facial portraits using compositional augmentation for deep cnns,” in ACCV. Springer, 2016, pp. 462–474.
[18] Davis E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[19] Simone Bianco, Luigi Celona, Paolo Napoletano, and Raimondo Schettini, “On the use of deep learning for blind image quality assessment,” Signal, Image and Video Processing, vol. 12, no. 2, pp. 355–362, 2018.
[20] Manuel Günther, Andras Rozsa, and Terrance E. Boult, “Affact - alignment free facial attribute classification technique,” in IJCB, 2017.
[21] Ross Girshick, “Fast r-cnn,” in ICCV. IEEE, 2015, pp. 1440–1448.
[22] Xiaoou Tang, Wei Luo, and Xiaogang Wang, “Content-based photo quality assessment,” IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1930–1943, 2013.
[23] Naila Murray, Luca Marchesotti, and Florent Perronnin, “Ava: A large-scale database for aesthetic visual analysis,” in CVPR. IEEE, 2012, pp. 2408–2415.