iCartoonFace: A Benchmark of Cartoon Person Recognition

  • 2019-07-31 09:48:41
  • Shichao Li, Yi Zheng, Xiangju Lu, Bo Peng


Cartoons receive increasing attention and have a huge global market. Cartoon person recognition has a wealth of application scenarios. However, there is no large, high-quality dataset for cartoon person recognition, which limits the development of recognition algorithms. In this paper, we propose the first large unconstrained cartoon database, called iCartoonFace. We have released the dataset publicly to promote cartoon person recognition research (the dataset can be requested by sending an email to [email protected]). The dataset contains 68,312 images of 2,639 identities. The identities come from cartoon videos: the samples are extracted from publicly available images on websites and from online videos on iQiYi (https://www.iqiyi.com/). All images pass through a careful manual annotation process. We evaluate state-of-the-art image classification and face recognition algorithms on the iCartoonFace dataset as baselines. We also propose a dataset fusion method that uses face features to improve performance on the cartoon recognition task. Experimental results show that the baseline models perform much worse than humans. The proposed dataset fusion method achieves a 2% improvement over the baseline model. In short, state-of-the-art classification and recognition algorithms are far from perfect for unconstrained cartoon person recognition.



Shichao Li1, Yi Zheng1, Xiangju Lu1, Bo Peng 1
1iQIYI, Inc.
{lishichao_sx, zhengyi01, luxiangju, pengbo02}@qiyi.com


Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Nowadays, cartoons receive increasing attention. The exponential rise of digital media has greatly promoted the development of the cartoon industry. According to surveys(???), the global animation market was worth 259 billion U.S. dollars in 2018 and is expected to grow to 270 billion by 2020. The major animation markets include the United States, Canada, Japan, China, France, Britain, Korea and Germany. Most segments of the animation industry are growing at a rate of 5% year over year. More than 300 animations are produced each year in Japan alone. It is therefore necessary to develop recognition algorithms for this large and important field. Such algorithms can be applied in a variety of commercial scenarios, e.g. advertisements, "just look him", and the provision of creative materials.

Figure 1: Some examples of proposed iCartoonFace dataset.

In the area of face recognition, rich data is publicly available for research. There are numerous datasets(???????), ranging from constrained to unconstrained settings, with identities numbering from several hundred to hundreds of thousands and samples numbering from tens of thousands to millions. Several datasets, e.g. LFW(?) and MegaFace(?), provide evaluation protocols and rankings. This wealth of publicly available face data has greatly promoted research on face recognition. ArcFace(?) reached an accuracy of 99.83% on the LFW benchmark, surpassing human performance. The best result on MegaFace has also reached 99.39%. However, no comparable dataset exists in the field of cartoon face recognition.

Several cartoon datasets have been developed. IIIT-CFW(?) has only 8,928 samples of 100 persons, and its evaluation criteria are not standardized. Danbooru2018(?) is a large-scale cartoon dataset that provides detailed annotations, including identity tags. However, its images are created by different artists, so they may differ greatly from the original cartoon images and from each other. Figure 2 compares an original image with Danbooru images; the comparison suggests that the Danbooru dataset is unsuitable for recognition or classification tasks. Manga109(?) is created for detection and has only a few identities. WebCaricature(?) is a photograph-caricature dataset with only 252 persons and their corresponding caricatures. There are huge amounts of unlabeled cartoon data in online cartoon videos, yet no dataset has been extracted from cartoon videos so far.

Figure 2: Samples of an original cartoon image and Danbooru images. The three on the right are samples labeled uchiha sasuke in the Danbooru dataset.

In this paper, we introduce the iCartoonFace dataset to evaluate and encourage the development of cartoon person recognition. It contains 68,312 images of 2,639 identities. These images are extracted from public images on websites and online videos on iQiYi. All the images are manually labeled, which makes the dataset a good benchmark for evaluating cartoon face recognition algorithms. Note that the dataset contains not only persons but also other identities that appear in cartoon videos, such as animals and monsters. Figure 1 shows some examples from the proposed dataset. The process of assembling iCartoonFace is described in detail in Section 3.

Deep learning algorithms have achieved near-perfect results on image classification and face recognition. We evaluate these algorithms on the proposed iCartoonFace dataset to provide a baseline for researchers, as described in detail in Sections 4 and 5. In addition, we propose fusing a face dataset with the cartoon dataset to try to improve performance.

In summary, our contributions are:

  • We develop the first large unconstrained dataset for cartoon person recognition. Standard evaluation metrics are provided so that researchers can compare different algorithms fairly.

  • We evaluate state-of-the-art classification and recognition algorithms on the proposed iCartoonFace dataset. The results and analysis are provided as a guideline.

  • We propose a dataset fusion method to improve the performance of the algorithms. The experimental results show that the proposed method improves the performance of the baseline models.

Related work


Numerous face datasets have been proposed. For instance, the Labeled Faces in the Wild (LFW)(?) database of face images is designed for studying the problem of unconstrained face recognition. The database contains more than 13,000 images of 5,749 persons. MegaFace(?) is a large-scale face database. It contains 1 million face photos of 690k persons and gives detailed evaluation protocols. WebFace(?) contains more than 10k identities and about 500k images for unconstrained face recognition. CAS-PEAL(?) contains more than 30k images of 1,040 persons for constrained face recognition, covering mainly variations in pose, expression and lighting.

Several cartoon datasets have been proposed before. IIIT-CFW(?) contains 8,928 annotated unconstrained cartoon faces of 100 international celebrities. IIIT-CFW can be used for a wide spectrum of problems because it contains detailed annotations such as type of cartoon, pose, expression and age group. WebCaricature(?) is a large photograph-caricature dataset consisting of 6,042 caricatures and 5,974 photographs of 252 persons collected from the web. For each image in the dataset, 17 labeled facial landmarks are provided. Both of these datasets are built from caricatures. Manga109(?) is a dataset of 109 Japanese comic books created for detection. No dataset has been publicly available for cartoon person recognition so far. DanbooruCharacter(?) is created from the Danbooru(?) dataset. It contains more than 970k pictures of 70k persons. The images in the dataset may differ from the original images because of the artistic processing applied by their creators. The authors provide a baseline model that reaches only 37.3% accuracy. Thus, it is essential to create a large cartoon dataset for cartoon person recognition.

Related studies

Face recognition can be seen as a sub-problem of image classification. Numerous models(????) have been proposed for image classification and achieve impressive results. For instance, ResNet(?) presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. DenseNet(?) connects each layer to every other layer in a feed-forward fashion. NASnet(?) studies a method to learn model architectures directly on the dataset of interest. ResNeXt(?) is constructed by repeating a building block that aggregates a set of transformations with the same topology. SEnet(?) adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. These models have strong feature extraction ability.

Figure 3: The process of assembling iCartoonFace dataset.

In addition to network architecture, loss functions have also attracted researchers' attention. The SoftMax loss used in traditional classification tasks does not explicitly enlarge the decision margin. Many works therefore focus on maximizing inter-class distance and minimizing intra-class distance through the loss function. SphereFace(?) proposes A-SoftMax, which enables convolutional neural networks (CNNs) to learn angularly discriminative features. CosFace(?) maximizes the decision margin in cosine space. ArcFace(?) maximizes the decision margin in angular space. The decision boundaries of these loss functions are shown in Table 1. Each of these loss functions reached state-of-the-art results in the field of face recognition.

Loss Functions   Decision Boundaries
SoftMax          $(W_1 - W_2)x + b_1 - b_2 = 0$
SphereFace(?)    $\|x\|(\cos(m\theta_1) - \cos\theta_2) = 0$
CosFace(?)       $s(\cos\theta_1 - m - \cos\theta_2) = 0$
ArcFace(?)       $s(\cos(\theta_1 + m) - \cos\theta_2) = 0$
Table 1: Decision boundaries of different loss functions.

However, the methods mentioned above may not be suitable for cartoon face recognition. The intra-class distance may be large because of variations in color and pose and the exaggerated depiction of cartoon persons. The inter-class distance may be small if two persons have a similar hairstyle and face type. Cartoon person recognition is therefore a challenging problem.

Assembling iCartoonFace

Figure 4: iCartoonFace statistics. We present randomly selected images(with provided detections in red), region distributions of identities, 3D pose information, image resolution distribution, sharpness and number of samples per person.

In this section, we provide an overview of the iCartoonFace dataset, how it was collected and cleaned, and its statistics. The flow chart of the assembling process is shown in Figure 3. We create iCartoonFace to evaluate and drive the development of cartoon person recognition algorithms. The dataset is publicly available.

Downloading cartoon pictures

We use a variety of ways to collect cartoon images. We first collect a list of cartoon albums from the iQiYi internal system. According to this list, we crawl the main characters of each album from Baidu Encyclopedia and other public websites (http://www.chuanxincao.com/, https://zh.moegirl.org/). In this way, we obtain a list of persons and their corresponding albums, together with a typical image for each person. Then, according to the list, we crawl images from the Baidu and Google image search engines to obtain publicly available data. Queries are given in the form '[person name] + [album name]'. For the iQiYi data that is not publicly available, 10 random episodes are selected for each album; one frame per second is extracted from these episodes, and the extracted images are downloaded. In total, millions of pictures were downloaded in these ways.

Detection and duplicate removal

The downloaded pictures contain a great number of irrelevant or duplicate images, so we apply a detection and de-duplication pipeline consisting of a detection model and a feature extraction model. The detection network follows the structures of ResNet-50(?) and RetinaNet(?) and is trained on a private dataset of 149k annotated images; it achieves 90.8% mAP at an IoU threshold of 0.5 on our private dataset. The feature extraction network follows DenseNet-169(?) and SphereFace(?) and is trained on a private dataset of more than 240k images; it achieves 91% recognition accuracy on our private dataset. The pipeline works as follows. After cartoon faces are detected by the detection network, we crop them so that the face spans 50% of the crop height and width, which includes the full head (Fig 4). We compute the sharpness of each image using the Laplacian and discard images whose sharpness value is less than 30 or whose resolution is less than 100 pixels, which ensures that the images are clear enough. Features are then extracted with the feature extraction network, and feature similarity is computed using Faiss(?). An image is removed as a duplicate if its similarity to another image is greater than 0.95. The same process is applied to the typical images, together with de-duplication by person name and manual annotation, to ensure that each person has only one identity.
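The filtering and de-duplication rules described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes that "sharpness" means the variance of a 4-neighbour Laplacian response, that "resolution" means the shorter image side in pixels, and that similarity is cosine similarity computed by brute force rather than with Faiss. The function names are hypothetical.

```python
import math


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a 2D list of grayscale intensities (a list of rows).
    Using the variance as the sharpness score is an assumption.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def keep_image(gray, min_sharpness=30.0, min_side=100):
    """Keep an image only if it is large enough and sharp enough."""
    height, width = len(gray), len(gray[0])
    return min(height, width) >= min_side and laplacian_variance(gray) >= min_sharpness


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def deduplicate(features, threshold=0.95):
    """Greedy de-duplication: drop an image whose similarity to an
    already kept image is greater than `threshold`. Returns kept indices."""
    kept = []
    for idx, feat in enumerate(features):
        if all(cosine(feat, features[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept
```

The greedy pass is quadratic in the number of images, which is why a similarity-search library is the practical choice at the scale of millions of pictures.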

Obtaining candidate images for each identity

In this stage, each image is labeled with an initial identity. Features extracted from the typical images are used as the reference set. The similarity between the features of the unlabeled samples and the reference set is computed using Faiss(?). A sample is assigned the label of the most similar reference image if the maximum similarity is greater than 0.6. Note that the initial labels do not need to be infallible, as the images are subsequently checked by manual annotation.
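The candidate-labeling rule can be illustrated with a small brute-force sketch. It assumes cosine similarity as the similarity measure and replaces the Faiss search with a plain loop; `assign_candidate_labels` is a hypothetical name, and samples below the threshold are left unlabeled (`None`).

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def assign_candidate_labels(sample_feats, reference_feats, reference_labels,
                            threshold=0.6):
    """Give each sample the label of its most similar reference image,
    or None when the best similarity does not exceed `threshold`."""
    labels = []
    for feat in sample_feats:
        sims = [cosine(feat, ref) for ref in reference_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(reference_labels[best] if sims[best] > threshold else None)
    return labels
```

Because these labels only pre-sort the images for the annotators, a loose threshold such as 0.6 trades some wrong candidates for higher recall.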

Final manual annotation

All clusters of images are cleaned by a manual annotation process, for which we developed an image annotation system. On the annotation page, one part shows a typical reference image and the other part displays the image to be labeled. The labelers determine whether the image to be labeled has the same identity as the reference image.


The dataset contains 68,312 samples. They are labeled with 2,639 identities, which come from 739 cartoon albums. The statistics of iCartoonFace are presented in Fig 4.

  • Variety. The samples vary in expression, illumination, facial decoration, gender and many other aspects.

  • Regional distribution of cartoon persons. Most persons come from Japan, which is intuitive since the animation industry is highly developed there. Some persons also come from other regions such as China, Europe and America.

  • 3D pose information. We randomly select 10,000 samples and annotate their 3D pose, i.e. yaw, pitch and roll angles. The results show that about 66% of the samples have a small angle of less than 30 degrees, 25% have a medium angle of less than 60 degrees, and about 8% have a large angle between 60 and 90 degrees.

  • Image resolution. All samples in iCartoonFace have a resolution of more than 100 pixels. More than 50% of the samples have a resolution of more than 200 pixels.

  • Sharpness of image. Image quality is measured with the Laplacian. We resize each image to a resolution of 256 x 256 and then compute its sharpness. The value for most samples is more than 100, which indicates that most samples are clear.

  • Number of samples per person. Most identities have more than 3 samples. Note that 3 persons have 3 samples and 14 persons have 4 samples.

We believe that the proposed dataset is extremely useful for recognition research and cartoon person modeling.

The iCartoonFace Challenge

In this section, we describe the iCartoonFace challenge, evaluation protocols and several baseline models. Our goal is to test performance of cartoon person recognition algorithms.

Recognition scenarios

A biometric system typically operates in either verification mode or identification mode(?). Our recognition scenarios therefore include both identification and verification. In identification, the probe image is not labeled with any identity. To identify the label, the model compares the features extracted from the probe image with every template in the gallery. This type of matching is referred to as 1:N matching, where N is the gallery size. Verification is referred to as 1:1 matching: the probe is compared with a given image to determine whether the two have the same identity. Neither identification nor verification alone can characterize the performance of a recognition model(?). The detailed protocols are described as follows.

Identification: given a probe image and a gallery containing at least one photo of the same person, the algorithm rank-orders all images in the gallery by similarity to the probe. Specifically, the probe set includes N people, each with M photos. We test each of the M photos in turn by adding it to the gallery of distractors and using each of the other M-1 photos as a probe. Results are presented as Cumulative Match Characteristic (CMC) curves, i.e. the probability that a correct gallery image is ranked within the top K for a random probe.
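The identification protocol can be made concrete with a short sketch. It assumes cosine similarity between feature vectors and defines the rank of the correct gallery photo as one plus the number of distractors that score strictly higher; `cmc_curve` and its arguments are hypothetical names, not part of any released evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def cmc_curve(probe_sets, distractors, max_rank=10):
    """Cumulative Match Characteristic for ranks 1..max_rank.

    probe_sets:  one list of feature vectors (the M photos) per identity.
    distractors: feature vectors of gallery images with other identities.
    """
    hits = [0] * max_rank
    trials = 0
    for photos in probe_sets:
        for g_idx, target in enumerate(photos):   # photo added to the gallery
            for p_idx, probe in enumerate(photos):  # the other M-1 photos
                if p_idx == g_idx:
                    continue
                target_sim = cosine(probe, target)
                rank = 1 + sum(
                    1 for d in distractors if cosine(probe, d) > target_sim
                )
                trials += 1
                for k in range(rank, max_rank + 1):
                    hits[k - 1] += 1
    return [h / trials for h in hits]
```

The first entry of the returned list is the rank-1 accuracy, and the list is non-decreasing by construction.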

Figure 5: A sample of the probe set. The images belong to one class.
Figure 6: Samples of the verification set: (a) negative pairs (left) and (b) positive pairs (right).

Verification: a pair of images is given, and the algorithm should output whether the person in the two images is the same or not. We report verification results with Receiver Operating Characteristic (ROC) curves, i.e. the tradeoff between falsely accepting non-match pairs and falsely rejecting match pairs. The AUC (area under the curve) value and the best accuracy are also reported.
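As an illustration of the verification metrics, the sketch below builds an ROC curve from pair similarity scores and integrates it with the trapezoidal rule. It assumes that a higher score means "more likely the same identity" and that both positive and negative pairs are present; the function names are hypothetical.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (FPR, TPR) points.

    scores: similarity of each pair; labels: 1 for a positive (same-identity)
    pair, 0 for a negative pair. The threshold sweeps from high to low.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Pairs with equal scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]` rank three of the four positive-negative pairs correctly, so the AUC is 0.75.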

Figure 7: Performance of baseline algorithms: (a) CMC, (b) ROC, (c) distractors. The number of distractors for identification is 2,500.

Evaluation sets

All sets are given with images and bounding boxes to indicate the location of faces.

Model         DenseNet-169  ResNet-50  ResNeXt-50  NASnet  SE-ResNet-50  SE-ResNet-101
Rank-1(%)     57.77         53.60      48.77       57.28   53.69         56.35
Best Acc(%)   77.43         76.98      76.93       77.65   77.92         77.33
AUC           0.8647        0.8630     0.8591      0.8705  0.8631        0.8658
Table 2: Results of different algorithms. Rank-1 is the top-1 accuracy on the identification set. Best Acc is the verification accuracy at the best threshold. AUC is the area under the ROC curve.
Figure 8: A schematic indicating the positions of face and cartoon features in the feature space: (a) without face, (b) with face.

Gallery set. The gallery set is created for the identification test. As mentioned above, identification is 1:N matching. To ensure that the gallery contains only one sample with the probe's label, the gallery set is built from samples whose identities appear in neither the training set nor the probe set. In this setting, algorithms do not need to know the labels of the gallery samples, and the one photo added from the M probe photos is the only sample of the corresponding identity. The gallery set contains up to 2,500 images of 2,500 different identities. Note that, because some cartoon classes have highly similar appearance, identification becomes more challenging as the gallery set grows.

Probe set. The probe set is used for the identification test. It contains 10,000 samples of 2,000 persons (1,200 identities that appear in the training set and 800 that do not). The number of samples per identity ranges from 3 to 7. The probe samples vary widely, including in posture, illustration and occlusion. Figure 5 shows a sample of the probe set.

Verification set. The verification set is created for the verification test. Similar to the LFW benchmark(?), it contains 3,000 negative pairs and 3,000 positive pairs. To ensure that the test samples do not appear in the training set, the identities in the verification set do not appear in the training set. Random selection of pairs would make the test set too simple. Therefore, motivated by hard negative mining(?), the verification set is built using feature similarity. For positive pairs, we sample pairs with low similarity within the same class. For negative pairs, we sample pairs with high similarity across different classes. As a result, the verification set is fairly hard. Figure 6 shows some sample pairs from the verification set.
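The similarity-based pair selection can be sketched as follows. This is a simplified illustration: the text says that part of the low-similarity positive pairs and high-similarity negative pairs are sampled, whereas the sketch simply takes the hardest pairs. Cosine similarity and the name `mine_hard_pairs` are assumptions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def mine_hard_pairs(features, labels, n_pos, n_neg):
    """Return (positive_pairs, negative_pairs) as lists of index pairs.

    Positive pairs are the same-identity pairs with the LOWEST similarity;
    negative pairs are the different-identity pairs with the HIGHEST similarity.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(features)), 2):
        sim = cosine(features[i], features[j])
        if labels[i] == labels[j]:
            positives.append((sim, i, j))
        else:
            negatives.append((sim, i, j))
    positives.sort()               # hardest positives first
    negatives.sort(reverse=True)   # hardest negatives first
    return (
        [(i, j) for _, i, j in positives[:n_pos]],
        [(i, j) for _, i, j in negatives[:n_neg]],
    )
```

Enumerating all pairs is quadratic, so a real pipeline would restrict the search, for instance to nearest neighbours returned by a similarity index.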

Evaluation and baselines

In the identification test, for each probe class, after a sample is added to the gallery set, the algorithm computes features for every sample in both the probe set and the gallery set. Feature distances are computed between the probe set and the gallery set, and rank-k accuracy and CMC curves are then derived. The verification test has two parts. The first is 10-fold cross validation: the verification set is randomly divided into 10 parts, 9 of which are used to find the best classification threshold and 1 of which is used to test performance. The average accuracy over the folds is reported as the verification accuracy. 10-fold cross validation avoids unfair overfitting during development. The second part is to plot ROC curves and compute the AUC value by calculating the True Positive Rate (TPR) and False Positive Rate (FPR) under different classification thresholds.
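The 10-fold threshold selection can be sketched as below. The sketch assumes that a pair is predicted positive when its similarity is at least the threshold, that candidate thresholds are the midpoints between consecutive distinct scores, and that folds are formed by a seeded random shuffle; all function names are hypothetical.

```python
import random


def accuracy(scores, labels, threshold):
    """Fraction of pairs classified correctly at `threshold`."""
    correct = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels):
    """Threshold with the highest accuracy among score midpoints."""
    distinct = sorted(set(scores))
    candidates = [distinct[0]] + [
        (a + b) / 2 for a, b in zip(distinct, distinct[1:])
    ]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


def ten_fold_accuracy(scores, labels, folds=10, seed=0):
    """Average held-out accuracy: pick the threshold on 9 folds, test on 1."""
    indices = list(range(len(scores)))
    random.Random(seed).shuffle(indices)
    fold_accuracies = []
    for f in range(folds):
        test = indices[f::folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        threshold = best_threshold(
            [scores[i] for i in train], [labels[i] for i in train]
        )
        fold_accuracies.append(
            accuracy([scores[i] for i in test], [labels[i] for i in test], threshold)
        )
    return sum(fold_accuracies) / folds
```

Because the threshold is never tuned on the fold it is tested on, the averaged accuracy is not inflated by threshold selection.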

As baselines, we implemented several models on our dataset: (a) ResNet(?), a residual learning framework, (b) DenseNet(?), a densely connected convolutional network, (c) NASnet(?), which learns network architectures automatically, (d) ResNeXt(?), an improved version of ResNet, and (e) SEnet(?), which learns the importance of channels.

Dataset Fusion

As mentioned before, numerous works focus on loss functions that maximize inter-class distance and minimize intra-class distance(???). In this part, we try to accomplish the same goal through dataset fusion. The motivation is that human faces and cartoon faces have similar shapes, so a trained cartoon recognition model may output a high probability when given an image of a real person. Figure 9 shows some cartoon examples whose labels are wrongly assigned to a face image with high confidence. This indicates that features extracted from faces may lie close to features extracted from cartoons in the feature space, as shown in Figure 8. Adding face images to the cartoon dataset may therefore help train a better model.

Figure 9: Some samples whose labels are wrongly assigned to a face image.

An intuitive idea is to train a single recognition model that can recognize both faces and cartoons, in the manner of multi-task learning (MTL)(?). Several MTL works(???) have shown promising performance. In our work, we simply apply an MTL network that trains one network on both face and cartoon data. We call this method inter-face dataset fusion.

Our other attempt keeps the number of classes unchanged and adds face data to the cartoon classes. The CASIA-WebFace(?) dataset is used for dataset fusion. Adding face data to cartoon data is equivalent to adding similar features: the face features are inserted into the feature space, where they lie between different classes of cartoon features. Treated as members of a cartoon class, the face features drive the model to enlarge the inter-class distance. We call this method intra-face dataset fusion.
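The difference between the two fusion strategies can be illustrated at the level of training labels. The text does not specify how the MTL network shares its classifier or how face identities are assigned to cartoon classes, so both mappings below are assumptions made only for illustration: inter-face fusion is read as appending face identities as extra classes, and intra-face fusion as folding face identity f into cartoon class f mod C. The function names are hypothetical.

```python
def inter_face_labels(cartoon_labels, face_labels, num_cartoon_classes):
    """Inter-face fusion (assumed reading): face identities become new
    classes placed after the cartoon classes, so the class count grows."""
    return list(cartoon_labels) + [num_cartoon_classes + f for f in face_labels]


def intra_face_labels(cartoon_labels, face_labels, num_cartoon_classes):
    """Intra-face fusion (assumed mapping): the class count is unchanged and
    face identity f is folded into cartoon class f mod num_cartoon_classes."""
    return list(cartoon_labels) + [f % num_cartoon_classes for f in face_labels]
```

With 3 cartoon classes and 4 face identities, the first function yields 7 distinct classes, while the second keeps exactly 3.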


This section describes the results and analysis for different enlarged rates, baseline models, loss functions and dataset fusion methods. Unless otherwise mentioned, during training the initial learning rate is set to 0.1, the batch size to 128, the enlarged rate to 1.0 and the total number of epochs to 100. The learning rate is decayed to one-tenth of its value when training reaches 40%, 60% and 80% of the total epochs.
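The step schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from 0 and that the decay takes effect at the first epoch on or after each milestone; `learning_rate` is a hypothetical helper.

```python
def learning_rate(epoch, total_epochs=100, base_lr=0.1,
                  milestones_pct=(40, 60, 80), gamma=0.1):
    """Learning rate at `epoch` (0-based): multiply by `gamma` each time
    training passes one of the milestone percentages of `total_epochs`."""
    passed = sum(
        1 for pct in milestones_pct if epoch >= total_epochs * pct // 100
    )
    return base_lr * gamma ** passed
```

With the default settings, the rate is 0.1 for epochs 0 to 39, 0.01 for epochs 40 to 59, 0.001 for epochs 60 to 79, and 0.0001 afterwards.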

Enlarged rates

As mentioned before, we crop images with an enlarged rate of 1.0, i.e. we enlarge the detected box so that its width and height are twice those of the detected face. Intuitively, different enlarged rates may affect the same algorithm differently. In this part, we therefore apply different enlarged rates to evaluate their influence. Figure 10 shows an example of different enlarged rates.
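A minimal sketch of the enlarged crop follows. It assumes that a rate r widens and heightens the detected box by a factor of (1 + r) around its centre, which matches the stated case of rate 1.0 doubling the width and height, and that the result is clipped to the image; the linear interpretation for other rates and the name `enlarged_box` are assumptions.

```python
def enlarged_box(box, rate, image_w, image_h):
    """Enlarge a detection box (x1, y1, x2, y2) around its centre.

    Width and height are scaled by (1 + rate); the box is clipped to the
    image bounds [0, image_w] x [0, image_h].
    """
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * rate / 2
    pad_y = (y2 - y1) * rate / 2
    return (
        max(0, x1 - pad_x),
        max(0, y1 - pad_y),
        min(image_w, x2 + pad_x),
        min(image_h, y2 + pad_y),
    )
```

For a 100 x 100 face box at (100, 100, 200, 200) in a 400 x 400 image, rate 1.0 gives (50, 50, 250, 250), i.e. a 200 x 200 crop.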

Considering network complexity and performance, the experiments are conducted with DenseNet-169 and SphereFace. The results are shown in Table 3. It can be observed that identification performance generally improves as the enlarged rate increases. This is because images with a small rate lack hair information, which is important for cartoon person recognition. An image with a larger rate contains more hair and other information, which helps improve performance. Note that, under random parameter initialization, verification performance is similar across enlarged rates. The reason is that 1:1 matching only needs to compare the features of two samples rather than N samples (2,501 in this experiment), so a correct result can be obtained without much additional information.

Enlarged rate  0.2     0.4     0.6     0.8     1.0
Rank-1(%)      56.73   56.36   57.08   57.43   57.77
Best Acc(%)    77.73   77.62   77.3    77.46   77.43
AUC            0.8619  0.8638  0.8627  0.8628  0.8647
Table 3: Results for different enlarged rates. Rank-1 is the top-1 accuracy on the identification set.
Figure 10: A sample cropped with different enlarged rates. From left to right: 0.2, 0.4, 0.6, 0.8, 1.0.

Baseline models

Figure 7 shows the results of different algorithms. Figure 7(a) shows identification performance in terms of rank-k accuracy: rank-1 means that the correct match received the best score in the whole gallery, and rank-10 means that the correct match is among the first 10 matches. Rank-1 accuracy is listed in Table 2. Figure 7(b) shows the verification results of the different models. In addition to rank-1 accuracy, Table 2 lists the accuracy at the best threshold and the AUC value, i.e. the area under the ROC curve.

It can be observed that none of the models performs well. The rank-1 accuracy of the best algorithm, DenseNet-169, is only 57.77%. Note that the SE module brings a small improvement over ResNet. SE-ResNet-101 performs better than SE-ResNet-50, which indicates that a deeper network may work better than a shallower one. Figure 11 shows a bad case that is difficult to recognize. Because of changes in color, the similarity between features of the same class drops. It can also be observed that two persons have highly similar features if they have a similar hair style, eyebrows and face type. In cartoon videos and pictures, many characters are drawn in a similar style, with only minor differences. For example, cartoon girls tend to have big eyes, a small nose and a pointed chin, while their hair style, hair color, eyebrows and facial decoration may differ. The ability to extract fine-grained features is therefore important in cartoon recognition. An attention model that automatically learns the importance of different parts of the face may improve performance.

We also observe that the probability of encountering similar samples increases as the gallery gets larger, so recognition rates drop for all algorithms, as shown in Figure 7(c).

Figure 11: A bad case in the identification test. (a) is a probe image. (b) is the correct sample in the gallery set. (c) to (e) are the distractors whose features are most similar to the query image.

Figure 12: Performance of loss functions: (a) CMC, (b) ROC. The number of distractors for identification is 2,500.

Loss Functions

We also compared several loss functions designed for face recognition, i.e. SphereFace(?), CosFace(?) and ArcFace(?). The SphereFace can be written as

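$$L_{sphere} = \frac{1}{N}\sum_{i} -\log\left(\frac{e^{\|x_i\|\psi(\theta_{y_i,i})}}{e^{\|x_i\|\psi(\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos(\theta_{j,i})}}\right) \tag{1}$$

where

$$\psi(\theta_{y_i,i}) = (-1)^k \cos(m\theta_{y_i,i}) - 2k, \tag{2}$$

with $\theta_{y_i,i} \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$ and $k \in [0, m-1]$. $m \geq 1$ is an integer that controls the size of the angular margin.

The CosFace can be written as

$$L_{cos} = \frac{1}{N}\sum_{i} -\log\frac{e^{s(\cos(\theta_{y_i,i}) - m)}}{e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})}} \tag{3}$$

subject to

$$W = \frac{W^*}{\|W^*\|}, \quad x = \frac{x^*}{\|x^*\|}, \quad \cos(\theta_{j,i}) = W_j^T x_i. \tag{4}$$

The ArcFace can be written as

$$L_{arc} = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j \neq y_i}^{n} e^{s\cos(\theta_j)}} \tag{5}$$

subject to

$$W_j = \frac{W_j^*}{\|W_j^*\|}, \quad x_i = \frac{x_i^*}{\|x_i^*\|}, \quad \cos(\theta_j) = W_j^T x_i. \tag{6}$$

In equations (1), (3) and (5), $N$ is the number of training samples, $n$ is the number of classes, $s$ is the feature scale and $m$ is the margin.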
Loss          SoftMax  SphereFace  CosFace  ArcFace
Rank-1(%)     57.07    57.77       68.91    66.64
Best Acc(%)   76.20    77.43       76.20    78.05
AUC           0.8583   0.8647      0.8605   0.8598
Table 4: Results of loss functions.

The experimental results are shown in Figure 12 and Table 4. It can be observed that the SphereFace(?), CosFace(?) and ArcFace(?) losses achieve higher rank-1 accuracy than the SoftMax loss. Compared with SoftMax, SphereFace, CosFace and ArcFace enlarge the inter-class angular margin, which increases inter-class distance and reduces intra-class distance. CosFace has the highest rank-1 accuracy. CosFace and ArcFace perform much better than SphereFace, with an improvement of about 10% in rank-1 accuracy. The reason is that they have an explicit decision boundary, whereas the decision boundary of SphereFace changes as θ changes.
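To make the three margins concrete, the sketch below evaluates the target-class term of each loss as a function of the angle θ between a feature and its class weight, following equations (2), (3) and (5). The default values of `s` and `m` are placeholders chosen for illustration, not the settings used in the experiments, and the function names are hypothetical.

```python
import math


def sphereface_psi(theta, m=4):
    """Multiplicative angular margin: psi(theta) = (-1)^k cos(m*theta) - 2k,
    where theta lies in [k*pi/m, (k+1)*pi/m] and k is in [0, m-1]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def cosface_logit(theta, s=30.0, m=0.35):
    """Additive cosine margin: s * (cos(theta) - m)."""
    return s * (math.cos(theta) - m)


def arcface_logit(theta, s=30.0, m=0.5):
    """Additive angular margin: s * cos(theta + m)."""
    return s * math.cos(theta + m)
```

For CosFace and ArcFace the penalty is a fixed offset in cosine or angle, so the boundary stays at a constant margin; for SphereFace the penalty depends on θ itself, which matches the observation that its decision boundary changes with θ.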

Dataset fusion

In this part, we evaluate the dataset fusion methods, i.e. adding face samples to the iCartoonFace dataset. The results of the inter-face fusion method are shown in Table 5. It can be observed that performance improves as the ratio of face classes to cartoon classes approaches 1:1, where rank-1 accuracy improves by about 2.5% compared with using no face data. However, performance stops increasing when the number of face classes grows further: the accuracy with 5,278 face classes is approximately equal to that with 2,639 face classes. The results show that face features can help train a better model.

Table 6 shows the results of the intra-face dataset fusion method. It can be observed that intra-face fusion improves performance more effectively than inter-face fusion. The best rank-1 accuracy reaches 62.51%, about 2.3% higher than inter-face fusion. The method forces the model to map features into a high-dimensional space, which gives the features stronger representation ability. In contrast, the inter-face fusion method has no explicit influence on the representation ability of the features.

Face classes  0       1000    2639    5278
Rank-1(%)     57.77   58.01   60.31   60.25
Best Acc(%)   77.43   77.83   77.78   77.65
AUC           0.8647  0.8644  0.8695  0.8710
Table 5: Results of the inter-face dataset fusion method.

Face classes  0       1000    2639    5278
Rank-1(%)     57.77   58.32   61.60   62.51
Best Acc(%)   77.43   77.52   78.18   77.98
AUC           0.8647  0.8660  0.8704  0.8718
Table 6: Results of the intra-face dataset fusion method.


In this paper, we developed the iCartoonFace dataset. It contains 2,639 cartoon identities and 68,312 samples. iCartoonFace is available to researchers, and we presented results from state-of-the-art models and loss functions. The methods showed varying performance, but none of them performed well enough, which shows that iCartoonFace is a challenging cartoon dataset. We experimented with different enlarged rates and dataset fusion methods; they affected performance, but the effects were not large. The proposed dataset fusion method nevertheless showed promising performance. Future work may pay more attention to more effective dataset fusion methods (e.g. feature fusion), data pre-processing and constrained loss functions. We believe that iCartoonFace will promote research in cartoon face recognition and image classification.

