Using Fully Convolutional Neural Networks to detect manipulated images in videos

  • 2019-11-29 18:10:36
  • Michail Tarasiou, Stefanos Zafeiriou


We propose a compact architecture based on fully convolutional neural networks (FCN) to detect manipulated images of human faces. In contrast to existing FCN architectures for classification, here the final layer feature map exhibits large spatial dimensions with non-global receptive field. The final layer features are spatially averaged using global average pooling (GAP) to provide more robust features. We leverage the structure of the FCN to derive a straightforward way for joint classification and forgery localization training and show that the network's classification performance improves significantly by the addition of a pixelwise classification loss. The trained networks achieve state of the art results in binary classification in the FaceForensics++ dataset and competitive performance in other tasks using a significantly reduced number of parameters and small resolution input images. Additionally, we examine how well the proposed architecture can detect fully generated images using faces from the recently proposed PGAN and StyleGAN methods. We show that this task is easier to learn than detecting manipulated images and that for both cases there is only a small drop of performance when the network is trained using more than one manipulation technique in the training data.


Michail Tarasiou1,2, Stefanos Zafeiriou1,2
1Imperial College London 2Facesoft
This work was not supported by any organization



Image manipulation software, e.g. Adobe Photoshop and GIMP, has made images unreliable as a source of evidence. While photographs can be admissible in court, in case of a dispute the burden of authentication falls to the party introducing them into evidence [legal]. Individuals share the same intuition. After decades of exposure to manipulated image content we do not think of images as self-evidently portraying an accurate representation of reality, and in most cases an image will need to be authenticated before it can be received as such. Until recently, video content has been thought of as a more reliable source of information, given that realistic tampering required a considerable amount of resources and in most cases could be easily identified by experts.
This has changed with the rise of deep learning based techniques for image generation and manipulation as well as developments using computer vision based techniques. While these techniques have the capacity to kickstart a revolution in computer graphics and digital content creation, if misapplied they can be abused maliciously, with severe negative impacts on human rights standards (e.g. privacy, stigmatization, discrimination). Additionally, previously successful image forensics methods do not generalize to the new techniques or to the artifacts produced during strong video compression, which is commonly applied when a video is uploaded to social media platforms. Thus, it is critical to develop tools that help the automatic authentication of video content.
In this paper we propose a methodology for detecting manipulated images of faces in compressed videos. Our key contributions are the following:

  • we design an FCN [fcn] architecture specifically for the task of detecting manipulated images. The trained models outperform all networks of similar size or smaller in the FaceForensics++ dataset by a large margin on all tasks and achieve >95% accuracy in detecting images generated from state-of-the-art GANs under high compression

  • we show that adding a segmentation loss component and jointly training for forgery classification and localization improves model performance significantly over the base case for all examined tasks. In doing so, we achieve state-of-the-art results in the FaceForensics++ dataset for manipulation-specific training under medium compression with a roughly twentyfold reduction in the number of model parameters

  • the suggested model uses input images of size 128×128, significantly smaller than competing architectures. This is of particular interest for detecting forgeries in a low resolution setting, where model performance could be reduced by upscaling images to fit input size.


II-A Image Manipulation

In this context identity manipulation can be achieved by replacing crops of a source and target face while ensuring that the face orientation of the target matches that of the source. FaceSwap uses facial landmarks to scale and align such crops of source and target faces. A 3D face model is fitted using the facial landmarks and blended with the source image using alpha blending. Deepfakes employ a deep learning approach to face swapping by use of deep autoencoders on individual face crops. The encoder part of the network is shared between all identities in a dataset, while a separate decoder is trained for each identity. This forces the encoder to retain information such as orientation and illumination while the facial characteristics that constitute identity are modelled entirely by the decoder. After training, in order to swap faces the target video frames are passed through the common encoder, followed by the decoder corresponding to the source identity. Computer vision based techniques can also be used to warp an image to produce a desired effect. Face2Face [f2f] is a system for facial reenactment that manipulates pixels in a target video such that the facial expressions, pose and illumination match those in a source video. The system only requires monocular video and was the first one to achieve high performance while running in real time.
Generative Adversarial Networks (GANs) [gan] were introduced to model the data distribution of a training set and produce new samples representative of that distribution. They achieve this task by training two networks in parallel. A generator network receives a fixed size random input and produces an output matching the desired dimensions. A discriminator network receives the output of the generator as well as images from the training set. In each case the role of the discriminator is to classify each image as real or generated, and it is trained by minimizing a classification loss. The generator, on the other hand, has the objective of maximizing the discriminator's loss, resulting in a minimax optimization problem. GANs were the first models to show promising results in the field of image generation. Conditional GANs [cgan] use labels to generate samples corresponding to a specific category. The introduction of cycle consistency loss [cyclegan, discogan] enables training of models that can translate between a source and target domain without the need for paired examples. StarGAN [stargan] can perform image-to-image translations for multiple domains using only a single model, showing impressive results in altering the hair color, gender, age, skin tone and emotion of human faces. PGANs [pgan] gradually increase the resolution of generated images by adding layers to the generator and discriminator, being the first ones to achieve 1024×1024 pixel realistic images of human faces. StyleGAN [stylegan] takes inspiration from research on style transfer [styletransfer] to propose a generator architecture that can learn to separate high level features from stochastic variation, e.g. in the context of human face generation. StyleGAN uses the same principle of progressive growing of generated images and is capable of producing highly realistic human faces [tpdne].
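
For reference, the minimax problem described above is usually written as the value function below, following the standard formulation in [gan]. The symbols $G$, $D$, $p_{data}$ and $p_z$ (generator, discriminator, data distribution and noise prior) are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

Maximizing $V$ over $D$ is equivalent to minimizing the discriminator's binary cross-entropy loss, while the generator minimizes the same quantity.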

II-B Image Manipulation Detection

Image forensics and steganalysis have a long history of research in verification of the authenticity of image based content. They focus on detecting a set of predefined manipulation artifacts that result from key image manipulation operations such as copy-move [cpmv], splicing [splice], rescale [rescale] and rotation [rotate] operations or application of median filters [median1] among others. These detectors generally work well for the type of forgery they were designed for, but their use is limited by the large number of forgery operations a comprehensive system should be able to detect. Additionally, in the context of videos, image forensics models have been shown to deteriorate significantly in performance in the presence of strong compression [facefor++].

More recent literature includes many works using deep learning methods to detect manipulated or generated image content. CNNs trained end-to-end using stochastic gradient descent have been shown to generally outperform other approaches. However, it is common for researchers to draw inspiration from forensics based techniques to introduce useful structural priors to CNN architectures for detecting manipulated content.

Inspired by steganalysis, [bayar] propose a CNN input layer specially designed to detect manipulation features by learning a set of prediction error filters on image pixels while suppressing information on image content. This is achieved by constraining the convolutional kernel to have a central value of -1 and all remaining values to sum to 1, enforcing these constraints after each gradient descent update. The resulting layer is forced to model the relationship between pixels and their neighbourhood, which is expected to differ in manipulated regions irrespective of content.
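
The constraint described above can be illustrated with a small projection step applied to a single kernel. This is only a sketch in plain Python: the function name `project_prediction_error_kernel` is hypothetical, and rescaling the off-centre weights by their sum is one common way of enforcing the constraint, assumed here rather than taken from this paper.

```python
def project_prediction_error_kernel(kernel):
    """Return a copy of a square 2-D kernel whose centre is -1 and
    whose remaining weights sum to 1 (hypothetical helper)."""
    size = len(kernel)
    c = size // 2
    # Zero the centre so it does not take part in the normalisation.
    w = [[0.0 if (i == c and j == c) else float(kernel[i][j])
          for j in range(size)] for i in range(size)]
    total = sum(sum(row) for row in w)
    if abs(total) < 1e-12:
        raise ValueError("off-centre weights sum to zero; cannot normalise")
    # Rescale the off-centre weights so that they sum to 1.
    w = [[v / total for v in row] for row in w]
    # Fix the centre weight to -1.
    w[c][c] = -1.0
    return w


raw = [[0.2, 0.5, 0.1],
       [0.4, 9.0, 0.3],
       [0.6, 0.7, 0.2]]
constrained = project_prediction_error_kernel(raw)
off_centre_sum = sum(sum(row) for row in constrained) + 1.0
```

In a training loop this projection would be re-applied to every first-layer kernel after each optimizer step, so the filter output is the difference between a weighted neighbourhood prediction and the centre pixel.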

The use of CNNs to distinguish between computer generated and real photographic images was explored in [rahmouni]. Their model includes a convolutional feature extractor and a global pooling layer that calculates low order moments from the CNN features, which are passed to a classifier. To apply this process to high resolution samples they split an image into patches, calculate class probabilities for all patches and propose a voting scheme to predict the image class.

Two compact CNN architectures for detecting Deepfakes and Face2Face manipulated videos are presented in [meso]. The authors argue that the strong compression in videos will degrade information in the low level features. Instead they design the networks to focus on the mesoscopic properties of images.

Results from training and testing various models provided in the literature ([rahmouni, bayar, meso]) are presented in [facefor++]. Additionally, the authors train a version of the Xception network [xception] starting from pretrained ImageNet weights. They show that all methods achieve higher than 95% accuracy for uncompressed images but performance drops greatly for medium and high levels of compression, apart from Xception, which achieves the best performance overall. The Xception model is further adapted to use 128×128 image patches and perform pixelwise forgery localization. As a benchmark, a human study is performed showing that current computer based methods clearly outperform humans on the task of detecting manipulated images of faces.

In [cnnrnn] the availability of temporal information to detect face tampering is explored using a model framework common in video processing with deep neural networks and training on the FaceForensics++ dataset. As a first step they train a backbone CNN architecture (ResNet, DenseNet) on single images. This is followed by training a temporal model (bidirectional RNN) on CNN features and then training the system end-to-end. For high compression a DenseNet architecture achieves state of the art precision in binary classification for all classes trained with each class separately.

The intuition that the final image rescaling and alignment steps common in various deepfakes pipelines should introduce face warping artifacts is further explored in [warpdet], who propose a self supervised technique for training deep CNNs to detect deepfakes without the need for expensive negative examples. To do so they aim at simulating those artifacts during training by the following steps: extract and align face regions, apply Gaussian blur smoothing to the aligned face region and affine warp back to the size and orientation of the initial image. Testing on the datasets from [deeptimit, headpose], their method appears to detect deepfakes with high accuracy. In [iio] the authors take advantage of a bias of facial datasets not to include faces with their eyes closed. As a result, deepfakes techniques trained on those datasets tend to produce unrealistic blinking patterns, which the authors use to detect fake content. However, once that bias was recognised it became easy for deepfakes techniques to produce realistic blinking patterns by including pictures of people with their eyes closed in training datasets. Observing that the final deepfakes step of splicing the generated face into the target image should introduce head pose inconsistencies, [headpose] estimate 3D poses of faces separately from landmarks for the central region of the face and landmarks for the whole face, and learn an SVM classifier over differences of the pose estimates.
Placing emphasis on the need to protect against the negative effects of manipulated video content of leaders, [world_leaders] use features derived from head pose and facial landmark tracking over time to model a person's mannerisms while speaking. Their models are able to predict fake content with high accuracy even under high compression but are person specific and thus capable of solving the problem only for specific individuals.
The performance of several image forgery detectors is examined in [marra] for image-to-image translation methods based on GANs. They show very good performance for all models on uncompressed images, which drops significantly in the presence of compression, with only deep architectures achieving high test classification accuracy under these conditions.


Fig. 1: Data Preprocessing. (Left) Sample frame from FaceForensics++ dataset and face crop box in green, (Right) Close-up of the face region: the original crop box is shown in green together with five facial landmarks, the final crop is enclosed in a red rectangle and the center of the landmark locations is shown as a red cross

Most manipulation methods discussed in section II-A are relatively new and constantly evolving. Given the interdependency of manipulation methods and forensics datasets in a supervised learning setting and the large cost of training models for image manipulation, there has unsurprisingly been a shortage of such datasets and benchmarks.

The first publicly available dataset containing samples of manipulated faces in videos used the VidTIMIT dataset [vidtimit], which includes 10 videos for each of its 43 participants. Of those, 32 participants were grouped in pairs according to physical similarities and a publicly available implementation of Deepfakes [df] was used to produce 320 Deepfakes videos in two quality levels. [2stream] use publicly available implementations of FaceSwap techniques [fs, swapme] to produce a dataset of 2010 manipulated images.

The FaceForensics dataset [facefor] includes a base of 1000 videos downloaded from YouTube [youtube8m]. All videos include talking faces, mostly taken from news channels, in frontal view without occlusions and of more than 300 consecutive frames. This set is processed with the Face2Face [f2f] algorithm to create a source-to-target reenactment dataset by randomly choosing source and target videos and a self reenactment dataset by applying Face2Face on the same video. Three versions are produced in each case: uncompressed data and H.264 compressed data with quantization parameters 23 and 40 (low and high levels of compression). As a byproduct of the manipulation method, ground truth masks indicating the manipulated region are available and can be used for forgery localization.
In all the manipulation detection experiments presented below we use the FaceForensics++ dataset [facefor++], which extends FaceForensics to include samples generated from Deepfakes [df] and FaceSwap [fs] and is currently the largest available forgery detection dataset. As in FaceForensics, 1000 videos are generated per manipulation method at the same compression levels. To produce results that are comparable with earlier studies we follow the same partition of the dataset into 720 train, 140 evaluation and 140 test videos as in the original paper.
Additionally, pretrained networks based on [pgan] and [stylegan] were used to generate images of faces. From each technique 70k faces were selected such that a face could be detected using a reasonably high detection threshold and these were split into 56k, 7k, 7k train, validation and test sets respectively. For real images the Flickr-Faces-HQ (FFHQ) dataset [stylegan] of human faces was used. Since both networks were trained with FFHQ it was thought that this dataset’s statistics would differ as little as possible from the generated data and would provide the fairest estimate of a model’s ability to detect generated content. In total, the data used for generated image detection consist of 210k images, 168k, 21k, 21k for train, validation and test sets respectively. Even though these images were not extracted from videos, in order to examine the effect of compression on the detection performance all images were compressed using H.264 coding at levels similar to the ones used in the FaceForensics++ dataset.


Fig. 2: Network Architecture. Layers L0-L12 in gray represent feature tensors, convolution kernels are represented in blue. All dimensions can be found in Table I.

IV-A Data Preprocessing

Incorporating prior knowledge into the data preprocessing pipeline has been shown to improve the performance of trained models [facefor++]. For all the experiments that follow, data preprocessing involves cropping the faces from video frames such that only the relevant information, i.e. the face region, is seen by the model. We apply the following steps to extract square face crops for all image data used in training and evaluation:

  1. a pretrained face detector $b, l = D(x)$ [retina_face] is used to regress face box parameters $b = [b_x, b_y, d_x, d_y]$ (box corner location and box dimensions) as well as five landmark locations $l_i = [l_{x,i}, l_{y,i}],\ i \in [1, 5]$ from an input frame $x$

  2. we derive the crop box side $d_{crop} = 1.1\, d_{max}$ using the largest side of the detected face box, $d_{max} = \max(d_x, d_y)$

  3. we calculate the center of the five facial landmark locations, $l_m = \frac{1}{5} \sum_{i=1}^{5} l_i$

  4. using $d_{crop}$ we extract a square region centered at $l_m$ and having a side of $2 d_{crop}$

  5. in case the defined face region falls outside the image dimensions we pad with zeros such that the final shape is a square

  6. the image is resized to the desired resolution

The process is shown in Fig. 1.
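
The cropping steps can be sketched in plain Python as below. This is an illustration under assumptions, not the pipeline used in the experiments: the function names are hypothetical, the detector output is taken as given, step 4 is read literally (side of twice the crop parameter), coordinates are rounded to integers, and the final resize of step 6 is left out.

```python
def face_crop_box(box, landmarks, scale=1.1):
    """Return (x0, y0, side) of the square crop in integer pixel coordinates."""
    _bx, _by, dx, dy = box                      # box corner and box dimensions
    d_crop = scale * max(dx, dy)                # step 2: d_crop = 1.1 * d_max
    # Step 3: centre of the facial landmarks.
    cx = sum(p[0] for p in landmarks) / len(landmarks)
    cy = sum(p[1] for p in landmarks) / len(landmarks)
    side = round(2 * d_crop)                    # step 4, read literally (assumption)
    return round(cx - side / 2), round(cy - side / 2), side


def crop_with_zero_padding(image, x0, y0, side):
    """Step 5: cut a side x side window; pixels outside the frame stay 0.

    `image` is a list of rows (single channel, for brevity)."""
    height, width = len(image), len(image[0])
    out = [[0] * side for _ in range(side)]
    for r in range(side):
        for c in range(side):
            y, x = y0 + r, x0 + c
            if 0 <= y < height and 0 <= x < width:
                out[r][c] = image[y][x]
    return out


# Toy detector output: box = [bx, by, dx, dy] and five (x, y) landmarks.
crop = face_crop_box((10, 10, 40, 50),
                     [(20, 30), (40, 30), (30, 35), (25, 40), (35, 40)])
patch = crop_with_zero_padding([[1] * 4 for _ in range(4)], -1, -1, 3)
```

Centering the crop on the landmark mean rather than on the detector box keeps the face position consistent across frames even when the regressed box is slightly off.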

IV-B Model Architecture

In designing the model architecture presented below we follow the intuition first presented in [meso] that the required signal for the classification problem at hand comes from mesoscopic properties of images. Visual examination of images in the datasets used does not offer sufficient clues as to the authenticity of a face, as there are many cases of manipulated images which cannot be identified based on appearance as well as real images which appear fake due to compression artifacts. As such we do not expect macroscopic appearance features to contain significant information for the task of manipulation detection. In a human study conducted in [facefor++] humans were shown to perform well below most of the models tested. On the other hand, microscopic features should contain artifacts specific to the manipulation method used; however, being local in nature, we expect that information to be corrupted under heavy video compression.

The proposed network architecture can be seen in Fig. 2 with all relevant dimensions found in Table I. It consists of a feature extractor with 8 convolutional layers and 1 MaxPool layer (layers 1-9) and 2 separate classifiers (layers 10, 12), each followed by SoftMax to derive class probabilities. All convolution steps are followed by ReLU activation [relu] and Batch Normalization [bn]. Convolutional filter strides are set to 1 and the MaxPool stride is 2 for downsampling. No padding is used throughout the network.

TABLE I: Network Architecture

| Layer | Input Layer | Type | Filters | Output | RF |
|---|---|---|---|---|---|
| 0 | - | Input | - | 128x128x3 | 1 |
| 1 | 0 | Conv2d | 3x3, s1 | 126x126x32 | 3 |
| 2 | 1 | Conv2d | 3x3, s1 | 124x124x32 | 5 |
| 3 | 2 | MaxPool2d | 3x3, s2 | 61x61x32 | 9 |
| 4 | 3 | Conv2d | 3x3, s1 | 59x59x64 | 13 |
| 5 | 4 | Conv2d | 3x3, s1 | 57x57x64 | 17 |
| 6 | 5 | Conv2d | 3x3, s1 | 55x55x128 | 21 |
| 7 | 6 | Conv2d | 3x3, s1 | 53x53x128 | 25 |
| 8 | 7 | Conv2d | 3x3, s1 | 51x51x256 | 29 |
| 9 | 8 | Conv2d | 3x3, s1 | 49x49x256 | 33 |
| 10 | 9 | Conv2d* | 1x1, s1 | 49x49x2 | 33 |
| 11 | 9 | AvPool2d | 49x49 | 1x1x256 | 128 |
| 12 | 11 | Conv2d* | 1x1, s1 | 1x1x2 | 128 |
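
The output sizes in Table I follow from the usual no-padding rule, output = floor((input - kernel) / stride) + 1. The sketch below traces them in plain Python and adds a rough parameter count. The count assumes every 3x3 convolution has a bias and is followed by a Batch Normalization layer with a scale and a shift per channel, and that the two 1x1 classifiers have biases; these details are assumptions made for illustration.

```python
# (kind, kernel, stride, output channels) for layers 1-9 of Table I.
LAYERS = [
    ("conv", 3, 1, 32), ("conv", 3, 1, 32), ("pool", 3, 2, None),
    ("conv", 3, 1, 64), ("conv", 3, 1, 64), ("conv", 3, 1, 128),
    ("conv", 3, 1, 128), ("conv", 3, 1, 256), ("conv", 3, 1, 256),
]


def trace(input_size=128, in_channels=3, num_classes=2):
    """Return the per-layer output shapes and an approximate parameter count."""
    size, channels, params, shapes = input_size, in_channels, 0, []
    for kind, kernel, stride, out_channels in LAYERS:
        size = (size - kernel) // stride + 1          # no padding
        if kind == "conv":
            params += kernel * kernel * channels * out_channels + out_channels
            params += 2 * out_channels                # BatchNorm scale and shift
            channels = out_channels
        shapes.append((size, size, channels))
    # Two 1x1 classifiers (layers 10 and 12), each with a bias (assumption).
    params += 2 * (channels * num_classes + num_classes)
    return shapes, params


shapes, n_params = trace()
```

Under these assumptions the count comes to roughly 1.18M parameters, in line with the 1.2M trainable parameters quoted in the results discussion.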

The inspiration for the selected architecture comes from the hypothesis that all patches of a given size and larger should contain artifacts we can use to classify the image as manipulated or not. In that respect we view FCNs as an ensemble of CNNs applied on all image patches, with patch size and stride controlled by the network architecture. Under this view, the GAP operation assumes that the features extracted from every patch are relevant to the authenticity of that patch and similar in nature, so by taking the average of those values we expect the final features to be more robust.
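
One way to make the ensemble view concrete: because the classifier after GAP is linear, applying it to the averaged features gives the same logits as applying it to every patch and averaging the results (the equivalence holds before the softmax, not after). The toy check below uses made-up numbers and hypothetical helper names purely for illustration.

```python
def linear(features, weights, bias):
    """Apply a linear layer to one feature vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def mean_vectors(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


# Four "patches" with three features each, and a 2-class linear classifier.
patch_features = [[0.2, 1.0, -0.5], [0.4, 0.8, 0.1],
                  [0.0, 1.2, -0.3], [0.6, 0.9, 0.2]]
weights = [[1.0, -0.5, 0.3], [-0.7, 0.4, 0.9]]
bias = [0.1, -0.2]

# GAP followed by the classifier ...
logits_gap_then_linear = linear(mean_vectors(patch_features), weights, bias)
# ... equals the average of the per-patch classifier outputs.
logits_mean_of_patches = mean_vectors(
    [linear(f, weights, bias) for f in patch_features])
```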

IV-C Loss Function

To further enforce the contribution from individual patches we use a joint classification and segmentation objective function. For the segmentation task we use the available segmentation masks but do not employ deconvolution layers to match input dimensions; instead we sub-sample the mask at positions corresponding to the central pixels of each patch as seen by the model. Assuming a CNN with a receptive field of $d_{rf} \times d_{rf}$ pixels at a stride $s$, an input image $x \in \mathbb{R}^{N_{in} \times N_{in} \times 3}$, an output feature map $o \in \mathbb{R}^{N_{out} \times N_{out} \times d_o}$, a one-hot segmentation mask $m \in \{0, 1\}^{N_{in} \times N_{in} \times 2}$ and an output class probability map $\hat{y}_{seg} \in \mathbb{R}^{N_{out} \times N_{out} \times 2}$, we only use as labels the central portion $m_c$ of $m$, from $(\frac{d_{rf}}{2}, \frac{d_{rf}}{2})$ to $(N_{in} - \frac{d_{rf}}{2}, N_{in} - \frac{d_{rf}}{2})$ (top left, bottom right coordinates) at stride $s$, which has the same dimensions as $\hat{y}_{seg}$. The process of extracting labels for the segmentation task can be seen in Fig. 3. We use these to define an average Cross-Entropy segmentation loss over all spatial positions:

$$L_{seg} = -\frac{1}{N_{out}^2} \sum_{i=1}^{N_{out}} \sum_{j=1}^{N_{out}} \sum_{k=1}^{2} (m_c)_{ijk} \log (\hat{y}_{seg})_{ijk}$$

At the same time we apply GAP [detcg] over $o$ to derive the average output features $o_m = \frac{1}{N_{out}^2} \sum_{i=1}^{N_{out}} \sum_{j=1}^{N_{out}} o_{ij}$ and a linear layer followed by softmax to derive class probabilities for the image, $\hat{y}_{cls}$. Similarly, these are used to define a classification Cross-Entropy loss using labels $y$:

$$L_{cls} = -\sum_{k=1}^{2} y_k \log (\hat{y}_{cls})_k$$

The joint loss function for the model is defined using a hyperparameter $\lambda_{seg} \in [0, 1]$ as follows:

$$L = \lambda_{seg} L_{seg} + (1 - \lambda_{seg}) L_{cls}$$

The extreme value $\lambda_{seg} = 0$ corresponds to training only for the classification task, while $\lambda_{seg} = 1$ corresponds to segmentation-only training.
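
A minimal numeric sketch of such a joint objective is given below in plain Python. It assumes the joint loss is a convex combination of the two cross-entropy terms weighted by the segmentation hyperparameter, which is one form consistent with the two extreme cases stated above; the function names and toy values are illustrative only.

```python
import math


def cross_entropy(probs, onehot, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, onehot))


def segmentation_loss(prob_map, label_map):
    """Average cross-entropy over all spatial positions of the output map."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(prob_map, label_map):
        for probs, onehot in zip(prob_row, label_row):
            total += cross_entropy(probs, onehot)
            count += 1
    return total / count


def joint_loss(cls_probs, cls_label, seg_probs, seg_labels, lam_seg):
    """Weighted sum of segmentation and classification losses (assumed form)."""
    if not 0.0 <= lam_seg <= 1.0:
        raise ValueError("lam_seg must lie in [0, 1]")
    l_cls = cross_entropy(cls_probs, cls_label)
    l_seg = segmentation_loss(seg_probs, seg_labels)
    return lam_seg * l_seg + (1.0 - lam_seg) * l_cls


# Toy 2x2 output map with two classes (pristine, manipulated).
seg_probs = [[[0.9, 0.1], [0.3, 0.7]], [[0.2, 0.8], [0.6, 0.4]]]
seg_labels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
cls_probs, cls_label = [0.25, 0.75], [0, 1]

only_cls = joint_loss(cls_probs, cls_label, seg_probs, seg_labels, 0.0)
only_seg = joint_loss(cls_probs, cls_label, seg_probs, seg_labels, 1.0)
mixed = joint_loss(cls_probs, cls_label, seg_probs, seg_labels, 0.5)
```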

Fig. 3: Extracting mask for segmentation loss. (Left) Original segmentation mask in white superimposed over manipulated image, extreme top-bottom, left-right receptive field patches are shown in red, (Center) the central region used for the segmentation mask enclosed in green rectangle, (Right) subsampled segmentation labels (stride 2) matching output dimensions used in training
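
The label extraction shown in Fig. 3 can be sketched as a strided slice of the mask. The snippet below uses the receptive field (33) and output size (49) listed in Table I together with stride 2 and a 128×128 input. Rounding the half receptive field down and treating the bottom-right coordinate as inclusive are assumptions chosen so that the label grid comes out at 49×49, and the mask holds class indices rather than one-hot vectors for brevity.

```python
def subsample_mask(mask, d_rf=33, stride=2):
    """Pick the mask pixels at the centres of the patches seen by the network.

    `mask` is a square list of rows; returns the sub-sampled label grid."""
    n_in = len(mask)
    half = d_rf // 2                       # half receptive field, rounded down
    # From (half, half) to (n_in - half, n_in - half), inclusive, at `stride`.
    positions = range(half, n_in - half + 1, stride)
    return [[mask[r][c] for c in positions] for r in positions]


# Toy 128x128 mask with a rectangular manipulated region marked as 1.
mask = [[1 if 40 <= r < 90 and 30 <= c < 100 else 0 for c in range(128)]
        for r in range(128)]
labels = subsample_mask(mask)
```

Sampling the mask directly avoids the need for upsampling layers, at the cost of supervising only one pixel per receptive-field patch.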

IV-D Training and Evaluation

All networks were trained using the Adam optimizer [adam] with a learning rate of $10^{-3}$, an exponential learning rate decay of 0.90 applied at the end of every epoch and L2 regularization with $\lambda = 10^{-5}$. From the FaceForensics++ dataset we extracted one in every five consecutive frames for a total of 102k pristine images. All training consisted of 50 epochs, calculating the classification accuracy every 1000 train steps and keeping the model with the highest validation classification accuracy overall. The test set is used for all results presented below.
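
The learning rate schedule described above can be written out as a one-line function. This sketch assumes epochs are counted from zero, so that the base rate is used throughout the first epoch and the decay factor is applied once after each completed epoch.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.90):
    """Learning rate used during `epoch` (0-indexed) under exponential decay."""
    return base_lr * decay ** epoch


# Rates for the 50 training epochs.
schedule = [learning_rate(e) for e in range(50)]
```

With these settings the rate falls by a little over two orders of magnitude across the 50 epochs, so the later epochs make only small adjustments to the weights.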


V-A Image Manipulation Detection

TABLE II: Binary Classification Accuracy (%) - (TOP) Manipulation Method-specific Training, (BOTTOM) Training with All Manipulation Methods

| λseg / Model | C23 DF | C23 F2F | C23 FS | C40 DF | C40 F2F | C40 FS |
|---|---|---|---|---|---|---|
| 0.0 | 96.80 | 97.72 | 97.57 | 91.60 | 84.47 | 89.72 |
| 0.2 | 97.90 | 98.37 | 97.90 | 91.68 | 84.88 | 89.69 |
| 0.3 | 97.86 | 98.58 | 98.32 | 92.40 | 86.32 | 89.77 |
| 0.4 | 97.86 | 98.44 | 98.10 | 91.95 | 86.20 | 90.15 |
| 0.5 | 97.75 | 98.43 | 98.24 | 91.80 | 86.40 | 90.23 |
| 0.6 | 97.55 | 98.45 | 98.09 | 91.83 | 87.11 | 90.56 |
| 0.7 | 97.51 | 98.38 | 98.17 | 91.71 | 86.81 | 91.26 |
| XceptionNet [facefor++] | 98.85 | 98.36 | 98.23 | 94.28 | 91.56 | 93.70 |
| DenseNet [cnnrnn] | - | - | - | 96.70 | 93.21 | 95.80 |

| λseg / Model | C23 DF | C23 F2F | C23 FS | C40 DF | C40 F2F | C40 FS |
|---|---|---|---|---|---|---|
| 0.0 | 94.54 | 94.73 | 94.16 | 85.92 | 82.89 | 83.49 |
| 0.2 | 95.64 | 96.28 | 95.41 | 86.44 | 83.05 | 84.27 |
| 0.3 | 95.77 | 96.23 | 95.97 | 85.96 | 83.96 | 84.82 |
| 0.4 | 95.81 | 96.79 | 95.51 | 87.06 | 83.52 | 84.55 |
| 0.5 | 96.14 | 96.54 | 95.96 | 86.23 | 83.74 | 84.19 |
| 0.6 | 96.78 | 96.91 | 95.97 | 85.88 | 84.70 | 85.17 |
| 0.7 | 96.65 | 97.13 | 95.92 | 86.83 | 84.98 | 84.77 |
| XceptionNet [facefor++] | 97.49 | 97.69 | 96.79 | 93.36 | 88.09 | 87.42 |

TABLE III: Binary Segmentation Accuracy (%) - (TOP) Manipulation Method-specific Training, (BOTTOM) Training with All Manipulation Methods

| λseg | C23 DF | C23 F2F | C23 FS | C40 DF | C40 F2F | C40 FS |
|---|---|---|---|---|---|---|
| 0.2 | 87.54 | 84.34 | 84.26 | 80.20 | 69.92 | 74.66 |
| 0.3 | 89.28 | 85.45 | 85.37 | 80.39 | 71.36 | 74.70 |
| 0.4 | 89.24 | 85.32 | 86.34 | 80.99 | 70.29 | 75.84 |
| 0.5 | 88.43 | 85.60 | 86.19 | 81.34 | 71.53 | 75.94 |
| 0.6 | 88.23 | 85.68 | 86.43 | 81.23 | 73.11 | 76.35 |
| 0.7 | 88.58 | 85.22 | 86.22 | 81.04 | 73.48 | 76.39 |
| 1.0 | 89.62 | 86.38 | 86.80 | 81.85 | 74.35 | 77.43 |

| λseg | C23 DF | C23 F2F | C23 FS | C40 DF | C40 F2F | C40 FS |
|---|---|---|---|---|---|---|
| 0.2 | 79.87 | 78.78 | 77.95 | 73.63 | 73.81 | 72.91 |
| 0.3 | 81.06 | 81.35 | 79.05 | 74.39 | 74.23 | 73.17 |
| 0.4 | 81.54 | 80.73 | 77.99 | 74.05 | 74.35 | 73.78 |
| 0.5 | 81.36 | 81.28 | 78.76 | 74.33 | 74.83 | 73.86 |
| 0.6 | 82.03 | 81.86 | 79.25 | 75.30 | 74.70 | 74.25 |
| 0.7 | 82.37 | 82.43 | 79.04 | 74.55 | 75.17 | 74.07 |
| 1.0 | 84.22 | 82.93 | 81.52 | 77.07 | 75.79 | 74.85 |

Fig. 4: Precision-Recall curves for models trained with λseg=1.0 individually trained for each manipulation method
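
TABLE IV: Binary Classification Accuracy (%)

| λseg | C23 Compression | | C40 Compression | |
|---|---|---|---|---|
| 0.0 | 99.96 | 99.55 | 99.76 | 97.66 |
| 0.2 | 99.97 | 99.55 | 99.77 | 97.59 |
| 0.3 | 99.98 | 99.59 | 99.81 | 97.60 |
| 0.4 | 99.96 | 99.55 | 99.73 | 97.32 |
| 0.5 | 99.94 | 99.52 | 99.78 | 97.28 |
| 0.6 | 99.97 | 99.56 | 99.74 | 97.43 |
| 0.7 | 99.94 | 99.50 | 99.74 | 97.27 |

| λseg | C23 Compression | | C40 Compression | |
|---|---|---|---|---|
| 0.0 | 99.66 | 99.23 | 97.31 | 95.04 |
| 0.2 | 99.62 | 99.32 | 97.21 | 94.78 |
| 0.3 | 99.63 | 99.31 | 97.33 | 94.89 |
| 0.4 | 99.65 | 99.32 | 97.02 | 94.94 |
| 0.5 | 99.53 | 99.33 | 97.17 | 94.75 |
| 0.6 | 99.61 | 99.20 | 96.99 | 94.84 |
| 0.7 | 99.65 | 99.24 | 97.03 | 94.89 |
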
Fig. 5: Image Segmentation results for C23 compression. Top Left: Deepfakes, Top Right: Face2Face, Bottom Left: FaceSwap, Bottom Right: Pristine.
Fig. 6: Image Segmentation results for C40 compression. Top Left: Deepfakes, Top Right: Face2Face, Bottom Left: FaceSwap, Bottom Right: Pristine. In each image pair, the left image shows the segmentation mask in white and the right image shows the predicted mask in red.

Test classification accuracies per manipulation method can be seen in Table II. Overall there is a noticeable performance boost from the inclusion of the segmentation loss, showing up to +2.6% improvement over the base case λseg=0, which is the worst performer in nearly every case. We note that all models work well for C23 compression and achieve state of the art results (highlighted numbers) for Face2Face and FaceSwap using a significantly smaller network (1.2M trainable parameters) than [facefor++] (22M parameters). For higher compression we note a significant performance deterioration for all models. In every case performance drops when the training set includes samples from all manipulation methods as opposed to method-specific training.
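
The size of the improvement can be checked directly from the method-specific rows of Table II. The short script below copies those accuracies and, for each column, reports the segmentation weight with the highest accuracy and its gain over the classification-only baseline; the variable and function names are illustrative.

```python
COLUMNS = ["C23 DF", "C23 F2F", "C23 FS", "C40 DF", "C40 F2F", "C40 FS"]

# Method-specific training rows of Table II, keyed by the segmentation weight.
ACCURACY = {
    0.0: [96.80, 97.72, 97.57, 91.60, 84.47, 89.72],
    0.2: [97.90, 98.37, 97.90, 91.68, 84.88, 89.69],
    0.3: [97.86, 98.58, 98.32, 92.40, 86.32, 89.77],
    0.4: [97.86, 98.44, 98.10, 91.95, 86.20, 90.15],
    0.5: [97.75, 98.43, 98.24, 91.80, 86.40, 90.23],
    0.6: [97.55, 98.45, 98.09, 91.83, 87.11, 90.56],
    0.7: [97.51, 98.38, 98.17, 91.71, 86.81, 91.26],
}


def best_gain_over_baseline(accuracy):
    """Map each column to (best segmentation weight, gain over the 0.0 row)."""
    baseline = accuracy[0.0]
    gains = {}
    for i, column in enumerate(COLUMNS):
        best = max((lam for lam in accuracy if lam != 0.0),
                   key=lambda lam: accuracy[lam][i])
        gains[column] = (best, round(accuracy[best][i] - baseline[i], 2))
    return gains


gains = best_gain_over_baseline(ACCURACY)
largest = max(gains.values(), key=lambda item: item[1])
```

The largest gain is 2.64 points, obtained for Face2Face under C40 compression with a segmentation weight of 0.6.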

Segmentation accuracies (per pixel classification) are shown in Table III. There, apart from the results shown for λseg=1.0, all models were selected on the basis of best classification accuracy on the evaluation set. Given the additional selection bias for segmentation-only trained models, it is no surprise that they outperform all other models; however, it must be noted that in most cases the best classification performance corresponded to the best choice for the segmentation task as well. Precision-Recall curves for models trained individually can be seen in Fig. 4. Some examples of predicted segmentation masks from the test set can be seen in Figs. 5 and 6 for medium and high compression respectively. In many cases we note that the model is missing part of the chin area, which is due to the extraction process defined in section IV-A.

V-B Image Generation Detection

Per-class classification accuracies for the generated image detection task can be seen in Table IV. The trained networks perform better here, showing this to be an easier task than detecting manipulated images. We notice that the model performs worse for the more realistic StyleGAN images, the difference from PGAN being especially pronounced for the C40 compressed images, and that accuracy drops when the training set is composed of samples from both methods. Including the segmentation loss was not found to improve results here: there is no noticeable jump in performance when it is introduced, as was the case for manipulated images.


In this paper we proposed a simple FCN architecture designed to exploit local patterns found in manipulated images. The same architecture was further trained and tested on a set of images generated by GANs trained on human faces, which proved to be an easier task for the models. Additionally, we defined a process for joint classification-segmentation training and showed that classification performance can greatly benefit from including a large contribution of segmentation loss in the total objective function.
The trained models generally performed better for low compression videos. As finding manipulations in highly compressed images is likely to involve a large number of weak patterns and multiple feature transformation steps, we suggest that deeper architectures should be more suitable for that task. Since the methodology proposed here can be readily applied with few alterations to training deeper architectures, we suggest this line of research as part of a future project. Moreover, since segmentation appears to be the more difficult of the two tasks and to improve classification, we believe that experimenting with more architectures specific to image segmentation could lead to improved results.