MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets
Abstract
When there is a mismatch between the target identity and the driver identity, face reenactment suffers severe degradation in the quality of the result, especially in a few-shot setting. The identity preservation problem, where the model loses the detailed information of the target leading to a defective output, is the most common failure mode. The problem has several potential sources such as the identity of the driver leaking due to the identity mismatch, or dealing with unseen large poses. To overcome such problems, we introduce components that address the mentioned problem: image attention block, target feature alignment, and landmark transformer. Through attending and warping the relevant features, the proposed architecture, called MarioNETte, produces high-quality reenactments of unseen identities in a few-shot setting. In addition, the landmark transformer dramatically alleviates the identity preservation problem by isolating the expression geometry through landmark disentanglement. Comprehensive experiments are performed to verify that the proposed framework can generate highly realistic faces, outperforming all other baselines, even under a significant mismatch of facial characteristics between the target and the driver.
Sungjoo Ha^{*}, Martin Kersner^{*}, Beomsu Kim^{*}, Seokjun Seo^{*}, Dongyoung Kim^{†}
Hyperconnect, Seoul, Republic of Korea
{shurain, martin.kersner, beomsu.kim, seokjun.seo, dongyoung.kim}@hpcnt.com
^{*} Equal contributions, listed in alphabetical order. ^{†} Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Given a target face and a driver face, face reenactment aims to synthesize a reenacted face which is animated by the movement of a driver while preserving the identity of the target.
Many approaches make use of generative adversarial networks (GANs), which have demonstrated great success in image generation tasks. ?; ? (?; ?) achieved high-fidelity face reenactment results by exploiting CycleGAN (?). However, the CycleGAN-based approaches require at least a few minutes of training data for each target and can only reenact predefined identities, which is less attractive in-the-wild, where the reenactment of unseen targets cannot be avoided.
The few-shot face reenactment approaches, therefore, try to reenact any unseen target by utilizing operations such as adaptive instance normalization (AdaIN) (?) or a warping module (?; ?). However, current state-of-the-art methods suffer from what we call the identity preservation problem: the inability to preserve the identity of the target, leading to defective reenactments. The problem is exacerbated as the identity of the driver diverges from that of the target.
Examples of flawed and successful face reenactments, generated by previous approaches and the proposed model, respectively, are illustrated in Figure 1. The failures of previous approaches, for the most part, can be broken down into three different modes (additional example images and videos can be found at http://hyperconnect.github.io/MarioNETte):

1. Neglecting the identity mismatch may lead the identity of the driver to interfere with the face synthesis, such that the generated face resembles the driver (Figure 1a).

2. Insufficient capacity of the compressed vector representation (e.g., an AdaIN layer) to preserve the information of the target identity may cause the produced face to lose the detailed characteristics of the target (Figure 1b).

3. The warping operation incurs defects when dealing with large poses (Figure 1c).
We propose a framework called MarioNETte, which aims to reenact the face of unseen targets in a few-shot manner while preserving the identity without any fine-tuning. We adopt an image attention block and target feature alignment, which allow MarioNETte to directly inject features from the target when generating an image. In addition, we propose a novel landmark transformer which further mitigates the identity preservation problem by adjusting for the identity mismatch in an unsupervised fashion. Our contributions are as follows:

• We propose a few-shot face reenactment framework called MarioNETte, which preserves the target identity even in situations where the facial characteristics of the driver differ widely from those of the target. By utilizing the image attention block, which allows the model to attend to relevant positions of the target feature map, together with target feature alignment, which includes multiple feature-level warping operations, the proposed method improves the quality of face reenactment under different identities.

• We introduce a novel method of landmark transformation which copes with the varying facial characteristics of different people. The proposed method adapts the landmark of a driver to that of the target in an unsupervised manner, thereby mitigating the identity preservation problem without any additional labeled data.

• We compare the state-of-the-art methods when the target and the driver identities coincide and when they differ, using the VoxCeleb1 (?) and CelebV (?) datasets, respectively. Our experiments, including user studies, show that the proposed method outperforms the state-of-the-art methods.
MarioNETte Architecture
Figure 2 illustrates the overall architecture of the proposed model. A conditional generator $G$ generates the reenacted face given the driver $\mathbf{x}$ and the target images ${\{{\mathbf{y}}^{i}\}}_{i=1\mathrm{\dots}K}$, and the discriminator $D$ predicts whether the image is real or not. The generator consists of the following components:

• The preprocessor $P$ utilizes a 3D landmark detector (?) to extract facial keypoints and renders them to a landmark image, yielding ${\mathbf{r}}_{x}=P(\mathbf{x})$ and ${\mathbf{r}}_{y}^{i}=P({\mathbf{y}}^{i})$, corresponding to the driver and the target input, respectively. Note that the proposed landmark transformer is included in the preprocessor. Since we normalize the scale, translation, and rotation of landmarks before using them in the landmark transformer, we utilize 3D landmarks instead of 2D ones.

• The driver encoder ${E}_{x}({\mathbf{r}}_{x})$ extracts pose and expression information from the driver input and produces the driver feature map ${\mathbf{z}}_{x}$.

• The target encoder ${E}_{y}(\mathbf{y},{\mathbf{r}}_{y})$ adopts a U-Net architecture to extract style information from the target input and generates the target feature map ${\mathbf{z}}_{y}$ along with the warped target feature maps $\widehat{\mathbf{S}}$.

• The blender $B({\mathbf{z}}_{x},{\{{\mathbf{z}}_{y}^{i}\}}_{i=1\mathrm{\dots}K})$ receives the driver feature map ${\mathbf{z}}_{x}$ and the target feature maps ${\mathbf{Z}}_{y}=[{\mathbf{z}}_{y}^{1},\mathrm{\dots},{\mathbf{z}}_{y}^{K}]$ to produce the mixed feature map ${\mathbf{z}}_{xy}$. The proposed image attention block is the basic building block of the blender.

• The decoder $Q({\mathbf{z}}_{xy},{\{{\widehat{\mathbf{S}}}^{i}\}}_{i=1\mathrm{\dots}K})$ utilizes the warped target feature maps $\widehat{\mathbf{S}}$ and the mixed feature map ${\mathbf{z}}_{xy}$ to synthesize the reenacted image. The decoder improves the quality of the reenacted image by exploiting the proposed target feature alignment.
For further details, refer to Supplementary Material A1.
Image attention block
To transfer style information of targets to the driver, previous studies encoded target information as a vector and mixed it with the driver feature by concatenation or AdaIN layers (?; ?). However, encoding targets as a spatial-agnostic vector loses the spatial information of the targets. In addition, these methods have no innate design for multiple target images, so summary statistics (e.g., mean or max) are used to deal with multiple targets, which might lose details of the target.
We suggest the image attention block (Figure 3) to alleviate the aforementioned problem. The proposed attention block is inspired by the encoder-decoder attention of the transformer (?), where the driver feature map acts as an attention query and the target feature maps act as attention memory. The proposed attention block attends to the proper positions of each feature (red boxes in Figure 3) while handling multiple target feature maps (i.e., ${\mathbf{Z}}_{y}$).
Given driver feature map ${\mathbf{z}}_{x}\in {\mathbb{R}}^{{h}_{x}\times {w}_{x}\times {c}_{x}}$ and target feature maps ${\mathbf{Z}}_{y}=[{\mathbf{z}}_{y}^{1},\mathrm{\dots},{\mathbf{z}}_{y}^{K}]\in {\mathbb{R}}^{K\times {h}_{y}\times {w}_{y}\times {c}_{y}}$, the attention is calculated as follows:
$$\begin{aligned}\mathbf{Q}&={\mathbf{z}}_{x}{\mathbf{W}}_{q}+{\mathbf{P}}_{x}{\mathbf{W}}_{qp}&&\in {\mathbb{R}}^{{h}_{x}\times {w}_{x}\times {c}_{a}}\\ \mathbf{K}&={\mathbf{Z}}_{y}{\mathbf{W}}_{k}+{\mathbf{P}}_{y}{\mathbf{W}}_{kp}&&\in {\mathbb{R}}^{K\times {h}_{y}\times {w}_{y}\times {c}_{a}}\\ \mathbf{V}&={\mathbf{Z}}_{y}{\mathbf{W}}_{v}&&\in {\mathbb{R}}^{K\times {h}_{y}\times {w}_{y}\times {c}_{x}}\end{aligned}$$  (1)
$$A(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{f(\mathbf{Q})f{(\mathbf{K})}^{T}}{\sqrt{{c}_{a}}}\right)f(\mathbf{V}),$$  (2) 
where $f:{\mathbb{R}}^{{d}_{1}\times \mathrm{\dots}\times {d}_{k}\times c}\to {\mathbb{R}}^{({d}_{1}\times \mathrm{\dots}\times {d}_{k})\times c}$ is a flattening function, all $\mathbf{W}$ are linear projection matrices that map to the proper number of channels at the last dimension, and ${\mathbf{P}}_{x}$ and ${\mathbf{P}}_{y}$ are sinusoidal positional encodings which encode the coordinates of the feature maps (further details of the sinusoidal positional encodings we used are described in Supplementary Material A2). Finally, the output $A(\mathbf{Q},\mathbf{K},\mathbf{V})\in {\mathbb{R}}^{({h}_{x}\times {w}_{x})\times {c}_{x}}$ is reshaped to ${\mathbb{R}}^{{h}_{x}\times {w}_{x}\times {c}_{x}}$.
Instance normalization, a residual connection, and a convolution layer follow the attention layer to generate the output feature map ${\mathbf{z}}_{xy}$. The image attention block offers a direct mechanism for transferring information from multiple target images to the pose of the driver.
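As an illustration of Equations 1 and 2, the following is a minimal NumPy sketch of the attention computation only. It is not the authors' implementation: the function names, the positional-encoding channel count, and the random inputs used to exercise it are assumptions, and the instance normalization, residual connection, and convolution that follow the attention layer are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def image_attention(z_x, Z_y, P_x, P_y, W_q, W_qp, W_k, W_kp, W_v):
    """Attention of Eq. 1-2 (sketch).

    z_x: (h_x, w_x, c_x) driver feature map (query source)
    Z_y: (K, h_y, w_y, c_y) target feature maps (memory)
    P_x: (h_x, w_x, c_p), P_y: (h_y, w_y, c_p) positional encodings;
         c_p is a hypothetical channel count, P_y is shared by all K targets.
    """
    h_x, w_x, c_x = z_x.shape
    Q = z_x @ W_q + P_x @ W_qp            # (h_x, w_x, c_a)
    K = Z_y @ W_k + P_y @ W_kp            # (K, h_y, w_y, c_a), P_y broadcasts over K
    V = Z_y @ W_v                         # (K, h_y, w_y, c_x)
    c_a = Q.shape[-1]
    # f(.): flatten every axis except the channel axis.
    q = Q.reshape(-1, c_a)                # (h_x*w_x, c_a)
    k = K.reshape(-1, c_a)                # (K*h_y*w_y, c_a)
    v = V.reshape(-1, c_x)                # (K*h_y*w_y, c_x)
    weights = softmax(q @ k.T / np.sqrt(c_a), axis=-1)
    out = weights @ v                     # (h_x*w_x, c_x)
    return out.reshape(h_x, w_x, c_x), weights
```

Because the keys of all $K$ targets are flattened into a single memory, each driver position can draw from any position of any target image, which is why no summary statistic over targets is needed.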
Target feature alignment
The fine-grained details of the target identity can be preserved through the warping of low-level features (?). Unlike previous approaches that estimate a warping flow map or an affine transform matrix by computing the difference between keypoints of the target and the driver (?; ?; ?), we propose a target feature alignment (Figure 4) which warps the target feature maps in two stages: (1) target pose normalization generates pose-normalized target feature maps and (2) driver pose adaptation aligns the normalized target feature maps to the pose of the driver. The two-stage process allows the model to better handle the structural disparities of different identities. The details are as follows:

1. Target pose normalization. In the target encoder ${E}_{y}$, the encoded feature maps ${\{{\mathbf{S}}_{j}\}}_{j=1\mathrm{\dots}{n}_{y}}$ are processed into $\widehat{\mathbf{S}}=\{\mathcal{T}({\mathbf{S}}_{1};{\mathbf{f}}_{y}),\mathrm{\dots},\mathcal{T}({\mathbf{S}}_{{n}_{y}};{\mathbf{f}}_{y})\}$ by the estimated normalization flow map ${\mathbf{f}}_{y}$ of the target and the warping function $\mathcal{T}$ (① in Figure 4). The subsequent warp-alignment block in the decoder treats $\widehat{\mathbf{S}}$ in a target pose-agnostic manner.

2. Driver pose adaptation. The warp-alignment block in the decoder receives ${\{{\widehat{\mathbf{S}}}^{i}\}}_{i=1\mathrm{\dots}K}$ and the output $\mathbf{u}$ of the previous block of the decoder. In a few-shot setting, we average the resolution-compatible feature maps from different target images (i.e., ${\widehat{\mathbf{S}}}_{j}={\sum}_{i}{\widehat{\mathbf{S}}}_{j}^{i}/K$). To adapt the pose-normalized feature maps to the pose of the driver, we generate an estimated flow map of the driver ${\mathbf{f}}_{u}$ using a $1\times 1$ convolution that takes $\mathbf{u}$ as the input. Alignment by $\mathcal{T}({\widehat{\mathbf{S}}}_{j};{\mathbf{f}}_{u})$ follows (② in Figure 4). Then, the result is concatenated to $\mathbf{u}$ and fed into the following residual upsampling block.
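The two warping stages can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the text does not specify the form of $\mathcal{T}$, so the `warp` function below assumes backward bilinear sampling with per-pixel offsets in pixel units, and the flow maps are taken as given inputs rather than predicted by a network. The names `warp` and `align_targets` are hypothetical.

```python
import numpy as np

def warp(S, flow):
    """Assumed form of T(S; f): backward bilinear warp.

    S: (h, w, c) feature map; flow: (h, w, 2) offsets (dx, dy) in pixels.
    Output position (y, x) samples S at (y + dy, x + dx), clamped to the border.
    """
    h, w, _ = S.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * S[y0, x0] + wx * S[y0, x1]
    bottom = (1 - wx) * S[y1, x0] + wx * S[y1, x1]
    return (1 - wy) * top + wy * bottom

def align_targets(S_list, f_y_list, f_u):
    """Two-stage alignment for one resolution level j.

    S_list:   K target feature maps S_j^i, each (h, w, c)
    f_y_list: K normalization flow maps, each (h, w, 2)
    f_u:      driver flow map (h, w, 2)
    """
    # Stage 1: target pose normalization, applied per target image.
    normalized = [warp(S, f) for S, f in zip(S_list, f_y_list)]
    # Few-shot aggregation: average the pose-normalized maps over the K targets.
    S_hat = np.mean(normalized, axis=0)
    # Stage 2: driver pose adaptation.
    return warp(S_hat, f_u)
```

In the decoder, the aligned map would then be concatenated with $\mathbf{u}$ along the channel axis, for example with `np.concatenate([u, aligned], axis=-1)`.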
Landmark Transformer
Large structural differences between two facial landmarks may lead to severe degradation of the quality of the reenactment. The usual approach to such a problem has been to learn a transformation for every identity (?) or to prepare paired landmark data with the same expressions (?). However, these methods are unnatural in a few-shot setting where we handle unseen identities, and moreover, the labeled data is hard to acquire. To overcome this difficulty, we propose a novel landmark transformer which transfers the facial expression of the driver to an arbitrary target identity. The landmark transformer utilizes multiple videos of unlabeled human faces and is trained in an unsupervised manner.
Landmark decomposition
Given video footage of different identities, we denote $\mathbf{x}(c,t)$ as the $t$-th frame of the $c$-th video, and $\mathbf{l}(c,t)$ as its 3D facial landmark. We first transform every landmark into a normalized landmark $\overline{\mathbf{l}}(c,t)$ by normalizing the scale, translation, and rotation. Inspired by 3D morphable models of the face (?), we assume that normalized landmarks can be decomposed as follows:
$$\overline{\mathbf{l}}(c,t)={\overline{\mathbf{l}}}_{m}+{\overline{\mathbf{l}}}_{id}(c)+{\overline{\mathbf{l}}}_{exp}(c,t),$$  (3)
where ${\overline{\mathbf{l}}}_{m}$ is the average facial landmark geometry computed by taking the mean over all landmarks, ${\overline{\mathbf{l}}}_{id}(c)$ denotes the landmark geometry of identity $c$, computed by ${\overline{\mathbf{l}}}_{id}(c)={\sum}_{t}\overline{\mathbf{l}}(c,t)/{T}_{c}-{\overline{\mathbf{l}}}_{m}$ where ${T}_{c}$ is the number of frames of the $c$-th video, and ${\overline{\mathbf{l}}}_{exp}(c,t)$ corresponds to the expression geometry of the $t$-th frame. The decomposition leads to ${\overline{\mathbf{l}}}_{exp}(c,t)=\overline{\mathbf{l}}(c,t)-{\overline{\mathbf{l}}}_{m}-{\overline{\mathbf{l}}}_{id}(c)$.
Given a target landmark $\overline{\mathbf{l}}({c}_{y},{t}_{y})$ and a driver landmark $\overline{\mathbf{l}}({c}_{x},{t}_{x})$ we wish to generate the following landmark:
$$\overline{\mathbf{l}}({c}_{x}\to {c}_{y},{t}_{x})={\overline{\mathbf{l}}}_{m}+{\overline{\mathbf{l}}}_{id}({c}_{y})+{\overline{\mathbf{l}}}_{exp}({c}_{x},{t}_{x}),$$  (4)
i.e., a landmark with the identity of the target and the expression of the driver. Computing ${\overline{\mathbf{l}}}_{id}({c}_{y})$ and ${\overline{\mathbf{l}}}_{exp}$ is possible if enough images of ${c}_{y}$ are given, but in a few-shot setting, it is difficult to disentangle the landmark of an unseen identity into the two terms.
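When many frames per identity are available, the decomposition of Equation 3 and the transfer of Equation 4 reduce to a few array means. The NumPy sketch below illustrates this; the function names and the dictionary layout of the input are assumptions made for the example, and the landmarks are assumed to be already normalized for scale, translation, and rotation.

```python
import numpy as np

def decompose(landmarks_by_video):
    """Split normalized landmarks into mean, identity, and expression terms.

    landmarks_by_video: dict mapping a video id c to an array of shape
    (T_c, 68, 3) holding the normalized landmarks of every frame.
    """
    all_frames = np.concatenate(list(landmarks_by_video.values()), axis=0)
    l_m = all_frames.mean(axis=0)                       # mean over all landmarks
    # Identity geometry: per-video mean minus the global mean.
    l_id = {c: l.mean(axis=0) - l_m for c, l in landmarks_by_video.items()}
    # Expression geometry: what remains for each frame.
    l_exp = {c: l - l_m - l_id[c] for c, l in landmarks_by_video.items()}
    return l_m, l_id, l_exp

def transfer(l_m, l_id_target, l_exp_driver):
    """Eq. 4: identity of the target combined with the expression of the driver."""
    return l_m + l_id_target + l_exp_driver
```

By construction, the expression term of each video averages to zero over its frames, which is what makes the per-video mean usable as an identity estimate and also why a single image of an unseen identity is not enough to separate the two terms.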
Landmark disentanglement
To decouple the identity and the expression geometry in a few-shot setting, we introduce a neural network to regress the coefficients for linear bases. Previously, such an approach has been widely used in modeling complex face geometries (?). We separate expression landmarks into semantic groups of the face (e.g., mouth, nose, and eyes) and perform PCA on each group to extract the expression bases from the training data:
$${\overline{\mathbf{l}}}_{exp}(c,t)=\sum _{k=1}^{{n}_{exp}}{\alpha}_{k}(c,t){\mathbf{b}}_{exp,k}={\mathbf{b}}_{exp}^{T}\bm{\alpha}(c,t),$$  (5) 
where ${\mathbf{b}}_{exp,k}$ and ${\alpha}_{k}$ represent the basis and the corresponding coefficient, respectively.
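The basis extraction of Equation 5 can be illustrated with a singular value decomposition. The sketch below handles a single semantic group and makes two assumptions that the text does not state: the expression landmarks are flattened to row vectors, and they are not re-centered before the decomposition (Equation 5 has no mean term). The function names are hypothetical.

```python
import numpy as np

def expression_bases(l_exp, n_exp):
    """Top n_exp principal directions of the expression landmarks.

    l_exp: (N, d) matrix, one flattened expression landmark per row.
    Returns b_exp with shape (n_exp, d) and orthonormal rows.
    """
    _, _, vt = np.linalg.svd(l_exp, full_matrices=False)
    return vt[:n_exp]

def coefficients(l_exp, b_exp):
    """Project expression landmarks onto the bases: alpha, shape (N, n_exp)."""
    return l_exp @ b_exp.T

def reconstruct(alpha, b_exp):
    """Eq. 5 for row vectors: l_exp = b_exp^T alpha."""
    return alpha @ b_exp
```

With per-group bases, the same three functions would be applied to the landmark subset of each group (mouth, nose, eyes, and so on) and the reconstructions written back to the corresponding landmark indices.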
The proposed neural network, a landmark disentangler $M$, estimates $\bm{\alpha}(c,t)$ given an image $\mathbf{x}(c,t)$ and a landmark $\overline{\mathbf{l}}(c,t)$. Figure 5 illustrates the architecture of the landmark disentangler. Once the model is trained, the identity and the expression geometry can be computed as follows:
$$\begin{aligned}\widehat{\bm{\alpha}}(c,t)&=M(\mathbf{x}(c,t),\overline{\mathbf{l}}(c,t)),\\ {\widehat{\mathbf{l}}}_{exp}(c,t)&={\lambda}_{exp}{\mathbf{b}}_{exp}^{T}\widehat{\bm{\alpha}}(c,t),\\ {\widehat{\mathbf{l}}}_{id}(c)&=\overline{\mathbf{l}}(c,t)-{\overline{\mathbf{l}}}_{m}-{\widehat{\mathbf{l}}}_{exp}(c,t),\end{aligned}$$  (6)
where ${\lambda}_{exp}$ is a hyperparameter that controls the intensity of the predicted expressions from the network. The image feature extracted by a ResNet-50 and the landmark, $\overline{\mathbf{l}}(c,t)-{\overline{\mathbf{l}}}_{m}$, are fed into a 2-layer MLP to predict $\widehat{\bm{\alpha}}(c,t)$.
During inference, the target and the driver landmarks are processed according to Equation 6. When multiple target images are given, we take the mean value over all ${\widehat{\mathbf{l}}}_{id}({c}_{y})$. Finally, the landmark transformer converts the landmark as:
$$\widehat{\mathbf{l}}({c}_{x}\to {c}_{y},{t}_{x})={\overline{\mathbf{l}}}_{m}+{\widehat{\mathbf{l}}}_{id}({c}_{y})+{\widehat{\mathbf{l}}}_{exp}({c}_{x},{t}_{x}).$$  (7)
Denormalization, which recovers the original scale, translation, and rotation, is followed by the rasterization that generates a landmark image suitable for the generator to consume. Further details of the landmark transformer are described in Supplementary Material B.
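The inference procedure of Equations 6 and 7 can be sketched as below. The disentangler `M` is passed in as a plain callable because the trained network is not part of this example; treating it as `M(image, l_bar)` returning a coefficient vector is an assumption about its interface, as is the use of a single basis matrix instead of per-group bases. Denormalization and rasterization are left out.

```python
import numpy as np

def disentangle(M, image, l_bar, l_m, b_exp, lambda_exp=1.0):
    """Eq. 6: estimate identity and expression geometry for one frame.

    M:     hypothetical callable (image, l_bar) -> alpha_hat of shape (n_exp,)
    l_bar: (68, 3) normalized landmark; l_m: (68, 3) mean landmark
    b_exp: (n_exp, 68 * 3) expression bases
    """
    alpha_hat = np.asarray(M(image, l_bar))
    l_exp_hat = lambda_exp * (alpha_hat @ b_exp).reshape(l_bar.shape)
    l_id_hat = l_bar - l_m - l_exp_hat
    return l_id_hat, l_exp_hat

def transform_landmark(M, driver, targets, l_m, b_exp, lambda_exp=1.0):
    """Eq. 7: target identity plus driver expression.

    driver:  (image, l_bar) pair for the driver frame
    targets: list of (image, l_bar) pairs, one per target image
    """
    _, l_exp_x = disentangle(M, driver[0], driver[1], l_m, b_exp, lambda_exp)
    # With several target images, average the identity estimates.
    l_id_y = np.mean(
        [disentangle(M, img, l, l_m, b_exp, lambda_exp)[0] for img, l in targets],
        axis=0,
    )
    return l_m + l_id_y + l_exp_x
```

Setting `lambda_exp` below 1 damps the transferred expression, and a disentangler that always predicts zero coefficients makes the output collapse to the mean target landmark, which is a convenient sanity check.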
Experimental Setup
Datasets
We trained our model and the baselines using VoxCeleb1 (?), which contains videos of size $256\times 256$ of 1,251 different identities. We utilized the test split of VoxCeleb1 and CelebV (?) for evaluating self-reenactment and reenactment under a different identity, respectively. We created the test set by sampling 2,083 image sets from 100 randomly selected videos of the VoxCeleb1 test split, and uniformly sampled 2,000 image sets from every identity from CelebV. The CelebV data includes videos of five different celebrities of widely varying characteristics, which we utilize to evaluate the performance of the models reenacting unseen targets, similar to an in-the-wild scenario. Further details of the loss function and the training method can be found in Supplementary Material A3 and A4.
Baselines
MarioNETte variants, with and without the landmark transformer (MarioNETte+LT and MarioNETte, respectively), are compared with state-of-the-art models for few-shot face reenactment. Details of each baseline are as follows:

• X2Face (?). X2Face utilizes direct image warping. We used the pretrained model provided by the authors, trained on VoxCeleb1.

• MonkeyNet (?). MonkeyNet adopts feature-level warping. We used the implementation provided by the authors. Due to the structure of the method, MonkeyNet can only receive a single target image.

• NeuralHead (?). NeuralHead exploits AdaIN layers. Since a reference implementation is absent, we made an honest attempt to reproduce the results. Our implementation is a feed-forward version of their model (NeuralHeadFF) where we omit the meta-learning as well as the fine-tuning phase, because we are interested in using a single model to deal with multiple identities.
Metrics
We compare the models based on the following metrics to evaluate the quality of the generated images. Structural similarity (SSIM) (?) and peak signal-to-noise ratio (PSNR) evaluate the low-level similarity between the generated image and the ground-truth image. We also report the masked SSIM (MSSIM) and masked PSNR (MPSNR), where the measurements are restricted to the facial region.
In the absence of a ground-truth image, as is the case when a different identity drives the target face, the following metrics are more relevant. Cosine similarity (CSIM) of embedding vectors generated by a pretrained face recognition model (?) is used to evaluate the quality of identity preservation. To inspect the capability of the model to properly reenact the pose and the expression of the driver, we compute PRMSE, the root mean square error of the head pose angles, and AUCON, the ratio of identical facial action unit values, between the generated images and the driving images. OpenFace (?) is utilized to compute pose angles and action unit values.
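For concreteness, the formulas behind four of these metrics can be written in a few lines of NumPy. The sketch assumes the inputs have already been produced elsewhere: embedding vectors from a face recognition model, and pose angles and binarized action unit values from a tool such as OpenFace. The peak value of 255, the angle units, and the function names are assumptions, and the exact preprocessing used in the paper is not reproduced here.

```python
import numpy as np

def psnr(generated, reference, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def csim(emb_a, emb_b):
    """Cosine similarity of two face-embedding vectors."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def prmse(pose_generated, pose_driver):
    """Root mean square error over head pose angles (same units as the input)."""
    diff = np.asarray(pose_generated, dtype=float) - np.asarray(pose_driver, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def aucon(au_generated, au_driver):
    """Fraction of action unit values that are identical in both images."""
    return float(np.mean(np.asarray(au_generated) == np.asarray(au_driver)))
```

A masked variant such as MPSNR would apply the same formula after selecting only the pixels inside a facial-region mask.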
Experimental Results
Models were compared under self-reenactment and reenactment of different identities, including a user study. Ablation tests were conducted as well. All experiments were conducted under two different settings: one-shot and few-shot, where one or eight target images were used, respectively.
Self-reenactment
Table 1: Evaluation results under the self-reenactment setting on VoxCeleb1.

| Model (# target) | CSIM $\uparrow$ | SSIM $\uparrow$ | MSSIM $\uparrow$ | PSNR $\uparrow$ | MPSNR $\uparrow$ | PRMSE $\downarrow$ | AUCON $\uparrow$ |
|---|---|---|---|---|---|---|---|
| X2Face (1) | 0.689 | 0.719 | 0.941 | 22.537 | 31.529 | 3.26 | 0.813 |
| MonkeyNet (1) | 0.697 | 0.734 | 0.934 | 23.472 | 30.580 | 3.46 | 0.770 |
| NeuralHeadFF (1) | 0.229 | 0.635 | 0.923 | 20.818 | 29.599 | 3.76 | 0.791 |
| MarioNETte (1) | 0.755 | 0.744 | 0.948 | 23.244 | 32.380 | 3.13 | 0.825 |
| X2Face (8) | 0.762 | 0.776 | 0.956 | 24.326 | 33.328 | 3.21 | 0.826 |
| NeuralHeadFF (8) | 0.239 | 0.645 | 0.925 | 21.362 | 29.952 | 3.69 | 0.795 |
| MarioNETte (8) | 0.828 | 0.786 | 0.958 | 24.905 | 33.645 | 2.57 | 0.850 |
Table 1 shows the evaluation results of the models under the self-reenactment setting on VoxCeleb1. MarioNETte surpasses the other models in every metric under the few-shot setting and in every metric except PSNR under the one-shot setting, where MonkeyNet scores higher. Even there, MarioNETte shows the best MPSNR, which implies that it performs better on the facial region than the baselines. The low CSIM of NeuralHeadFF is indirect evidence of the lack of capacity in AdaIN-based methods.
Reenacting Different Identity
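Table 2: Evaluation results of reenacting a different identity on CelebV.

| Model (# target) | CSIM $\uparrow$ | PRMSE $\downarrow$ | AUCON $\uparrow$ |
|---|---|---|---|
| X2Face (1) | 0.450 | 3.62 | 0.679 |
| MonkeyNet (1) | 0.451 | 4.81 | 0.584 |
| NeuralHeadFF (1) | 0.108 | 3.30 | 0.722 |
| MarioNETte (1) | 0.520 | 3.41 | 0.710 |
| MarioNETte+LT (1) | 0.568 | 3.70 | 0.684 |
| X2Face (8) | 0.484 | 3.15 | 0.709 |
| NeuralHeadFF (8) | 0.120 | 3.26 | 0.723 |
| MarioNETte (8) | 0.608 | 3.26 | 0.717 |
| MarioNETte+LT (8) | 0.661 | 3.57 | 0.691 |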
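Table 3: User study results. The two middle columns report the winning ratio of each model against MarioNETte and against MarioNETte+LT.

| Model (# target) | vs. MarioNETte | vs. MarioNETte+LT | Realism $\uparrow$ |
|---|---|---|---|
| X2Face (1) | 0.07 | 0.09 | 0.093 |
| MonkeyNet (1) | 0.05 | 0.09 | 0.100 |
| NeuralHeadFF (1) | 0.17 | 0.17 | 0.087 |
| MarioNETte (1) | - | 0.51 | 0.140 |
| MarioNETte+LT (1) | - | - | 0.187 |
| X2Face (8) | 0.09 | 0.07 | 0.047 |
| NeuralHeadFF (8) | 0.15 | 0.16 | 0.080 |
| MarioNETte (8) | - | 0.52 | 0.147 |
| MarioNETte+LT (8) | - | - | 0.280 |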
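Table 4: Ablation test results for reenacting a different identity.

| Model (# target) | CSIM $\uparrow$ | PRMSE $\downarrow$ | AUCON $\uparrow$ |
|---|---|---|---|
| AdaIN (1) | 0.063 | 3.47 | 0.724 |
| +Attention (1) | 0.333 | 3.17 | 0.729 |
| +Alignment (1) | 0.530 | 3.44 | 0.700 |
| MarioNETte (1) | 0.520 | 3.41 | 0.710 |
| AdaIN (8) | 0.069 | 3.40 | 0.723 |
| +Attention (8) | 0.472 | 3.22 | 0.727 |
| +Alignment (8) | 0.605 | 3.27 | 0.709 |
| MarioNETte (8) | 0.608 | 3.26 | 0.717 |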
Table 2 displays the evaluation results of reenacting a different identity on CelebV, and Figure 6 shows generated images from the proposed method and the baselines. MarioNETte and MarioNETte+LT preserve the target identity adequately, thereby outperforming the other models in CSIM. The proposed method alleviates the identity preservation problem regardless of whether the driver is of the same identity or not. While NeuralHeadFF exhibits slightly better performance in terms of PRMSE and AUCON compared to MarioNETte, the low CSIM of NeuralHeadFF reflects its failure to preserve the target identity. The landmark transformer significantly boosts identity preservation at the cost of slightly worse PRMSE and AUCON. The degradation may be due to the PCA bases for the expression disentanglement not being diverse enough to span the whole space of expressions. Moreover, the disentanglement of identity and expression itself is a non-trivial problem, especially in a one-shot setting.
User Study
Two types of user studies are conducted to assess the performance of the proposed model:

• Comparative analysis. Given three example images of the target and a driver image, we displayed two images generated by different models and asked human evaluators to select the image with higher quality. The users were asked to assess the quality of an image in terms of (1) identity preservation, (2) reenactment of the driver’s pose and expression, and (3) photorealism. We report the winning ratio of the baseline models compared to our proposed models. We believe that the user-reported score reflects the quality of different models better than other indirect metrics.

• Realism analysis. Similar to the user study protocol of ? (?), three images of the same person, where two of the photos were taken from a video and the remaining one was generated by the model, were presented to human evaluators. Users were instructed to choose the image that differs from the other two in terms of identity under a three-second time limit. We report the ratio of deception, which demonstrates the identity preservation and the photorealism of each model.

For both studies, 150 examples were sampled from CelebV, which were evenly distributed to 100 different human evaluators.
Table 3 shows that our models are preferred over existing methods and achieve higher realism scores by a large margin. The result demonstrates the capability of MarioNETte to create photorealistic reenactments while preserving the target identity in terms of human perception. We see a slight preference for MarioNETte over MarioNETte+LT, which agrees with Table 2, as MarioNETte+LT has better identity preservation capability at the expense of a slight degradation in expression transfer. Since MarioNETte+LT surpasses all other models in realism score, reaching almost twice the score of even MarioNETte in the few-shot setting, we consider the minor decline in expression transfer a good compromise.
Ablation Test
We performed an ablation test to investigate the effectiveness of the proposed components. While keeping everything else the same, we compare the following configurations for reenacting different identities: (1) MarioNETte is the proposed method, where both the image attention block and the target feature alignment are applied. (2) AdaIN corresponds to the same model as MarioNETte, where the image attention block is replaced with an AdaIN residual block and the target feature alignment is omitted. (3) +Attention is a MarioNETte where only the image attention block is applied. (4) +Alignment only employs the target feature alignment.
Table 4 shows the results of the ablation test. For identity preservation (i.e., CSIM), AdaIN has a hard time combining style features when depending solely on AdaIN residual blocks. +Attention alleviates the problem immensely in both the one-shot and few-shot settings by attending to the proper coordinates. While +Alignment exhibits a higher CSIM than +Attention, it struggles to generate plausible images for unseen poses and expressions, leading to worse PRMSE and AUCON. Taking advantage of both attention and target feature alignment, MarioNETte outperforms +Alignment in every metric under consideration except the one-shot CSIM (0.520 versus 0.530).
Relying entirely on target feature alignment for reenactment, +Alignment is vulnerable to failures caused by large differences in pose between the target and the driver, which MarioNETte can overcome. Given a single driver image along with three target images (Figure 7a), +Alignment has defects on the forehead (denoted by arrows in Figure 7b). This is due to (1) warping low-level features from a large-pose input and (2) aggregating features from multiple targets with diverse poses. MarioNETte, on the other hand, gracefully handles the situation by attending to the proper image among the several target images as well as to adequate spatial coordinates in that image. The attention map, highlighting the area on which the image attention block is focusing, is illustrated in white in Figure 7a. Note that MarioNETte attends to the forehead and to the adequate target images (Target 2 and 3 in Figure 7a), which have poses similar to that of the driver.
Related Works
The classical approach to face reenactment commonly involves the use of explicit 3D modeling of human faces (?) where the 3DMM parameters of the driver and the target are computed from a single image and eventually blended (?; ?). Image warping is another popular approach where the target image is modified using the estimated flow obtained from 3D models (?) or sparse landmarks (?). Face reenactment studies have embraced the recent success of neural networks, exploring different image-to-image translation architectures (?) such as the works of ? (?) and that of ? (?), which incorporated the cycle consistency loss (?). A hybrid of the two approaches has been studied as well. ? (?) trained an image translation network which maps a reenacted render of a 3D face model into a photorealistic output.
Architectures capable of blending the style information of the target with the spatial information of the driver have been proposed recently. The AdaIN (?; ?; ?) layer, attention mechanisms (?; ?; ?), deformation operations (?; ?), and GAN-based methods (?) have all seen wide adoption. Similar ideas have been applied to few-shot face reenactment settings, such as the use of image-level (?) and feature-level (?) warping, and the AdaIN layer in conjunction with meta-learning (?). The identity mismatch problem has been studied through methods such as CycleGAN-based landmark transformers (?) and landmark swappers (?). While effective, these methods either require an independent model per person or a dataset with image pairs that may be hard to acquire.
Conclusions
In this paper, we have proposed a framework for few-shot face reenactment. Our proposed image attention block and target feature alignment, together with the landmark transformer, allow us to handle the identity mismatch caused by using the landmarks of a different person. The proposed method does not need an additional fine-tuning phase for identity adaptation, which significantly increases the usefulness of the model when deployed in-the-wild. Our experiments, including human evaluation, indicate that the proposed method outperforms existing approaches.
One exciting avenue for future work is to improve the landmark transformer to better handle the landmark disentanglement to make the reenactment even more convincing.
References
Supplemental Materials
Appendix A MarioNETte Architecture Details
Architecture design
Given a driver image $\mathbf{x}$ and $K$ target images ${\{{\mathbf{y}}^{i}\}}_{i=1\mathrm{\dots}K}$, the proposed fewshot face reenactment framework which we call MarioNETte first generates 2D landmark images (i.e. ${\mathbf{r}}_{x}$ and ${\{{\mathbf{r}}_{y}^{i}\}}_{i=1\mathrm{\dots}K}$). We utilize a 3D landmark detector $\mathcal{K}:{\mathbb{R}}^{h\times w\times 3}\stackrel{}{\to}{\mathbb{R}}^{68\times 3}$ (?) to extract facial keypoints which includes information about pose and expression denoted as ${\mathbf{l}}_{x}=\mathcal{K}(\mathbf{x})$ and ${\mathbf{l}}_{y}^{i}=\mathcal{K}({\mathbf{y}}^{i})$, respectively. We further rasterize 3D landmarks to an image by rasterizer $\mathcal{R}$, resulting in ${\mathbf{r}}_{x}=\mathcal{R}({\mathbf{l}}_{x}),{\mathbf{r}}_{y}^{i}=\mathcal{R}({\mathbf{l}}_{y}^{i})$.
We utilize a simple rasterizer that orthogonally projects 3D landmark points, e.g., $(x,y,z)$, onto the 2D $XY$-plane, e.g., $(x,y)$, and we group the projected landmarks into 8 categories: left eye, right eye, contour, nose, left eyebrow, right eyebrow, inner mouth, and outer mouth. For each group, lines are drawn between points in a predefined order with predefined colors (e.g., red, red, green, blue, yellow, yellow, cyan, and cyan, respectively), resulting in a rasterized image as shown in Figure 8.
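The rasterization step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the index ranges in `GROUPS` follow the common 68-point annotation order and are an assumption (the text does not list the indices, nor which side counts as "left"), and `draw_segment` is a hypothetical helper that marks pixels by dense sampling instead of using a proper line-drawing routine.

```python
import numpy as np

# Hypothetical index ranges (common 68-point order), RGB color, closed-loop flag.
# The paper only names the eight groups and their colors, not the indices.
GROUPS = {
    "contour":       (range(0, 17),  (0, 255, 0),   False),
    "right_eyebrow": (range(17, 22), (255, 255, 0), False),
    "left_eyebrow":  (range(22, 27), (255, 255, 0), False),
    "nose":          (range(27, 36), (0, 0, 255),   False),
    "right_eye":     (range(36, 42), (255, 0, 0),   True),
    "left_eye":      (range(42, 48), (255, 0, 0),   True),
    "outer_mouth":   (range(48, 60), (0, 255, 255), True),
    "inner_mouth":   (range(60, 68), (0, 255, 255), True),
}


def draw_segment(img, p, q, color):
    """Mark the pixels along the segment p -> q by sampling it densely."""
    h, w, _ = img.shape
    steps = int(np.ceil(np.abs(q - p).max())) + 1
    for t in np.linspace(0.0, 1.0, steps):
        x, y = np.rint(p + t * (q - p)).astype(int)
        if 0 <= x < w and 0 <= y < h:
            img[y, x] = color


def rasterize(landmarks_3d, height=256, width=256):
    """Orthographic projection (drop z), then per-group line drawing."""
    pts = np.asarray(landmarks_3d, dtype=float)[:, :2]  # (68, 2): keep x and y
    img = np.zeros((height, width, 3), dtype=np.uint8)
    for idx, color, closed in GROUPS.values():
        idx = list(idx)
        pairs = list(zip(idx[:-1], idx[1:]))
        if closed:  # eyes and mouth are drawn as closed loops (an assumption)
            pairs.append((idx[-1], idx[0]))
        for a, b in pairs:
            draw_segment(img, pts[a], pts[b], color)
    return img
```

Because the projection simply discards the depth coordinate, two landmark sets that differ only in $z$ produce the same rasterized image.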
MarioNETte consists of conditional image generator $G({\mathbf{r}}_{x};{\{{\mathbf{y}}^{i}\}}_{i=1\mathrm{\dots}K},{\{{\mathbf{r}}_{y}^{i}\}}_{i=1\mathrm{\dots}K})$ and projection discriminator $D(\widehat{\mathbf{x}},\widehat{\mathbf{r}},c)$. The discriminator $D$ determines whether the given image $\widehat{\mathbf{x}}$ is a real image from the data distribution taking into account the conditional input of the rasterized landmarks $\widehat{\mathbf{r}}$ and identity $c$.
The generator $G({\mathbf{r}}_{x};{\{{\mathbf{y}}^{i}\}}_{i=1\mathrm{\dots}K},{\{{\mathbf{r}}_{y}^{i}\}}_{i=1\mathrm{\dots}K})$ is further broken down into four components: target encoder, driver encoder, blender, and decoder. The target encoder ${E}_{y}(\mathbf{y},{\mathbf{r}}_{y})$ takes a target image and generates the encoded target feature map ${\mathbf{z}}_{y}$ together with the warped target feature maps $\widehat{\mathbf{S}}$. The driver encoder ${E}_{x}({\mathbf{r}}_{x})$ receives the rasterized driver landmark image ${\mathbf{r}}_{x}$ and creates a driver feature map ${\mathbf{z}}_{x}$. The blender $B({\mathbf{z}}_{x},{\{{\mathbf{z}}_{y}^{i}\}}_{i=1\mathrm{\dots}K})$ combines the encoded feature maps to produce a mixed feature map ${\mathbf{z}}_{xy}$. The decoder $Q({\mathbf{z}}_{xy},{\{{\widehat{\mathbf{S}}}^{i}\}}_{i=1\mathrm{\dots}K})$ generates the reenacted image. The input image $\mathbf{y}$ and the landmark image ${\mathbf{r}}_{y}$ are concatenated channel-wise and fed into the target encoder.
The target encoder ${E}_{y}(\mathbf{y},{\mathbf{r}}_{y})$ adopts a U-Net (?) style architecture including five downsampling blocks and four upsampling blocks with skip connections. Among the five feature maps ${\{{\mathbf{s}}_{j}\}}_{j=1\mathrm{\dots}5}$ generated by the downsampling blocks, the most downsampled feature map, ${\mathbf{s}}_{5}$, is used as the encoded target feature map ${\mathbf{z}}_{y}$, while the others, ${\{{\mathbf{s}}_{j}\}}_{j=1\mathrm{\dots}4}$, are transformed into normalized feature maps. A normalization flow map ${\mathbf{f}}_{y}\in {\mathbb{R}}^{(h/2)\times (w/2)\times 2}$ transforms each feature map into a normalized feature map, $\widehat{\mathbf{S}}={\{{\widehat{\mathbf{s}}}_{j}\}}_{j=1\mathrm{\dots}4}$, through the warping function $\mathcal{T}$ as follows:
$$\widehat{\mathbf{S}}=\{\mathcal{T}({\mathbf{s}}_{1};{\mathbf{f}}_{y}),\mathrm{\dots},\mathcal{T}({\mathbf{s}}_{4};{\mathbf{f}}_{y})\}.$$  (8) 
The flow map ${\mathbf{f}}_{y}$ is generated at the end of the upsampling blocks, followed by an additional convolution layer and a hyperbolic tangent activation layer, thereby producing a 2-channel feature map whose channels denote the flow in the horizontal and vertical directions, respectively.
We adopt a warping function based on a bilinear sampler, which is widely used with neural networks due to its differentiability (?; ?; ?). Since each ${\mathbf{s}}_{j}$ has a different width and height, average pooling is applied to downsample ${\mathbf{f}}_{y}$ so that its size matches that of ${\mathbf{s}}_{j}$.
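A minimal NumPy sketch of this warping step is given below. It is illustrative only and rests on assumptions not stated in the text: the flow values in $[-1,1]$ are interpreted as offsets in normalized image coordinates, samples falling outside the map are clamped to the border, and the flow resolution is an integer multiple of the feature-map resolution. The names `avg_pool_to` and `warp` are hypothetical.

```python
import numpy as np


def avg_pool_to(flow, out_h, out_w):
    """Average-pool an (H, W, 2) flow map down to (out_h, out_w, 2).

    Assumes H and W are integer multiples of out_h and out_w.
    """
    H, W, C = flow.shape
    fh, fw = H // out_h, W // out_w
    return flow.reshape(out_h, fh, out_w, fw, C).mean(axis=(1, 3))


def warp(feature, flow):
    """Bilinearly warp an (h, w, c) feature map with a flow map in [-1, 1].

    Assumption: channel 0 is a horizontal and channel 1 a vertical offset,
    expressed in normalized coordinates where each axis spans [-1, 1].
    """
    h, w, _ = feature.shape
    f = avg_pool_to(flow, h, w)  # match the flow size to the feature map
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Convert normalized offsets to pixel offsets and clamp to the border.
    sx = np.clip(xs + f[..., 0] * (w - 1) / 2.0, 0, w - 1)
    sy = np.clip(ys + f[..., 1] * (h - 1) / 2.0, 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0, x0] * (1 - wx) + feature[y0, x1] * wx
    bottom = feature[y1, x0] * (1 - wx) + feature[y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With an all-zero flow map the function returns the input unchanged, which is a convenient sanity check. Every operation is a weighted sum of neighboring values, which is why this kind of sampler is differentiable with respect to both the features and the flow.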
The driver encoder ${E}_{x}({\mathbf{r}}_{x})$, which consists of four residual downsampling blocks, takes driver landmark image ${\mathbf{r}}_{x}$ and generates driver feature map ${\mathbf{z}}_{x}$.
The blender $B({\mathbf{z}}_{x},{\{{\mathbf{z}}_{y}^{i}\}}_{i=1\mathrm{\dots}K})$ produces mixed feature map ${\mathbf{z}}_{xy}$ by blending the positional information of ${\mathbf{z}}_{x}$ with the target style feature maps ${\mathbf{z}}_{y}$. We stacked three image attention blocks to build our blender.
The decoder $Q({\mathbf{z}}_{xy},{\{{\widehat{\mathbf{S}}}^{i}\}}_{i=1\mathrm{\dots}K})$ consists of four warp-alignment blocks followed by residual upsampling blocks. Note that the last upsampling block is followed by an additional convolution layer and a hyperbolic tangent activation function.
The discriminator $D(\widehat{\mathbf{x}},\widehat{\mathbf{r}},c)$ consists of five residual downsampling blocks without self-attention layers. We adopt a projection discriminator with a slight modification: the global sum-pooling layer is removed from the original structure. Without the global sum-pooling layer, the discriminator generates scores on multiple patches, like the PatchGAN discriminator (?).
We adopt the residual upsampling and downsampling block proposed by ? (?) to build our networks. All batch normalization layers are substituted with instance normalization except for the target encoder and the discriminator, where the normalization layer is absent. We utilized ReLU as an activation function. The number of channels is doubled (or halved) when the output is downsampled (or upsampled). The minimum number of channels is set to 64 and the maximum number of channels is set to 512 for every layer. Note that the input image, which is used as an input for the target encoder, driver encoder, and discriminator, is first projected through a convolutional layer to match the channel size of 64.
Positional encoding
We utilize a sinusoidal positional encoding introduced by ? (?) with a slight modification. First, we divide the number of channels of the positional encoding in half. Then, we utilize half of them to encode the horizontal coordinate and the rest of them to encode the vertical coordinate. To encode the relative position, we normalize the absolute coordinate by the width and the height of the feature map. Thus, given a feature map of $\mathbf{z}\in {\mathbb{R}}^{{h}_{z}\times {w}_{z}\times {c}_{z}}$, the corresponding positional encoding $\mathbf{P}\in {\mathbb{R}}^{{h}_{z}\times {w}_{z}\times {c}_{z}}$ is computed as follows:
$$\begin{aligned}
{\mathbf{P}}_{i,j,4k} &= \sin\left(\frac{256\,i}{{h}_{z}\cdot {10000}^{2k/{c}_{z}}}\right),\\
{\mathbf{P}}_{i,j,4k+1} &= \cos\left(\frac{256\,i}{{h}_{z}\cdot {10000}^{2k/{c}_{z}}}\right),\\
{\mathbf{P}}_{i,j,4k+2} &= \sin\left(\frac{256\,j}{{w}_{z}\cdot {10000}^{2k/{c}_{z}}}\right),\\
{\mathbf{P}}_{i,j,4k+3} &= \cos\left(\frac{256\,j}{{w}_{z}\cdot {10000}^{2k/{c}_{z}}}\right).
\end{aligned}$$  (9)
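The encoding of Equation 9 can be computed directly, as in the following sketch. It assumes that ${c}_{z}$ is divisible by 4 and that $k$ runs over $0,\dots,{c}_{z}/4-1$, which the text implies but does not state; the function name `positional_encoding` is ours.

```python
import numpy as np


def positional_encoding(h_z, w_z, c_z):
    """Positional encoding P of shape (h_z, w_z, c_z) following Equation 9.

    Assumes c_z is divisible by 4 and k ranges over 0 .. c_z / 4 - 1.
    """
    assert c_z % 4 == 0
    k = np.arange(c_z // 4)
    freq = 1.0 / (10000.0 ** (2.0 * k / c_z))  # one frequency per k
    i = np.arange(h_z)[:, None, None]          # vertical index
    j = np.arange(w_z)[None, :, None]          # horizontal index
    row = 256.0 * i / h_z * freq               # shape (h_z, 1, c_z / 4)
    col = 256.0 * j / w_z * freq               # shape (1, w_z, c_z / 4)
    P = np.zeros((h_z, w_z, c_z))
    P[..., 0::4] = np.sin(row)
    P[..., 1::4] = np.cos(row)
    P[..., 2::4] = np.sin(col)
    P[..., 3::4] = np.cos(col)
    return P
```

Since the coordinates are divided by the feature-map size, the same relative position receives the same code at different resolutions: row 8 of a 16-row map and row 16 of a 32-row map are encoded identically.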
Loss functions
Our model is trained in an adversarial manner using a projection discriminator $D$ (?). The discriminator aims to distinguish between the real image of the identity $c$ and a synthesized image of $c$ generated by $G$. Since the paired target and the driver images from different identities cannot be acquired without explicit annotation, we trained our model using the target and the driver image extracted from the same video. Thus, identities of $\mathbf{x}$ and ${\mathbf{y}}^{i}$ are always the same, e.g., $c$, for every target and driver image pair, i.e., $(\mathbf{x},{\{{\mathbf{y}}^{i}\}}_{i=1\mathrm{\dots}K})$, during the training.
We use hinge GAN loss (?) to optimize discriminator $D$ as follows:
$$\begin{aligned}
\widehat{\mathbf{x}} &= G({\mathbf{r}}_{x};\{{\mathbf{y}}^{i}\},\{{\mathbf{r}}_{y}^{i}\}),\\
{\mathcal{L}}_{D} &= \max(0,1-D(\mathbf{x},{\mathbf{r}}_{x},c))+\max(0,1+D(\widehat{\mathbf{x}},{\mathbf{r}}_{x},c)).
\end{aligned}$$  (10)
The loss function of the generator consists of four components including the GAN loss ${\mathcal{L}}_{GAN}$, the perceptual losses (${\mathcal{L}}_{P}$ and ${\mathcal{L}}_{PF}$), and the feature matching loss ${\mathcal{L}}_{FM}$. The GAN loss ${\mathcal{L}}_{GAN}$ is a generator part of the hinge GAN loss and defined as follows:
${\mathcal{L}}_{GAN}=-D(\widehat{\mathbf{x}},{\mathbf{r}}_{x},c).$  (11) 
The perceptual loss (?) is calculated by averaging the ${L}_{1}$ distances between the intermediate features of a pretrained network computed on the ground truth image $\mathbf{x}$ and on the generated image $\widehat{\mathbf{x}}$. We use two different networks for the perceptual losses: ${\mathcal{L}}_{P}$ is computed with VGG19 trained for the ImageNet classification task (?), and ${\mathcal{L}}_{PF}$ with VGG-VD-16 trained for a face recognition task (?). We use features from the following layers to compute the perceptual losses: relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1. The feature matching loss ${\mathcal{L}}_{FM}$ is the sum of the ${L}_{1}$ distances between the intermediate features of the discriminator $D$ when processing the ground truth image $\mathbf{x}$ and the generated image $\widehat{\mathbf{x}}$, which helps to stabilize the adversarial training. The overall generator loss is the weighted sum of the four losses:
$${\mathcal{L}}_{G}={\mathcal{L}}_{GAN}+{\lambda}_{P}{\mathcal{L}}_{P}+{\lambda}_{PF}{\mathcal{L}}_{PF}+{\lambda}_{FM}{\mathcal{L}}_{FM}.$$  (12) 
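The adversarial terms and the weighted sum can be written out as a short sketch. It takes the generator term with a negative sign, as is standard for the hinge formulation, and averages the patch-wise discriminator scores, which is an assumption about the reduction. The perceptual and feature matching terms are passed in as precomputed scalars, and the default weights are the values reported under Training details.

```python
import numpy as np


def discriminator_hinge_loss(d_real, d_fake):
    """max(0, 1 - D(x, r_x, c)) + max(0, 1 + D(x_hat, r_x, c)).

    d_real and d_fake are arrays of patch scores; averaging over patches
    is an assumption, not something the text specifies.
    """
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return real_term + fake_term


def generator_gan_loss(d_fake):
    """Generator part of the hinge loss: the negated score of the fake image."""
    return -np.mean(d_fake)


def generator_loss(d_fake, l_p, l_pf, l_fm, lam_p=10.0, lam_pf=0.01, lam_fm=10.0):
    """Weighted sum of the GAN, perceptual (two networks), and feature matching losses."""
    return generator_gan_loss(d_fake) + lam_p * l_p + lam_pf * l_pf + lam_fm * l_fm
```

The hinge only penalizes real patches scored below 1 and fake patches scored above -1, so the discriminator receives no gradient from samples it already classifies with a sufficient margin.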
Training details
To stabilize the adversarial training, we apply spectral normalization (?) to every layer of the discriminator and the generator. In addition, we use the convex hull of the facial landmarks as a facial region mask and give threefold weights to the corresponding masked positions while computing the perceptual loss. We use the Adam optimizer to train our model, with a learning rate of $2\times {10}^{-4}$ for the discriminator and $5\times {10}^{-5}$ for the generator and the style encoder. Unlike the setting of ? (?), we update the discriminator only once per generator update. We set ${\lambda}_{P}$ to 10, ${\lambda}_{PF}$ to 0.01, ${\lambda}_{FM}$ to 10, and the number of target images $K$ to 4 during the training.
Appendix B Landmark Transformer Details
Landmark decomposition
Formally, landmark decomposition is calculated as:
$$\begin{aligned}
{\overline{\mathbf{l}}}_{m} &= \frac{1}{CT}\sum_{c}\sum_{t}\overline{\mathbf{l}}(c,t),\\
{\overline{\mathbf{l}}}_{id}(c) &= \frac{1}{{T}_{c}}\sum_{t}\overline{\mathbf{l}}(c,t)-{\overline{\mathbf{l}}}_{m},\\
{\overline{\mathbf{l}}}_{exp}(c,t) &= \overline{\mathbf{l}}(c,t)-{\overline{\mathbf{l}}}_{m}-{\overline{\mathbf{l}}}_{id}(c)\\
&= \overline{\mathbf{l}}(c,t)-\frac{1}{{T}_{c}}\sum_{t}\overline{\mathbf{l}}(c,t),
\end{aligned}$$  (13)
where $C$ is the number of videos, ${T}_{c}$ is the number of frames of the $c$-th video, and $T=\sum {T}_{c}$. We can easily compute the components shown in Equation 13 from the training dataset.
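On a training set the decomposition reduces to a few averages, as the sketch below shows. It takes the mean landmark to be the plain average over all frames of all videos, which is an assumption about the normalizer, and it expects the landmarks to be normalized already; the function name `decompose` is ours.

```python
import numpy as np


def decompose(videos):
    """Split normalized landmarks into mean, identity, and expression parts.

    videos: list of arrays, one per video c, each of shape (T_c, 68, 3).
    Returns (l_m, l_id, l_exp), where l_id and l_exp are lists aligned with videos.
    """
    # Assumption: the mean landmark is the average over every frame of every video.
    l_m = np.concatenate(videos, axis=0).mean(axis=0)
    # Identity geometry: per-video mean minus the global mean.
    l_id = [v.mean(axis=0) - l_m for v in videos]
    # Expression geometry: what remains after removing mean and identity.
    l_exp = [v - l_m - lid for v, lid in zip(videos, l_id)]
    return l_m, l_id, l_exp
```

For a video with a single frame the per-video mean equals that frame, so its expression geometry is exactly zero regardless of the expression shown.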
However, when an image of an unseen identity ${c}^{\prime}$ is given, the decomposition of the identity and the expression shown in Equation 13 is not possible, since ${\overline{\mathbf{l}}}_{exp}({c}^{\prime},t)$ will be zero for a single image. Even when a few frames of an unseen identity ${c}^{\prime}$ are given, ${\overline{\mathbf{l}}}_{exp}({c}^{\prime},t)$ will be zero (or near zero) if the expressions in the given frames are not diverse enough. Thus, to perform the decomposition shown in Equation 13 even under the one-shot or few-shot settings, we introduce a landmark disentangler.
Landmark disentanglement
To compute the expression basis ${\mathbf{b}}_{exp}$, using the expression geometry obtained from the VoxCeleb1 training data, we divide a landmark into different groups (e.g., left eye, right eye, eyebrows, mouth, and any other) and perform PCA on each group. We utilize PCA dimensions of 8, 8, 8, 16 and 8, for each group, resulting in a total number of expression bases, ${n}_{exp}$, of 48.
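A per-group PCA of this kind could be sketched as below. Several details are assumptions made for illustration: the landmark indices assigned to each group, the use of an SVD on mean-centered data to obtain the principal directions, and the zero-padding of each group's components to the full $68\times 3$ landmark vector. Only the group names and the dimensions 8, 8, 8, 16, and 8 come from the text.

```python
import numpy as np

# (landmark indices, number of PCA components) per group.
# The index ranges are hypothetical; the text gives only names and dimensions.
GROUPS = [
    (list(range(42, 48)), 8),                       # left eye
    (list(range(36, 42)), 8),                       # right eye
    (list(range(17, 27)), 8),                       # eyebrows
    (list(range(48, 68)), 16),                      # mouth
    (list(range(0, 17)) + list(range(27, 36)), 8),  # any other
]


def expression_basis(l_exp):
    """Compute an expression basis from expression geometry of shape (N, 68, 3).

    Returns an array of shape (n_exp, 204) with n_exp = 48 for the groups above.
    """
    n = l_exp.shape[0]
    basis = []
    for idx, dim in GROUPS:
        x = l_exp[:, idx, :].reshape(n, -1)
        x = x - x.mean(axis=0)
        # Rows of vt are principal directions, sorted by explained variance.
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        for comp in vt[:dim]:
            full = np.zeros((68, 3))
            full[idx] = comp.reshape(len(idx), 3)  # zero outside the group
            basis.append(full.ravel())
    return np.stack(basis)
```

Because the groups do not share landmarks, basis vectors from different groups have disjoint support, so the stacked basis is orthonormal under these assumptions.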
We train the landmark disentangler on the VoxCeleb1 training set, separately. Before training the landmark disentangler, we normalize each expression parameter ${\alpha}_{i}$ to follow a standard normal distribution $\mathcal{N}(0,{1}^{2})$ for ease of regression training. We employ ResNet50, which is pretrained on ImageNet (?), and extract features from the first layer to the last layer right before the global average pooling layer. The extracted image features are concatenated with the normalized landmark $\overline{\mathbf{l}}$ minus the mean landmark ${\overline{\mathbf{l}}}_{m}$, and fed into a 2-layer MLP followed by a ReLU activation. The whole network is optimized by minimizing the MSE loss between the predicted expression parameters and the target expression parameters, using the Adam optimizer with a learning rate of $3\times {10}^{-4}$. We use gradient clipping with a maximum gradient norm of 1 during the training. We set the expression intensity parameter ${\lambda}_{exp}$ to 1.5.
Appendix C Additional Ablation Tests
Quantitative results
In Table 1 and Table 2 of the main paper, MarioNETte shows better PRMSE and AUCON than NeuralHead-FF under the self-reenactment setting on VoxCeleb1; the ordering is reversed, however, when reenacting a different identity on CelebV. We provide an explanation of this phenomenon through an ablation study.
Table 5 illustrates the evaluation results of ablation models under the self-reenactment setting on VoxCeleb1. Unlike the evaluation results of reenacting a different identity on CelebV (Table 4 of the main paper), +Alignment and MarioNETte show better PRMSE and AUCON than AdaIN. The phenomenon may be attributed to the characteristics of the training dataset as well as the different inductive biases of the models. VoxCeleb1 consists of short video clips (usually 5-10 s long), leading to similar poses and expressions between drivers and targets. Unlike the AdaIN-based model, which is unaware of spatial information, the proposed image attention block and the target feature alignment encode spatial information from the target image. We suspect that this may lead to possible overfitting of the proposed model to same-identity pairs with similar poses and expressions.
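
Table 5: Evaluation results of ablation models under the self-reenactment setting on VoxCeleb1.

| Model (# target) | CSIM$\uparrow $ | PRMSE$\downarrow $ | AUCON$\uparrow $ |
| --- | --- | --- | --- |
| AdaIN (1) | 0.183 | 3.719 | 0.781 |
| +Attention (1) | 0.611 | 3.257 | 0.825 |
| +Alignment (1) | 0.756 | 3.069 | 0.827 |
| MarioNETte (1) | 0.755 | 3.125 | 0.825 |
| AdaIN (8) | 0.188 | 3.649 | 0.787 |
| +Attention (8) | 0.717 | 2.909 | 0.843 |
| +Alignment (8) | 0.826 | 2.563 | 0.845 |
| MarioNETte (8) | 0.828 | 2.571 | 0.850 |
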
Qualitative results
Figure 9 and Figure 10 illustrate the results of ablation models reenacting a different identity on CelebV under the one-shot and few-shot settings, respectively. While AdaIN fails to generate an image that resembles the target identity, +Attention successfully maintains the key characteristics of the target. The target feature alignment module adds fine-grained details to the generated image. However, MarioNETte tends to generate more natural images in the few-shot setting, while +Alignment struggles to deal with multiple target images with diverse poses and expressions.
Appendix D Inference Time
In this section, we report the inference time of our model. We measured the latency of the proposed method while generating $256\times 256$ images with different numbers of target images, $K\in \{1,8\}$. We ran each setting 300 times and report the average latency. We utilized an Nvidia Titan Xp and PyTorch 1.0.1.post2. As mentioned in the main paper, we used the open-sourced implementation of ? (?) to extract 3D facial landmarks.

Table 6: Inference time breakdown.

| Description | Symbol | Inference time (ms) |
| --- | --- | --- |
| 3D Landmark Detector | ${T}_{P}$ | 101 |
| Target Encoder | ${T}_{E,K}$ | 44 (K=1), 111 (K=8) |
| Target Landmark Transformer | ${T}_{TLT,K}$ | 22 (K=1), 19 (K=8) |
| Generator | ${T}_{G,K}$ | 35 (K=1), 36 (K=8) |
| Driver Landmark Transformer | ${T}_{DLT}$ | 26 |

Table 7: Total inference time of the proposed models.

| Model | Target encoding | Driver generation |
| --- | --- | --- |
| MarioNETte+LT | $K\cdot {T}_{P}+{T}_{TLT,K}+{T}_{E,K}$ | ${T}_{P}+{T}_{DLT}+{T}_{G,K}$ |
| MarioNETte | $K\cdot {T}_{P}+{T}_{E,K}$ | ${T}_{P}+{T}_{G,K}$ |

Table 6 displays the inference time breakdown of our models. The total inference time of the proposed models, MarioNETte+LT and MarioNETte, can be derived as shown in Table 7. While generating reenactment videos, ${\mathbf{z}}_{y}$ and $\widehat{\mathbf{S}}$, which constitute the target encoding, are generated only once at the beginning. Thus, we divide our inference pipeline into a target encoding part and a driver generation part.
Since we perform batched inference for multiple target images, the inference time of the proposed components (e.g., the target encoder and the target landmark transformer) scales sublinearly with the number of target images $K$. On the other hand, the open-source 3D landmark detector processes images sequentially, and thus its processing time scales linearly.
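As a worked example, the formulas of Table 7 can be evaluated with the measurements of Table 6. The snippet below only performs this arithmetic; the function name `inference_time` is ours.

```python
# Measured latencies in milliseconds, copied from Table 6.
T_P = 101                  # 3D landmark detector (per image)
T_E = {1: 44, 8: 111}      # target encoder, keyed by K
T_TLT = {1: 22, 8: 19}     # target landmark transformer, keyed by K
T_G = {1: 35, 8: 36}       # generator, keyed by K
T_DLT = 26                 # driver landmark transformer


def inference_time(k, with_lt):
    """Return (target encoding, driver generation) time in ms, following Table 7."""
    target = k * T_P + T_E[k] + (T_TLT[k] if with_lt else 0)
    driver = T_P + T_G[k] + (T_DLT if with_lt else 0)
    return target, driver


for k in (1, 8):
    print("MarioNETte+LT", k, inference_time(k, with_lt=True))
    print("MarioNETte   ", k, inference_time(k, with_lt=False))
```

For MarioNETte+LT this gives 167 ms of target encoding and 162 ms per generated frame at $K=1$, and 938 ms and 163 ms at $K=8$. Target encoding is paid once per video and is dominated by the sequential landmark detector ($8\times 101$ ms at $K=8$), while the per-frame cost is almost independent of $K$.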
Appendix E Additional Examples of Generated Images
We provide additional qualitative results of the baseline methods and the proposed models on the VoxCeleb1 and CelebV datasets. We report the qualitative results for both one-shot and few-shot (8 target images) settings, except for MonkeyNet, which is designed to use only a single image. In the case of few-shot reenactment, we display only one target image due to limited space.
Figure 11 and Figure 12 compare different methods for self-reenactment on VoxCeleb1 in the one-shot and few-shot settings, respectively. Examples of one-shot and few-shot reenactments on VoxCeleb1 where the identities of the driver and the target do not match are shown in Figure 13 and Figure 14, respectively.
Figure 15, Figure 16, and Figure 17 depict the qualitative results on the CelebV dataset. One-shot and few-shot self-reenactment settings of various methods are compared in Figure 15 and Figure 16, respectively. The results of reenacting a different identity on CelebV under the few-shot setting can be found in Figure 17.
Figure 18 shows failure cases generated by MarioNETte+LT while performing one-shot reenactment under the different-identity setting on VoxCeleb1. A large pose difference between the driver and the target seems to be the main reason for the failures.