Teacher-Student Training and Triplet Loss for Facial Expression Recognition under Occlusion

Abstract

In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50% of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First of all, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second of all, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third of all, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting.
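
To make the triplet-based objective described above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a hinge-style triplet loss over squared Euclidean distances; the `margin` value, the distance function, and the use of concatenation to combine the two distilled embeddings are all assumptions, since the abstract does not specify them. The function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_distillation_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    anchor:   student embedding of an occluded face
    positive: teacher embedding of a fully-visible face, same class as anchor
    negative: student embedding of an occluded face, different class
    margin:   assumed hyperparameter (not given in the abstract)
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    # The loss is zero once the anchor is closer to the positive than to
    # the negative by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)


def combine_embeddings(standard_kd, triplet_kd):
    """Merge two distilled embeddings into one vector.

    Concatenation is an assumed combination rule for illustration.
    """
    return list(standard_kd) + list(triplet_kd)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    positive = [1.0, 0.0]
    far_negative = [3.0, 0.0]
    near_negative = [0.5, 0.0]
    print(triplet_distillation_loss(anchor, positive, far_negative))
    print(triplet_distillation_loss(anchor, positive, near_negative))
    print(combine_embeddings([1, 2], [3]))
```

In this sketch, a negative that is already far from the anchor contributes no loss, while a negative that is closer to the anchor than the teacher's positive embedding is penalized, which pushes the student's embeddings of occluded faces toward the teacher's embeddings of the same class.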
