Abstract
Humans modify their facial expressions in order to communicate their internal states and sometimes to mislead observers regarding their true emotional states. Evidence in experimental psychology shows that discriminative facial responses are short and subtle. This suggests that such behavior would be easier to distinguish when captured in high resolution at an increased frame rate. We propose SASE-FE, the first dataset of facial expressions that are either congruent or incongruent with underlying emotion states. We show that, overall, the problem of recognizing whether facial movements are expressions of authentic emotions or not can be successfully addressed by learning spatio-temporal representations of the data. For this purpose, we propose a method that aggregates features along fiducial trajectories in a deeply learnt space. Performance of the proposed model shows that, on average, it is easier to distinguish among genuine facial expressions of emotion than among unfelt facial expressions of emotion, and that certain emotion pairs, such as contempt and disgust, are more difficult to distinguish than the rest. Furthermore, the proposed methodology improves state-of-the-art results on the CK+ and OULU-CASIA datasets for video emotion recognition, and achieves competitive results when classifying facial action units on the BP4D dataset.