Abstract
The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture and measured using Maximum Mean Discrepancy with a polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that JEDi is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value while increasing alignment with human evaluation by 34% on average.
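As an illustration of the distance named above, the following is a minimal sketch of an unbiased squared Maximum Mean Discrepancy estimate with a polynomial kernel, computed between two sets of feature vectors. It is not the reference JEDi implementation: the kernel hyperparameters (`degree=3`, `gamma = 1/d`, `coef0=1`), the function names, and the random Gaussian arrays standing in for JEPA features are all assumptions made for the example.

```python
import numpy as np


def polynomial_kernel(x, y, degree=3, gamma=None, coef0=1.0):
    """Polynomial kernel k(a, b) = (gamma * <a, b> + coef0) ** degree."""
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default: 1 / feature dimension
    return (gamma * (x @ y.T) + coef0) ** degree


def mmd2_unbiased(x, y, **kernel_args):
    """Unbiased squared-MMD estimate between x of shape (m, d) and y of shape (n, d)."""
    m, n = x.shape[0], y.shape[0]
    k_xx = polynomial_kernel(x, x, **kernel_args)
    k_yy = polynomial_kernel(y, y, **kernel_args)
    k_xy = polynomial_kernel(x, y, **kernel_args)
    # Exclude the diagonal terms k(x_i, x_i) so the within-set averages are unbiased.
    mean_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    mean_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return mean_xx + mean_yy - 2.0 * k_xy.mean()


# Hypothetical stand-ins for video embeddings: 200 samples, 16 dimensions each.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 16))
same = rng.normal(size=(200, 16))               # drawn from the same distribution
shifted = rng.normal(loc=0.5, size=(200, 16))   # mean-shifted distribution

mmd_same = mmd2_unbiased(real, same)
mmd_shifted = mmd2_unbiased(real, shifted)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

Unlike a Fréchet distance, this estimate does not fit a Gaussian to the features, which is why it does not depend on the Gaussianity assumption listed as limitation (1). Because the estimator is unbiased, it can come out slightly negative when the two sets share a distribution.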