Abstract
Portrait image animation driven by audio input has made notable progress in generating lifelike and dynamic portraits. Conventional methods drive images into videos using either audio or facial key points alone. While they can yield satisfactory results, certain issues remain. For instance, methods driven solely by audio can be unstable at times because the audio signal is relatively weak, while methods driven exclusively by facial key points, although more stable, can produce unnatural results due to the excessive control exerted by key point information. To address these challenges, in this paper we introduce a novel approach named EchoMimic. EchoMimic is trained concurrently on both audio and facial landmarks. Through a novel training strategy, EchoMimic can generate portrait videos not only from audio or facial landmarks individually, but also from a combination of audio and selected facial landmarks. EchoMimic has been comprehensively compared with alternative algorithms across various public datasets and our collected dataset, showing superior performance in both quantitative and qualitative evaluations. Additional visualizations and the source code are available on the EchoMimic project page.