All’s Well That FID’s Well? Result Quality and Metric Scores in GAN Models for Lip-Synchronization Tasks

Bibliographic Details
Published in:	Electronics vol. 14, no. 17 (2025), p. 3487-3506
Main Author:	Geldhauser, Carina
Other Authors:	Liljegren, Johan; Nordqvist, Pontus
Published:	MDPI AG
Subjects:	Speech; Machine learning; Computer-generated imagery; Performance measurement; Deep learning; Computer vision; Video recordings; Neural networks; Generative adversarial networks; Synchronism; Audio data; Animation; Realism

Description
Abstract:	This exploratory study investigates the usability of performance metrics for generative adversarial network (GAN)-based models for speech-driven facial animation. These models focus on the transfer of speech information from an audio file to a still image to generate talking-head videos in a small-scale “everyday usage” setting. Two models, LipGAN and a custom implementation of a Wasserstein GAN with gradient penalty (L1WGAN-GP), are examined for their visual performance and scoring according to commonly used metrics. Quantitative comparisons using FID, SSIM, and PSNR metrics on the GRIDTest dataset show mixed results, and metrics fail to capture local artifacts crucial for lip synchronization, pointing to limitations in their applicability for video animation tasks. The study points towards the inadequacy of current quantitative measures and emphasizes the continued necessity of human qualitative assessment for evaluating talking-head video quality.
ISSN:	2079-9292
DOI:	10.3390/electronics14173487
Source:	Advanced Technologies & Aerospace Database
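
Illustrative note: the abstract's claim that frame-level metrics can miss local artifacts can be made concrete with a small PSNR calculation. The sketch below is not the authors' evaluation code. The 64x64 grayscale frame, the 8x8 "mouth" patch, and the pixel offsets are all hypothetical values chosen so that the arithmetic is easy to follow. It compares a frame with a faint error on every pixel against a frame that is perfect except for a strong error confined to the mouth region.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized grayscale frames."""
    total = 0.0
    count = 0
    for ref_row, gen_row in zip(reference, generated):
        for r, g in zip(ref_row, gen_row):
            total += (r - g) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 64x64 mid-gray reference frame (assumption, not GRID data).
size = 64
reference = [[128] * size for _ in range(size)]

# Frame A: every pixel is off by 4 gray levels (faint, spread-out error).
global_noise = [[p + 4 for p in row] for row in reference]

# Frame B: perfect everywhere except an 8x8 "mouth" patch that is off by 32.
local_artifact = [row[:] for row in reference]
for y in range(40, 48):
    for x in range(28, 36):
        local_artifact[y][x] += 32

psnr_global = psnr(reference, global_noise)
psnr_local = psnr(reference, local_artifact)

print(f"PSNR, faint global error:  {psnr_global:.2f} dB")
print(f"PSNR, strong local error:  {psnr_local:.2f} dB")
```

Both frames have a mean squared error of 16, so they receive the same PSNR of about 36.1 dB, even though the second frame's error sits entirely in the region that matters for lip synchronization. This toy case only shows why a single global score can hide such artifacts; it says nothing about the magnitudes reported in the study.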