G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Paper Title

G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Authors

Pan Xie, Qipeng Zhang, Taiyi Peng, Hao Tang, Yao Du, Zexian Li

Abstract

The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task by converting the continuous pose space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation for continuous pose sequences. Additionally, we propose the G2P-DDM model, a discrete denoising diffusion architecture for length-varied discrete sequence data, to model the latent prior. To further enhance the quality of pose sequence generation in the discrete space, we present the CodeUnet model to leverage spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict variable lengths of pose sequences for corresponding gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm
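
The step of turning continuous pose features into discrete tokens can be illustrated with a generic vector-quantization lookup. The sketch below is not the authors' Pose-VQVAE; it only shows the nearest-codebook assignment that vector quantization generally relies on. The codebook values, the 2-dimensional latents, and the names `quantize` and `dequantize` are hypothetical and chosen for illustration.

```python
# Illustrative sketch only: the codebook, latent dimension, and function
# names are assumptions, not taken from the paper.
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[k]) for k in indices]


# Toy codebook with three entries and three per-frame latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

tokens = quantize(latents, codebook)
print(tokens)  # [0, 1, 2]
```

A sequence of such token indices is the kind of discrete representation over which a discrete prior, such as the diffusion model described in the abstract, could then be trained.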
