Paper Title


IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

Paper Authors

Javed, Tahir, Bhogale, Kaushal Santosh, Raman, Abhigyan, Kunchukuttan, Anoop, Kumar, Pratyush, Khapra, Mitesh M.

Paper Abstract

A cornerstone in AI research has been the creation and adoption of standardized training and test datasets to earmark the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating Natural Language Understanding (NLU) models for English. The large body of research around self-supervised BERT-based language models revolved around performance improvements on NLU tasks in GLUE. To evaluate language models in other languages, several language-specific GLUE datasets were created. The area of speech language understanding (SLU) has followed a similar trajectory. The success of large self-supervised models such as wav2vec2 enable creation of speech models with relatively easy to access unlabelled data. These models can then be evaluated on SLU tasks, such as the SUPERB benchmark. In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks: Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query By Example, and Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside a commonly used baseline FBANK. We show that language-specific fine-tuned models are more accurate than baseline on most of the tasks, including a large gap of 76\% for the Language Identification task. However, for speaker identification, self-supervised models trained on large datasets demonstrate an advantage. We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.
