Paper title
Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages
Paper authors
Abstract
Automatic Speech Recognition (ASR) has increasing utility in the modern world. There are many ASR models available for languages with large amounts of training data, such as English. However, low-resource languages are poorly represented. In response, we create and release an open-licensed and formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits and train and analyze two competitive ASR models to serve as the baseline for future research using this data.