Paper title
A Corpus for Large-Scale Phonetic Typology
Paper authors
Abstract
A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.