Paper Title
Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach
Paper Authors
Abstract
The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. With crowdsourcing, in contrast, crowdworkers participate in a subjective online experiment using their own listening devices and in their own working environments. Despite these less controllable conditions, the increased use of crowdsourcing micro-task platforms for quality assessment tasks has created a high demand for standardized methods, resulting in ITU-T Rec. P.808. This work investigates the impact of the number of judgments on the reliability and validity of quality ratings collected through crowdsourcing-based speech quality assessments, as an input to ITU-T Rec. P.808. Three crowdsourcing experiments on different platforms were conducted to evaluate the overall quality of three different speech datasets, using the Absolute Category Rating procedure. For each dataset, the Mean Opinion Scores (MOS) are calculated using differing numbers of crowdsourcing judgments. The results are then compared to MOS values collected in a standard laboratory experiment, to assess the validity of the crowdsourcing approach as a function of the number of votes. In addition, the reliability of the average scores is analyzed by checking inter-rater reliability, gain in certainty, and the confidence of the MOS. The results provide a suggestion on the required number of votes per condition, and allow its impact on validity and reliability to be modeled.
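The vote-count analysis summarized in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: the function names (`validity_vs_votes`, `subsample_mos`, `ci95_half_width`), the random subsampling without replacement, the choice of Pearson correlation and RMSE as validity measures, and the normal-approximation 95% confidence interval are all assumptions made for the sketch.

```python
import math
import random
import statistics


def mos(votes):
    """Mean Opinion Score: arithmetic mean of ACR votes (scale 1 to 5)."""
    return statistics.fmean(votes)


def ci95_half_width(votes):
    """Half-width of a 95% confidence interval of the MOS.

    Assumption: normal approximation (z = 1.96) with the sample standard deviation.
    """
    if len(votes) < 2:
        return float("nan")
    return 1.96 * statistics.stdev(votes) / math.sqrt(len(votes))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if one is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(statistics.fmean([(a - b) ** 2 for a, b in zip(x, y)]))


def subsample_mos(votes_by_condition, n_votes, rng):
    """MOS per condition from n_votes votes drawn without replacement."""
    return {c: mos(rng.sample(v, n_votes)) for c, v in votes_by_condition.items()}


def validity_vs_votes(votes_by_condition, lab_mos, vote_counts, repeats=200, seed=0):
    """Average correlation and RMSE against laboratory MOS for each vote count.

    votes_by_condition: dict mapping condition id to a list of crowdsourced votes
    lab_mos:            dict mapping condition id to the laboratory MOS
    Returns a dict mapping each vote count to (mean Pearson r, mean RMSE).
    """
    rng = random.Random(seed)
    conditions = sorted(votes_by_condition)
    lab = [lab_mos[c] for c in conditions]
    result = {}
    for n in vote_counts:
        correlations, errors = [], []
        for _ in range(repeats):
            sub = subsample_mos(votes_by_condition, n, rng)
            crowd = [sub[c] for c in conditions]
            correlations.append(pearson(crowd, lab))
            errors.append(rmse(crowd, lab))
        result[n] = (statistics.fmean(correlations), statistics.fmean(errors))
    return result
```

Calling `validity_vs_votes(votes, lab, [5, 10, 20])` with hypothetical vote lists then shows how agreement with the laboratory MOS changes as more votes per condition are used, while `ci95_half_width` gives one possible view of the confidence of each MOS.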