Response to Moffat's Comment on "Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales"

Paper title

Response to Moffat's Comment on "Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales"

Authors

Ferrante, Marco, Ferro, Nicola, Fuhr, Norbert

Abstract

Moffat recently commented on our previous work. That work focused on how grounding our evaluation methodology in the theory of measurement can improve our knowledge and understanding of the evaluation measures we use in IR, and how it can shed light on the different types of scales adopted by those measures; we also provided evidence, through extensive experimentation, on the impact of the different types of scales on statistical analyses, as well as on the impact of departing from their assumptions. Moreover, we investigated, for the first time in IR, the concept of meaningfulness, i.e. the invariance of the experimental statements and inferences one draws, and proposed it as a way to ensure more valid and generalizable results. Moffat's comments (i) build on misconceptions about the representational theory of measurement, such as what an interval scale actually is and what axioms it has to comply with, and (ii) totally miss the central concept of meaningfulness. Therefore, we reply to Moffat's comments by properly framing them in the representational theory of measurement and in the concept of meaningfulness. All in all, we can only reiterate what we have said several times: the goal of this line of research is to theoretically ground our evaluation methodology (and IR is a field where making any theoretical advance is extremely challenging) in order to aim for more robust and generalizable inferences, something we currently lack in the field. There may be other and better ways to achieve this objective, and such proposals could emerge from an open discussion in the field and from the work of others. On the other hand, reducing everything to a dispute over what is (or pretends to be) an interval scale, or over whether all or none of the evaluation measures are interval scales, may be more of a barrier than a help in progressing towards this goal.
