Paper Title
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
Paper Authors
Paper Abstract
To evaluate Information Retrieval (IR) effectiveness, a possible approach is to use test collections, which are composed of a collection of documents, a set of descriptions of information needs (called topics), and a set of relevant documents for each topic. Test collections are modelled on a competition scenario: for example, in the well-known TREC initiative, participants run their own retrieval systems over a set of topics and provide a ranked list of retrieved documents; some of the retrieved documents (usually the top-ranked ones) constitute the so-called pool, and their relevance is evaluated by human assessors; the document lists are then used to compute effectiveness metrics and to rank the participant systems. Private Web Search companies also run their own in-house evaluation exercises; although the details are mostly unknown and the aims are somewhat different, the overall approach shares several issues with the test collection approach. The aim of this work is to: (i) develop and improve some state-of-the-art work on the evaluation of IR effectiveness while saving resources, and (ii) propose a novel, more principled and engineered, overall approach to test-collection-based effectiveness evaluation. [...]
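The abstract summarizes the standard pooling-based evaluation protocol. As a minimal sketch of that protocol (not the paper's own method), the following illustrates depth-k pooling and scoring with average precision on toy data; all names here (POOL_DEPTH, runs, qrels, average_precision) and the sample documents are illustrative assumptions, not taken from the paper.

```python
# Sketch of the test-collection protocol described above, on toy data.
# Each "system" contributes a ranked list of document IDs for one topic.

POOL_DEPTH = 2  # depth-k pooling: only the top-k documents per run are judged

# Ranked results of two hypothetical participant systems for a single topic.
runs = {
    "systemA": ["d1", "d3", "d2", "d5"],
    "systemB": ["d3", "d4", "d1", "d2"],
}

# Step 1: pool the top-ranked documents across all participating runs.
pool = set()
for ranking in runs.values():
    pool.update(ranking[:POOL_DEPTH])

# Step 2: human assessors judge only the pooled documents (simulated here);
# documents outside the pool are conventionally treated as non-relevant.
qrels = {"d1": 1, "d3": 1, "d4": 0}  # assessor judgements on the pool

def average_precision(ranking, qrels):
    """Average precision, counting unjudged documents as non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            hits += 1
            precision_sum += hits / rank
    total_relevant = sum(1 for judgement in qrels.values() if judgement > 0)
    return precision_sum / total_relevant if total_relevant else 0.0

# Step 3: compute the effectiveness metric and rank the participant systems.
for system, ranking in sorted(
    runs.items(), key=lambda kv: -average_precision(kv[1], qrels)
):
    print(f"{system}: AP = {average_precision(ranking, qrels):.3f}")
```

The cost the paper targets lives in Step 2: every pooled document for every topic requires a human judgement, which is why reducing the number of topics, replacing assessors with crowdsourcing, or dispensing with relevance judgements altogether (as the title suggests) can make evaluation substantially cheaper.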