Paper Title
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
Paper Authors
Paper Abstract
To evaluate Information Retrieval (IR) effectiveness, a possible approach is to use test collections, which are composed of a collection of documents, a set of descriptions of information needs (called topics), and a set of relevant documents for each topic. Test collections are modelled on a competition scenario: for example, in the well-known TREC initiative, participants run their own retrieval systems over a set of topics and provide a ranked list of retrieved documents; some of the retrieved documents (usually the top-ranked ones) constitute the so-called pool, and their relevance is evaluated by human assessors; the document lists are then used to compute effectiveness metrics and to rank the participant systems. Private Web Search companies also run their own in-house evaluation exercises; although the details are mostly unknown and the aims are somewhat different, the overall approach shares several issues with the test collection approach. The aim of this work is to: (i) develop and improve some state-of-the-art work on the evaluation of IR effectiveness while saving resources, and (ii) propose a novel, more principled and engineered, overall approach to test-collection-based effectiveness evaluation. [...]
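The abstract summarizes the standard pooling-based evaluation protocol. As a minimal sketch of that protocol (not the paper's own method), the following illustrates depth-k pooling and scoring with average precision on toy data; all names here (POOL_DEPTH, runs, qrels, average_precision) and the sample documents are illustrative assumptions, not taken from the paper.

```python
# Sketch of the test-collection protocol described above, on toy data.
# Each "system" contributes a ranked list of document IDs for one topic.

POOL_DEPTH = 2  # depth-k pooling: only the top-k documents per run are judged

# Ranked results of two hypothetical participant systems for a single topic.
runs = {
    "systemA": ["d1", "d3", "d2", "d5"],
    "systemB": ["d3", "d4", "d1", "d2"],
}

# Step 1: pool the top-ranked documents across all participating runs.
pool = set()
for ranking in runs.values():
    pool.update(ranking[:POOL_DEPTH])

# Step 2: human assessors judge only the pooled documents (simulated here);
# documents outside the pool are conventionally treated as non-relevant.
qrels = {"d1": 1, "d3": 1, "d4": 0}  # assessor judgements on the pool

def average_precision(ranking, qrels):
    """Average precision, counting unjudged documents as non-relevant."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            hits += 1
            precision_sum += hits / rank
    total_relevant = sum(1 for judgement in qrels.values() if judgement > 0)
    return precision_sum / total_relevant if total_relevant else 0.0

# Step 3: compute the effectiveness metric and rank the participant systems.
for system, ranking in sorted(
    runs.items(), key=lambda kv: -average_precision(kv[1], qrels)
):
    print(f"{system}: AP = {average_precision(ranking, qrels):.3f}")
```

The cost the paper targets lives in Step 2: every pooled document for every topic requires a human judgement, which is why reducing the number of topics, replacing assessors with crowdsourcing, or dispensing with relevance judgements altogether (as the title suggests) can make evaluation substantially cheaper.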