Title
Drawing Causal Inferences About Performance Effects in NLP
Authors
Abstract
This article emphasizes that NLP as a science seeks to make inferences about the performance effects that result from applying one method (compared to another method) in the processing of natural language. Yet NLP research in practice usually does not achieve this goal: In NLP research articles, typically only a few models are compared. Each model results from a specific procedural pipeline (here named processing system) that is composed of a specific collection of methods that are used in preprocessing, pretraining, hyperparameter tuning, and training on the target task. To make generalizing inferences about the performance effect that is caused by applying some method A vs. another method B, it is not sufficient to compare a few specific models that are produced by a few specific (probably incomparable) processing systems. Rather, the following procedure would allow drawing inferences about methods' performance effects: (1) A population of processing systems that researchers seek to infer to has to be defined. (2) A random sample of processing systems from this population is drawn. (The drawn processing systems in the sample will vary with regard to the methods they apply along their procedural pipelines and also will vary regarding the compositions of their training and test data sets used for training and evaluation.) (3) Each processing system is applied once with method A and once with method B. (4) Based on the sample of applied processing systems, the expected generalization errors of method A and method B are approximated. (5) The difference between the expected generalization errors of method A and method B is the estimated average treatment effect due to applying method A compared to method B in the population of processing systems.
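The five-step procedure in the abstract can be illustrated with a small simulation. What follows is only a minimal sketch under strong assumptions, not the authors' implementation: the toy population of processing systems, the function names (`sample_processing_system`, `simulated_error`, `estimate_ate`), and the constant `TRUE_EFFECT` are all hypothetical, and `simulated_error` merely stands in for actually training and evaluating a pipeline.

```python
import random
import statistics

# Assumption: a made-up "true" effect of method A relative to method B,
# used only so that the simulation has something to recover.
TRUE_EFFECT = -0.02


def sample_processing_system(rng):
    """Steps 1-2: draw one system at random from a toy, hypothetical population."""
    return {
        "preprocessing": rng.choice(["lowercase", "cased"]),
        "learning_rate": rng.choice([1e-5, 3e-5, 5e-5]),
        "data_seed": rng.randrange(10_000),  # stands in for a train/test split
    }


def simulated_error(system, method, rng):
    """Step 3 stand-in: pretend to train and evaluate, returning a test error."""
    base = 0.30 if system["preprocessing"] == "lowercase" else 0.28
    base += 1000 * system["learning_rate"]  # system-specific variation
    effect = TRUE_EFFECT if method == "A" else 0.0
    return base + effect + rng.gauss(0.0, 0.01)  # run-to-run noise


def estimate_ate(n_systems, seed=0):
    """Steps 4-5: approximate expected errors and take their difference."""
    rng = random.Random(seed)
    errors_a, errors_b = [], []
    for _ in range(n_systems):
        system = sample_processing_system(rng)
        # Each sampled system is applied once with A and once with B.
        errors_a.append(simulated_error(system, "A", rng))
        errors_b.append(simulated_error(system, "B", rng))
    mean_a = statistics.fmean(errors_a)
    mean_b = statistics.fmean(errors_b)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    std_error = statistics.stdev(diffs) / n_systems ** 0.5
    return {
        "error_a": mean_a,
        "error_b": mean_b,
        "ate": mean_a - mean_b,  # estimated average treatment effect
        "std_error": std_error,
    }


if __name__ == "__main__":
    print(estimate_ate(n_systems=200))
```

Because every sampled system is run with both methods, the estimate is the mean of paired differences, so variation between systems cancels out of the comparison. In a real study, `simulated_error` would be replaced by actual training and evaluation runs, and the population in `sample_processing_system` would have to be defined by the researchers.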