Paper Title

Rethinking Coherence Modeling: Synthetic vs. Downstream Tasks

Authors

Tasnim Mohiuddin, Prathyusha Jwalapuram, Xiang Lin, Shafiq Joty

Abstract

Although coherence modeling has come a long way in developing novel models, their evaluation on downstream applications for which they are purportedly developed has largely been neglected. With the advancements made by neural approaches in applications such as machine translation (MT), summarization and dialog systems, the need for coherence evaluation of these tasks is now more crucial than ever. However, coherence models are typically evaluated only on synthetic tasks, which may not be representative of their performance in downstream applications. To investigate how representative the synthetic tasks are of downstream use cases, we conduct experiments on benchmarking well-known traditional and neural coherence models on synthetic sentence ordering tasks, and contrast this with their performance on three downstream applications: coherence evaluation for MT and summarization, and next utterance prediction in retrieval-based dialog. Our results demonstrate a weak correlation between the model performances in the synthetic tasks and the downstream applications, motivating alternate training and evaluation methods for coherence models.
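
To make the abstract's setup concrete, below is a minimal sketch (not the authors' code; all model names and scores are hypothetical, and scipy's spearmanr is used as the correlation measure) of the kind of analysis it describes: ranking coherence models by accuracy on a synthetic sentence-ordering discrimination task and checking how well that ranking agrees with their accuracy on a downstream task.

```python
# Minimal sketch of the correlation analysis described in the abstract.
# Not the authors' code: model names and scores below are hypothetical.
from scipy.stats import spearmanr

# Hypothetical accuracies on the synthetic discrimination task
# (score the original document higher than a sentence-shuffled copy).
synthetic_acc = {"entity_grid": 0.84, "neural_pairwise": 0.91, "lm_coherence": 0.88}

# Hypothetical accuracies of the same models on a downstream task,
# e.g. next-utterance prediction in retrieval-based dialog.
downstream_acc = {"entity_grid": 0.55, "neural_pairwise": 0.52, "lm_coherence": 0.61}

models = sorted(synthetic_acc)
rho, p_value = spearmanr(
    [synthetic_acc[m] for m in models],
    [downstream_acc[m] for m in models],
)

# A rho near zero (or negative) would reflect the weak correlation the
# abstract reports between synthetic and downstream performance.
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```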
