A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Paper title

A Multilingual Study of Multi-Sentence Compression using Word Vertex-Labeled Graphs and Integer Linear Programming

Authors

Pontes, Elvys Linhares; Huet, Stéphane; Torres-Moreno, Juan-Manuel; da Silva, Thiago G.; Linhares, Andréa Carneiro

Abstract

Multi-Sentence Compression (MSC) aims to generate a short sentence with the key information from a cluster of similar sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes an Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, with the goal of generating more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets in three languages: French, Portuguese, and Spanish. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. In additional tests, which take advantage of the fact that the length of compressions can be modulated, we still improve ROUGE scores with shorter output sentences.
