Paper Title
GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
Paper Authors
Paper Abstract
Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English, and they often require a large training corpus as well as efficient pre-trained language models and tools. However, applying such English-centric summarization models to low-resource Indian languages is often limited by rich morphological variation and by syntactic and semantic differences. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, our proposed model, GAE-ISumm, uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide a manually annotated Telugu summarization dataset, TELSUM, to experiment with our model GAE-ISumm. Further, we experiment with most of the publicly available Indian language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments with GAE-ISumm on seven languages yield the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it establishes benchmark results on TELSUM, and (iii) including positional and cluster information in the proposed model improves the quality of the resulting summaries.
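To make the general idea concrete, below is a minimal, hypothetical Python sketch of graph-autoencoder-based extractive summarization in the spirit of the abstract: sentences become nodes in a TF-IDF cosine-similarity graph, a one-layer GCN encoder with an inner-product decoder is trained to reconstruct that graph, and sentences are ranked by centrality in the learned embedding space. The graph construction, encoder architecture, and ranking rule are illustrative assumptions, not GAE-ISumm's actual design, and the positional and cluster information mentioned in the abstract is omitted.

```python
# Hypothetical sketch: unsupervised extractive summarization with a graph autoencoder.
# The similarity threshold, GCN encoder, inner-product decoder, and centrality-based
# ranking are illustrative choices, not the authors' exact model.
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

def build_sentence_graph(sentences, threshold=0.1):
    """Adjacency matrix from cosine similarity of TF-IDF sentence vectors."""
    tfidf = TfidfVectorizer().fit_transform(sentences)        # (n_sent, vocab), rows L2-normalized
    sim = (tfidf @ tfidf.T).toarray()                         # cosine similarity between sentences
    adj = (sim > threshold).astype(np.float32)                # binarize into an undirected graph
    np.fill_diagonal(adj, 1.0)                                # add self-loops
    return torch.tensor(adj), torch.tensor(tfidf.toarray(), dtype=torch.float32)

class GAE(nn.Module):
    """One-layer GCN encoder with an inner-product decoder."""
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.weight = nn.Linear(in_dim, hid_dim, bias=False)

    def encode(self, adj, feats):
        adj_norm = adj / adj.sum(dim=1, keepdim=True)         # simple row normalization
        return torch.relu(self.weight(adj_norm @ feats))      # sentence (node) embeddings

    def forward(self, adj, feats):
        z = self.encode(adj, feats)
        return torch.sigmoid(z @ z.T), z                      # reconstructed adjacency, embeddings

def summarize(sentences, k=3, epochs=200):
    adj, feats = build_sentence_graph(sentences)
    model = GAE(feats.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):                                   # unsupervised: reconstruct the graph
        opt.zero_grad()
        recon, _ = model(adj, feats)
        loss_fn(recon, adj).backward()
        opt.step()
    with torch.no_grad():
        _, z = model(adj, feats)
        centrality = (z @ z.T).sum(dim=1)                     # how central each sentence is
    top = sorted(torch.topk(centrality, min(k, len(sentences))).indices.tolist())
    return [sentences[i] for i in top]                        # keep original sentence order
```

A call such as `summarize(document_sentences, k=3)` would return the three most central sentences as the extractive summary; because the objective is purely graph reconstruction, the sketch needs no labeled data, which is the property the abstract highlights for low-resource Indian languages.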