Paper Title
GAE-ISumm: Unsupervised Graph-Based Summarization of Indian Languages
Paper Authors
Paper Abstract
Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English, and they often require a large training corpus as well as efficient pre-trained language models and tools. However, applying such English-centric summarization models to low-resource Indian languages is often limited by rich morphological variation and by syntactic and semantic differences. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, our proposed model, GAE-ISumm, uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide a manually annotated Telugu summarization dataset, TELSUM, to experiment with our model GAE-ISumm. Further, we experiment with most of the publicly available Indian language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments with GAE-ISumm on seven languages yield the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it establishes benchmark results on TELSUM, and (iii) including positional and cluster information in the proposed model improves the quality of the resulting summaries.
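To make the general idea concrete, below is a minimal, hypothetical Python sketch of graph-autoencoder-based extractive summarization in the spirit of the abstract: sentences become nodes in a TF-IDF cosine-similarity graph, a one-layer GCN encoder with an inner-product decoder is trained to reconstruct that graph, and sentences are ranked by centrality in the learned embedding space. The graph construction, encoder architecture, and ranking rule are illustrative assumptions, not GAE-ISumm's actual design, and the positional and cluster information mentioned in the abstract is omitted.

```python
# Hypothetical sketch: unsupervised extractive summarization with a graph autoencoder.
# The similarity threshold, GCN encoder, inner-product decoder, and centrality-based
# ranking are illustrative choices, not the authors' exact model.
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

def build_sentence_graph(sentences, threshold=0.1):
    """Adjacency matrix from cosine similarity of TF-IDF sentence vectors."""
    tfidf = TfidfVectorizer().fit_transform(sentences)        # (n_sent, vocab), rows L2-normalized
    sim = (tfidf @ tfidf.T).toarray()                         # cosine similarity between sentences
    adj = (sim > threshold).astype(np.float32)                # binarize into an undirected graph
    np.fill_diagonal(adj, 1.0)                                # add self-loops
    return torch.tensor(adj), torch.tensor(tfidf.toarray(), dtype=torch.float32)

class GAE(nn.Module):
    """One-layer GCN encoder with an inner-product decoder."""
    def __init__(self, in_dim, hid_dim=64):
        super().__init__()
        self.weight = nn.Linear(in_dim, hid_dim, bias=False)

    def encode(self, adj, feats):
        adj_norm = adj / adj.sum(dim=1, keepdim=True)         # simple row normalization
        return torch.relu(self.weight(adj_norm @ feats))      # sentence (node) embeddings

    def forward(self, adj, feats):
        z = self.encode(adj, feats)
        return torch.sigmoid(z @ z.T), z                      # reconstructed adjacency, embeddings

def summarize(sentences, k=3, epochs=200):
    adj, feats = build_sentence_graph(sentences)
    model = GAE(feats.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):                                   # unsupervised: reconstruct the graph
        opt.zero_grad()
        recon, _ = model(adj, feats)
        loss_fn(recon, adj).backward()
        opt.step()
    with torch.no_grad():
        _, z = model(adj, feats)
        centrality = (z @ z.T).sum(dim=1)                     # how central each sentence is
    top = sorted(torch.topk(centrality, min(k, len(sentences))).indices.tolist())
    return [sentences[i] for i in top]                        # keep original sentence order
```

A call such as `summarize(document_sentences, k=3)` would return the three most central sentences as the extractive summary; because the objective is purely graph reconstruction, the sketch needs no labeled data, which is the property the abstract highlights for low-resource Indian languages.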