Paper title
Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer
Paper authors
Paper abstract
This paper explores the relevance of the Text-To-Text Transfer Transformer language model (T5) for Polish (plT5) to the task of intrinsic and extrinsic keyword extraction from short text passages. The evaluation is carried out on the new Polish Open Science Metadata Corpus (POSMAC), a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project, which is released with this paper. We compare the results obtained by four different methods, namely plT5kw, extremeText, TermoPL, and KeyBERT, and conclude that the plT5kw model yields particularly promising results for both frequent and sparsely represented keywords. Furthermore, a plT5kw keyword generation model trained on POSMAC also seems to produce highly useful results in cross-domain text labelling scenarios. We discuss the performance of the model on news stories and phone-based dialog transcripts, which represent text genres and domains extrinsic to the dataset of scientific abstracts. Finally, we also attempt to characterize the challenges of evaluating a text-to-text model on both intrinsic and extrinsic keyword extraction.