Paper Title

Deep Learning Based Multi-Label Text Classification of UNGA Resolutions

Authors

Francesco Sovrano, Monica Palmirani, Fabio Vitali

Abstract

The main goal of this research is to produce useful software for the United Nations (UN) that could help speed up the process of qualifying UN documents according to the Sustainable Development Goals (SDGs), in order to monitor progress at the world level in fighting poverty, discrimination, and climate change. Human labeling of UN documents would be a daunting task given the size of the affected corpus. Thus, automatic labeling must be adopted at least as a first step of a multi-phase process to reduce the overall effort of cataloguing and classifying. Deep Learning (DL) is nowadays one of the most powerful tools for state-of-the-art (SOTA) AI for this task, but very often it comes at the cost of an expensive and error-prone preparation of a training set. In the case of multi-label text classification of domain-specific text, it seems that DL cannot be adopted effectively without a sufficiently large domain-specific training set. In this paper, we show that this is not always true. We propose a novel method that is able, through statistics like TF-IDF, to exploit pre-trained SOTA DL models (such as the Universal Sentence Encoder) without any need for traditional transfer learning or any other expensive training procedure. We show the effectiveness of our method in a legal context by classifying UN Resolutions according to their most related SDGs.
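To make the idea of training-free labeling concrete, the sketch below ranks SDG labels for a document purely by TF-IDF cosine similarity between the document and a short description of each goal. It is not the authors' method: the abstract does not say how TF-IDF and the pre-trained encoder are combined, so the function names (`tokenize`, `tfidf_vectors`, `rank_labels`), the `threshold` parameter, and the sample resolution sentence are all assumptions made for illustration. Only the two SDG titles are taken from the official goal list.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on any non-alphanumeric character."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(document, label_descriptions, threshold=0.0):
    """Rank labels by similarity to the document; no training step involved.

    Labels scoring strictly above `threshold` are returned as the
    multi-label prediction (hypothetical decision rule).
    """
    names = list(label_descriptions)
    texts = [label_descriptions[name] for name in names] + [document]
    vectors = tfidf_vectors(texts)
    doc_vec = vectors[-1]
    scores = {name: cosine(doc_vec, vec) for name, vec in zip(names, vectors)}
    ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    predicted = [name for name, score in ranking if score > threshold]
    return ranking, predicted


# Hypothetical usage: two SDG titles and an invented resolution sentence.
sdg_descriptions = {
    "SDG 1": "End poverty in all its forms everywhere",
    "SDG 13": "Take urgent action to combat climate change and its impacts",
}
resolution = "The General Assembly calls for urgent action on climate change"
ranking, predicted = rank_labels(resolution, sdg_descriptions)
print(ranking[0][0], predicted)
```

In a fuller pipeline, the similarity from a pre-trained sentence encoder could be averaged with this lexical score, but that weighting would be a further assumption rather than something the abstract specifies.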
