Paper Title

Large Scale Legal Text Classification Using Transformer Models

Paper Authors

Zein Shaheen, Gerhard Wohlgenannt, Erwin Filtz

Abstract

Large multi-label text classification is a challenging Natural Language Processing (NLP) problem that is concerned with text classification for datasets with thousands of labels. We tackle this problem in the legal domain, where datasets, such as JRC-Acquis and EURLEX57K labeled with the EuroVoc vocabulary were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing and discriminative learning rates in order to reach competitive classification performance, and present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing in an ablation study, and provide reference dataset splits created with an iterative stratification algorithm.
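The fine-tuning strategies named in the abstract, discriminative learning rates and gradual unfreezing, can be made concrete with a short sketch. The snippet below is not the authors' implementation; it is a minimal illustration assuming a BERT-style encoder from Hugging Face Transformers and PyTorch, and the helper names (`discriminative_param_groups`, `gradually_unfreeze`) are hypothetical.

```python
# Illustrative sketch only (not the paper's code): multi-label fine-tuning of a
# BERT-style encoder with discriminative learning rates and gradual unfreezing.
# Assumes PyTorch and Hugging Face Transformers are installed.
import torch
from torch import nn
from transformers import AutoModel

MODEL_NAME = "bert-base-uncased"  # placeholder; any BERT-style checkpoint works here
NUM_LABELS = 7000                 # approximate size of the EuroVoc label space


class MultiLabelClassifier(nn.Module):
    """Encoder plus a linear head producing one logit per EuroVoc concept."""

    def __init__(self, model_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # logits from the [CLS] token


def discriminative_param_groups(model, base_lr=2e-5, decay=0.9):
    """Hypothetical helper: geometrically smaller learning rates for lower layers."""
    layers = list(model.encoder.encoder.layer)          # BERT-style layer stack
    groups = [{"params": model.head.parameters(), "lr": base_lr}]
    for depth, layer in enumerate(reversed(layers), start=1):
        groups.append({"params": layer.parameters(), "lr": base_lr * decay ** depth})
    groups.append({"params": model.encoder.embeddings.parameters(),
                   "lr": base_lr * decay ** (len(layers) + 1)})
    return groups


def gradually_unfreeze(model, epoch: int):
    """Hypothetical helper: unfreeze one more top encoder layer per epoch."""
    layers = list(model.encoder.encoder.layer)
    for p in model.encoder.parameters():
        p.requires_grad = False                          # freeze everything first
    for layer in layers[len(layers) - min(epoch, len(layers)):]:
        for p in layer.parameters():                     # then re-enable the top layers
            p.requires_grad = True


model = MultiLabelClassifier(MODEL_NAME, NUM_LABELS)
optimizer = torch.optim.AdamW(discriminative_param_groups(model))
loss_fn = nn.BCEWithLogitsLoss()  # standard loss for multi-label targets
```

This follows the general ULMFiT-style recipe the abstract refers to: lower encoder layers receive smaller learning rates, and one additional top layer is unfrozen per epoch while the classification head remains trainable throughout. The reference dataset splits mentioned in the abstract are provided by the authors; multi-label-aware partitions of this kind can also be produced with an iterative stratification implementation such as the one in scikit-multilearn, though the exact procedure used is described in the paper itself.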
