Paper Title
Legal Document Classification: An Application to Law Area Prediction of Petitions to Public Prosecution Service
Paper Authors
Abstract
In recent years, there has been increased interest in the application of Natural Language Processing (NLP) to legal documents. The use of convolutional and recurrent neural networks along with word embedding techniques has presented promising results when applied to textual classification problems, such as sentiment analysis and topic segmentation of documents. This paper proposes the use of NLP techniques for textual classification, with the purpose of categorizing the descriptions of the services provided by the Public Prosecutor's Office of the State of Paraná to the population into one of the areas of law covered by the institution. Our main goal is to automate the process of assigning petitions to their respective areas of law, thereby reducing the costs and time associated with this process while allowing human resources to be allocated to more complex tasks. In this paper, we compare different approaches to word representation for this task, including document-term matrices and a few different word embeddings. With regard to the classification models, we evaluated three different families: linear models, boosted trees and neural networks. The best results were obtained with a combination of Word2Vec trained on a domain-specific corpus and a Recurrent Neural Network (RNN) architecture (more specifically, LSTM), leading to an accuracy of 90% and an F1-Score of 85% in the classification of eighteen categories (law areas).
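The gap between the reported accuracy and F1-Score is typical of multi-class problems with uneven class sizes. The following is a minimal sketch of how the two metrics can diverge, not the authors' evaluation code. It assumes the F1-Score is macro-averaged (the abstract does not state the averaging scheme), and the label names and toy predictions are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred.

    Macro averaging is an assumption here; the abstract does not say
    which averaging scheme was used.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: a frequent class and a rare class.
y_true = ["civil"] * 8 + ["criminal"] * 2
y_pred = ["civil"] * 8 + ["civil", "criminal"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case a single error on the rare class leaves accuracy high while pulling the macro-averaged F1 down, because every class contributes equally to the average regardless of its size.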