HLDC: Hindi Legal Documents Corpus

Paper Title

HLDC: Hindi Legal Documents Corpus

Authors

Kapoor, Arnav; Dhawan, Mudit; Goel, Anmol; Arjun, T. H.; Bhatnagar, Akshala; Agrawal, Vibhu; Agrawal, Amul; Bhattacharya, Arnab; Kumaraguru, Ponnurangam; Modi, Ashutosh

Abstract

Many populous countries including India are burdened with a considerable backlog of legal cases. Development of automated systems that could process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of the high-quality corpora that are needed to develop such data-driven systems. The problem gets even more pronounced in the case of low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for the same. MTL models use summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models are indicative of the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC
