Paper title
Named Entity Recognition in the Legal Domain using a Pointer Generator Network
Paper authors
Paper abstract
Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include case parties, judges, names of courts, case numbers, references to laws, etc. We study the problem of legal NER on noisy text extracted from PDF files of court cases filed in US courts. The "gold standard" training data for NER systems annotate each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from gold standard NER data in that the exact locations of the entities in the text are unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g. text extraction errors and/or typos and unknown label indices, we formulate NER as a text-to-text sequence generation task and train a pointer generator network to generate the entities in the document rather than label them. We show that the pointer generator can be effective for NER in the absence of gold standard data and outperforms common NER neural network architectures on long legal documents.
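The abstract does not spell out how a pointer generator network combines generating and copying. As a rough illustration only, the sketch below follows the commonly used mixing rule from See et al. (2017), which is assumed here rather than taken from this paper: the output probability of a word is p_gen times its vocabulary probability plus (1 - p_gen) times the attention mass on source positions holding that word. The function name `final_distribution`, the toy tokens, and all numbers are hypothetical.

```python
from collections import defaultdict


def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from a fixed vocabulary with copying from the source.

    Assumed rule: P(w) = p_gen * P_vocab(w)
                         + (1 - p_gen) * sum of attention over positions where the source token is w.
    """
    if len(attention) != len(source_tokens):
        raise ValueError("attention needs one weight per source token")
    mixed = defaultdict(float)
    # Generation path: scale the vocabulary distribution by p_gen.
    for word, prob in vocab_dist.items():
        mixed[word] += p_gen * prob
    # Copy path: add attention mass to whichever token sits at each position,
    # so out-of-vocabulary tokens (typos, OCR errors) can still be produced.
    for token, weight in zip(source_tokens, attention):
        mixed[token] += (1.0 - p_gen) * weight
    return dict(mixed)


# Hypothetical noisy source text: "Jhon" is a typo that is not in the vocabulary.
source_tokens = ["Plaintiff", "Jhon", "Smith", "v.", "Acme"]
attention = [0.05, 0.6, 0.25, 0.05, 0.05]              # sums to 1
vocab_dist = {"John": 0.5, "Smith": 0.3, "v.": 0.2}    # sums to 1

dist = final_distribution(0.2, vocab_dist, attention, source_tokens)
best = max(dist, key=dist.get)  # the token the decoder would most likely emit
```

With a low p_gen, most probability mass comes from the copy path, so in this toy setup the misspelled source token can be emitted as it appears in the document. This is one plausible reading of why generating entities suits training data whose entity strings contain typos and whose positions are unlabeled.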