DeeperDive: The Unreasonable Effectiveness of Weak Supervision in Document Understanding A Case Study in Collaboration with UiPath Inc

Paper Title

DeeperDive: The Unreasonable Effectiveness of Weak Supervision in Document Understanding A Case Study in Collaboration with UiPath Inc

Authors

Emad Elwany, Allison Hegel, Marina Shah, Brendan Roof, Genevieve Peaslee, Quentin Rivet

Abstract

Weak supervision has been applied to various Natural Language Understanding tasks in recent years. Due to technical challenges with scaling weak supervision to work on long-form documents, spanning up to hundreds of pages, applications in the document understanding space have been limited. At Lexion, we built a weak supervision-based system tailored for long-form (10-200 pages long) PDF documents. We use this platform for building dozens of language understanding models and have applied it successfully to various domains, from commercial agreements to corporate formation documents. In this paper, we demonstrate the effectiveness of supervised learning with weak supervision in a situation with limited time, workforce, and training data. We built 8 high quality machine learning models in the span of one week, with the help of a small team of just 3 annotators working with a dataset of under 300 documents. We share some details about our overall architecture, how we utilize weak supervision, and what results we are able to achieve. We also include the dataset for researchers who would like to experiment with alternate approaches or refine ours. Furthermore, we shed some light on the additional complexities that arise when working with poorly scanned long-form documents in PDF format, and some of the techniques that help us achieve state-of-the-art performance on such data.
