Paper Title
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Paper Authors
Paper Abstract
One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that have the promise to improve access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.