Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Paper title

Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Authors

Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, Margaret Mitchell

Abstract

Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However, the datasets that empower machine learning are often used, shared, and re-used with little visibility into the processes of deliberation that led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural, and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, and that draw attention to the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems by making visible the often overlooked work that goes into dataset creation.
