Paper Title
BigScience: A Case Study in the Social Construction of a Multilingual Large Language Model
Paper Authors
Paper Abstract
The BigScience Workshop was a value-driven initiative that spanned one and a half years of interdisciplinary research and culminated in the creation of ROOTS, a 1.6TB multilingual dataset that was used to train BLOOM, one of the largest multilingual language models to date. In addition to the technical outcomes and artifacts, the workshop fostered multidisciplinary collaborations around large models, datasets, and their analysis. This in turn led to a wide range of research publications spanning topics from ethics to law, data governance, modeling choices, and distributed training. This paper focuses on the collaborative research aspects of BigScience and takes a step back to look at the challenges of large-scale participatory research, with respect to participant diversity and the tasks required to successfully carry out such a project. Our main goal is to share the lessons we learned from this experience: what we could have done better and what we did well. We show how the impact of such a social approach to scientific research goes well beyond the technical artifacts that were the basis of its inception.