Paper title

Datasheet for the Pile

Authors

Stella Biderman, Kieran Bicheno, Leo Gao

Abstract

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile is composed of 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.
