The Pile [17] is an 825GB dataset of diverse text that has become popular in large language model research [34, 48, 10, 22, 13, 8]. It differs from other datasets mainly in its diversity: it contains 22 sub-datasets, which can be roughly categorized into webpages, dialogue, books, science, and code [62], with their proportions shown in Figure 1. Training language models on diverse datasets of the Pile’s size within a reasonable time requires access to expensive computing resources.
However, less well-funded ML researchers do not have access to supercomputers and typically fall back on small-scale, homogeneous datasets that are unrepresentative of contemporary general-purpose language models. For example, the popular enwik8 [39] / WikiText103 [40] corpora (0.1/1GB large) are still heavily used to validate novel research ideas, even though they consist only of Wikipedia articles and are relatively small. Nagatsuka et al. [43] show that pre-training a BERT model on WikiText103 results in GLUE downstream performance much worse than that of the original BERT model [12], which was trained on a 16GB corpus.
Figure 1: MiniPile and Other Pre-Training Datasets.
In this work, we aim to fill this gap by introducing MiniPile, a curated subset of the Pile [17] that comprises 1 million documents and an uncompressed volume of 6GB. Our goal is to facilitate research on data-efficient language model pre-training, joining a broader line of recent work challenging the need for ever-growing computational resources [51, 23, 61, 18, 44].
To curate MiniPile and filter out documents we consider harmful or low-quality, we cluster the embedding space of the Pile documents using a state-of-the-art embedding model. Then, we filter out unwanted clusters, with rationales provided in Section 2.1. Lastly, we provide first evidence for MiniPile being an information-rich pre-training dataset by pre-training BERT-Base and T5-Base models on it. After fine-tuning with the GLUE [55]/SNI [57] benchmark data, our pre-trained models reach reasonable downstream performances with only small drops compared to models pre-trained on much bigger datasets.
Our Pile pruning pipeline consists of three steps: (1) document embedding extraction, (2) clustering of embeddings, and (3) human-guided exclusion of unwanted clusters.
Our starting point is the deduplicated Pile dump, released by EleutherAI on the Hugging Face Hub.
First, we infer embeddings for all documents using E5-Large (EmbEddings from bidirEctional Encoder rEpresentations) [56], a state-of-the-art text embedding model, which achieves excellent performance on the MTEB benchmark [42].
Second, we cluster the embeddings, motivated by recent work demonstrating clusterability of data subset embeddings [25, 54]. For the clustering, we use batchified k-means clustering with the cosine distance between normalized embeddings, k = 220 (10 clusters per Pile subset), and batch size 16384. We examined random examples across different clusters and found clear semantic boundaries. For example, we find both clusters matching the high-level categorization from Figure 1 and more fine-grained categories, such as pure mathematics, physics, different programming languages, real estate listings, sports/crime/politics news, etc.
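The clustering step can be illustrated with a minimal sketch, assuming a standard mini-batch k-means update (a per-centroid running mean) followed by re-normalization of the centroids. The text above only states that batchified k-means with cosine distance on normalized embeddings is used, so the update rule, the initialization, and all function names below are illustrative assumptions. The sketch is written in plain Python for toy inputs, not for the k = 220, batch size 16384 setting.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_distance(a, b):
    """Cosine distance between two unit-norm vectors: 1 - <a, b>."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest cosine distance to `point`."""
    return min(range(len(centroids)),
               key=lambda j: cosine_distance(point, centroids[j]))


def minibatch_kmeans(points, k, batch_size, n_steps, seed=0):
    """Mini-batch k-means on unit-norm vectors (hypothetical helper, toy sizes)."""
    rng = random.Random(seed)
    points = [normalize(p) for p in points]
    # Assumption: initialize centroids with k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_steps):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centroids.
        assignments = [nearest_centroid(p, centroids) for p in batch]
        for p, j in zip(batch, assignments):
            counts[j] += 1
            lr = 1.0 / counts[j]  # per-centroid step size shrinks over time
            centroids[j] = [(1 - lr) * c + lr * x
                            for c, x in zip(centroids[j], p)]
        # Project centroids back onto the unit sphere so that 1 - dot
        # remains a cosine distance in the next assignment step.
        centroids = [normalize(c) for c in centroids]
    return centroids
```

Because all vectors have unit norm, the cosine distance reduces to one minus a dot product, which is why normalizing the embeddings once up front keeps the assignment step cheap.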
Third, to decide whether to keep or drop a cluster, we first sort the documents within each cluster by their distance to their assigned centroid [54]. Then, a human annotator (the author) judges the data quality based on the five closest and five most distant examples. We stress that this is a rough estimate of the entire cluster’s data, and we may unintentionally exclude some valuable examples; nonetheless, considering the immense size of the Pile and the numerous remaining clusters, we deem this approximation to be sufficiently accurate.
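The selection of examples shown to the annotator can be sketched as follows. This is a minimal illustration, assuming the centroid distances of one cluster's documents are already available; the function name `review_sample` and the document identifiers are hypothetical.

```python
def review_sample(doc_ids, distances, n=5):
    """Return the n closest and the n most distant documents of one cluster.

    `distances[i]` is the distance of `doc_ids[i]` to the cluster centroid.
    For clusters with fewer than 2 * n documents, the two lists overlap.
    """
    ranked = sorted(zip(distances, doc_ids))  # ascending distance
    closest = [doc for _, doc in ranked[:n]]
    farthest = [doc for _, doc in ranked[-n:]][::-1]  # most distant first
    return closest, farthest


# Toy usage with made-up distances for twelve documents.
ids = [f"d{i}" for i in range(12)]
dists = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05, 0.95, 0.45]
closest, farthest = review_sample(ids, dists)
```

Inspecting both ends of the ranking gives a view of the cluster's typical documents as well as its outliers, which is the rough estimate of cluster quality described above.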
In preliminary BERT training runs, we also tried selecting only the top-l documents closest to the centroid of their assigned clusters, which one may interpret as excluding hard-to-learn outlier examples [54]. However, we observed worse GLUE
2.1 Excluded Clusters
Table 1: Examples of Excluded Clusters.
• Near-duplicate documents will contain repetitions, which have been shown to degrade model performance [33, 21, 1].
• Pornography may contain sexist spurious correlations and reinforce racial/social stereotypes [9, 59].
• Webpage navigation bars/product specifications/long named entity lists entail long-tail knowledge, which is challenging to learn even for large language models up to 176B parameters [28].
2.2 MiniPile Statistics
Table 2: Wall-Clock Times for our experiments using a single NVIDIA RTX 3090 GPU.
• 6/3GB un-/compressed space requirements
• Vocab size: 32309614
• Median document length: 294
• Longest document length: 929633
The primary goal of our experiments is to verify that MiniPile is information-rich enough for pre-training a language model, which reaches reasonable fine-tuning performances on standard downstream task benchmarks. We evaluate our pre-trained models on the General Language Understanding Evaluation (GLUE) [55] and Super-Natural-Instructions (SNI) [57] benchmarks.
We run all experiments on a machine with a single NVIDIA RTX 3090 GPU and highlight the wall-clock times in Table 2.
As a reference point for comparability, we list the performance obtained by fine-tuning a publicly available checkpoint of the same model architecture but trained on more data, following the same fine-tuning protocol. We emphasize that our goal is not to attain state-of-the-art performance on GLUE/SNI; specifically for these downstream benchmarks, data selected from the target distribution [60] could be better suited. For example, Geiping and Goldstein [18] and Nawrot [44] reach downstream performances slightly better than ours, using randomly sampled subsets of C4 [49].
3.1 BERT-style Encoder-Only Masked Language Modeling
We pre-train a BERT-Base [12] model using a masked language modeling (MLM) objective. We adopt the Cramming training recipe [18] without further data filtering and use the WordPiece tokenizer with vocabulary size , Adam optimizer [29], , weight decay of 0.01 [35], one-cycle schedule [53] with peak learning rate 0.001, gradient clipping of 0.5, progressive batch size from 128 to 4096 with a linear increase over the course of training up to 300k steps, no warmup, 800k total training steps, and weight averaging of the k = 5 latest checkpoints and 1k steps distance between them [24].
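The progressive batch size schedule described above can be sketched as a linear ramp from 128 to 4096 over the first 300k steps, after which the batch size is held constant. The rounding to the nearest integer is an assumption; an actual training loop may instead round to a multiple of the micro-batch size. The function name `batch_size_at` is hypothetical.

```python
def batch_size_at(step, start=128, end=4096, ramp_steps=300_000):
    """Linearly ramp the batch size from `start` to `end`, then hold `end`."""
    if step >= ramp_steps:
        return end
    frac = step / ramp_steps  # fraction of the ramp completed so far
    # Assumption: round to the nearest integer batch size.
    return int(round(start + frac * (end - start)))
```

Halfway through the ramp (step 150k), this gives a batch size of 2112, the midpoint between 128 and 4096.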
3.2 T5-style Encoder-Decoder Span Corruption
We pre-train a T5v1.1-Base [49, 52] model using the original span-corrupting MLM objective and SentencePiece [31] tokenizer. We mostly follow [44] and use the AdamW optimizer [35] with matrix-wise LR scaling by its root mean square (RMS), base learning rate 0.02, no weight decay, cosine schedule with final of [36], gradient clipping of 1.0, batch size of 288, 10k warmup steps, 65536 total training steps, and weight averaging of the k = 5 latest checkpoints and 1k steps distance between them [24].
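The weight averaging used in both training recipes (the k = 5 latest checkpoints, spaced 1k steps apart [24]) can be sketched as follows. This is a minimal illustration on plain Python lists: the class name, the parameter layout (a dict mapping names to flat lists of floats), and the choice to skip step 0 are assumptions, not details taken from the recipes.

```python
from collections import deque


class LatestWeightAverager:
    """Keep the k most recent checkpoints, taken every `every` steps."""

    def __init__(self, k=5, every=1000):
        self.every = every
        self.buffer = deque(maxlen=k)  # the oldest checkpoint is dropped automatically

    def maybe_store(self, step, params):
        """Copy the parameters into the buffer if `step` is a checkpoint step."""
        if step > 0 and step % self.every == 0:
            self.buffer.append({name: list(vals) for name, vals in params.items()})

    def average(self):
        """Uniform average over the stored checkpoints, per parameter entry."""
        if not self.buffer:
            raise ValueError("no checkpoints stored yet")
        n = len(self.buffer)
        first = self.buffer[0]
        return {
            name: [sum(ckpt[name][i] for ckpt in self.buffer) / n
                   for i in range(len(first[name]))]
            for name in first
        }
```

In a training loop, `maybe_store` would be called after every optimizer step, and `average` would produce the weights used for evaluation while the optimizer continues from the unaveraged weights.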
3.3 Discussion
Tables 3 and 4 show the results compared against the publicly available checkpoints trained on 2.6x/745x the amount of data, where we took the numbers from [18] and [44], respectively. We observe minor reductions in final downstream performance and conjecture that MiniPile is a well-suited pre-training corpus for common downstream benchmarks. For future reference, Table 5 includes the performances of the pre-trained models on MiniPile’s dev and test sets.
Pre-Training Datasets. Various subsets of Wikipedia dumps have been used in language modeling papers, e.g., enwik8 [39] or WikiText [40]. BookCorpus [4] contains >7k unpublished books with long stretches of contiguous text. OpenWebText [19] and C4 [49] contain text from crawled webpages. Concurrent with this work, Warstadt et al. [58] announced the BabyLM challenge, with under 100M words of transcribed speech. In contrast to the Pile and MiniPile, these corpora contain less diverse text.
Data Quality. The quality of datasets has been questioned in various works, especially in the context of massive collections of web-crawled data [47, 30]. Potential issues with such data include token repetitions [33, 21, 1], misogyny, pornography, and malignant stereotypes [9, 7], benchmark data contamination [11, 14], spurious correlations [27, 37, 38], diluted robustness due to data mixing [45], and potentially sensitive information [20].
Table 3: GLUE-dev performances of BERT-Base with results provided by Geiping and Goldstein [18], and our model pre-trained only on MiniPile.
Table 4: SNI performances of baseline T5v1.1-Base [44] and our model pre-trained only on MiniPile.
Table 5: Model performances averaged across the MiniPile dev and test set.
I thank Matt Kusner and Ari Morcos for their advice on using the k-means algorithm on text embeddings and Jonas Geiping for guidance on BERT pre-training.
[1] Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, and Ari S Morcos. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.
[2] Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, and Pasin Manurangsi. Large-scale differentially private bert. arXiv preprint arXiv:2108.01624, 2021.
[3] Rohan Anil, Vineet Gupta, Tomer Koren, Kevin Regan, and Yoram Singer. Scalable second order optimization for deep learning, 2021.
pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
[13] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster, 2023.
[14] Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. arXiv preprint arXiv:2104.08758, 2021.
[15] N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, Y Bai, A Chen, T Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
[16] Daniel Y Fu, Tri Dao, Khaled Kamal Saab, Armin W Thomas, Atri Rudra, and Christopher Re. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023.
[17] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[18] Jonas Geiping and Tom Goldstein. Cramming: Training a language model on a single gpu in one day. arXiv preprint arXiv:2212.14034, 2022.
[19] Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
[20] Peter Henderson, Mark Simon Krass, Lucia Zheng, Neel Guha, Christopher D Manning, Dan Jurafsky, and Daniel E. Ho. Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL https://openreview.net/forum?id=3HCT3xfNm9r.
[21] Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, et al. Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487, 2022.
[22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[23] Peter Izsak, Moshe Berchansky, and Omer Levy. How to train bert with an academic budget. arXiv preprint arXiv:2104.07705, 2021.
[24] Jean Kaddour. Stop wasting my time! saving days of imagenet and bert training with latest weight averaging. arXiv preprint arXiv:2209.14981, 2022.
[25] Jean Kaddour, Steindór Sæmundsson, et al. Probabilistic active meta-learning. Advances in Neural Information Processing Systems, 33:20813–20822, 2020.
[26] Jean Kaddour, Linqing Liu, Ricardo Silva, and Matt Kusner. When do flat minima optimizers work? In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=vDeh2yxTvuh.
[27] Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. Causal machine learning: A survey and open problems. arXiv preprint arXiv:2206.15475, 2022. URL https://arxiv.org/abs/2206.15475.
[28] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411, 2022.
[29] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[30] Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al. Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 2022.
[31] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
[32] Tian Lan, Deng Cai, Yan Wang, Heyan Huang, and Xian-Ling Mao. Copy is all you need. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=CROlOA9Nd8C.
[33] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
[34] Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1, 2021.
[35] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[36] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
[37] Aengus Lynch, Jean Kaddour, and Ricardo Silva. Evaluating the impact of geometric and statistical skews on out-of-distribution generalization performance. In NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, 2022. URL https://openreview.net/forum?id=wpT79coXAu.
[38] Aengus Lynch, Gbètondji JS Dovonon, Jean Kaddour, and Ricardo Silva. Spawrious: A benchmark for fine control of spurious correlation biases. arXiv preprint arXiv:2303.05470, 2023.
[39] Matt Mahoney. Large text compression benchmark, 2011.
[40] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
[41] Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, and Luke Zettlemoyer. Nonparametric masked language modeling. arXiv preprint arXiv:2212.01349, 2022.
[42] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB Leaderboard. https://huggingface.co/spaces/mteb/leaderboard, 2023. Accessed: 17th April 2023.
[43] Koichi Nagatsuka, Clifford Broni-Bediako, and Masayasu Atsumi. Length-based curriculum learning for efficient pre-training of language models. New Generation Computing, pages 1–26, 2022.
[44] Piotr Nawrot. nanoT5, 3 2023. URL https://github.com/PiotrNawrot/nanoT5.
[45] Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Quality not quantity: On the interaction between dataset design and robustness of clip. arXiv preprint arXiv:2208.05516, 2022.
[46] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. In-context learning and induction heads, 2022.
Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, 2022.
[58] Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Wilcox, and Chengxu Zhuang. Call for papers – the babylm challenge: Sample-efficient pretraining on a developmentally plausible corpus, 2023.
[59] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.
[60] Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling, 2023. URL https://arxiv.org/abs/2302.03169.
[61] Xingcheng Yao, Yanan Zheng, Xiaocong Yang, and Zhilin Yang. Nlp from scratch without large-scale pretraining: A simple and efficient framework. In International Conference on Machine Learning, pages 25438–25451. PMLR, 2022.
[62] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
[63] Rui-Jie Zhu, Qihang Zhao, and Jason K Eshraghian. Spikegpt: Generative pre-trained language model with spiking neural networks. arXiv preprint arXiv:2302.13939, 2023.