Frontier & Future AI

Data at Frontier Scale

Every frontier model is shaped, at a fundamental level, by the data it was trained on. The model's knowledge of chemistry, its facility with Python, its understanding of historical events, its stylistic tendencies, and its embedded biases all originate in decisions made during data collection and curation — decisions that often receive far less public attention than model architecture choices. Data is not a solved problem at frontier scale. It is one of the most contested and consequential engineering challenges in the field.

How Much Data Does a Frontier Model Need?

Early large language models were trained on data measured in gigabytes. GPT-2 (2019) was trained on roughly 40 gigabytes of text from web pages. Within a few years, frontier models were consuming data measured in terabytes, and the datasets underpinning models like GPT-4 and Claude 3 are estimated to contain several trillion tokens, roughly equivalent to tens of millions of books. A token is the basic unit models process: roughly 0.75 words, or about 4 characters of English text. At those rates, three trillion tokens corresponds to approximately 12 trillion characters, or about 2.25 trillion words: enough to fill more than 20 million books of 100,000 words each. The practical challenge is that the open internet, while enormous, does not contain an unlimited supply of high-quality text. Labs training on trillions of tokens must eventually decide what counts as high-quality data and how to weight different sources.
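The conversion can be checked with a few lines of arithmetic. This is a minimal sketch using the rule-of-thumb factors quoted above (0.75 words and 4 characters per token); the 100,000-word book length is an assumption chosen for illustration, not a figure from the lesson.

```python
# Rule-of-thumb conversion factors quoted in the lesson text.
WORDS_PER_TOKEN = 0.75
CHARS_PER_TOKEN = 4
# Assumption for illustration only: an "average-length" book of 100,000 words.
WORDS_PER_BOOK = 100_000


def corpus_size(tokens: float) -> dict:
    """Convert a token count into approximate characters, words, and books."""
    words = tokens * WORDS_PER_TOKEN
    return {
        "characters": tokens * CHARS_PER_TOKEN,
        "words": words,
        "books": words / WORDS_PER_BOOK,
    }


size = corpus_size(3e12)  # three trillion tokens
print(f"{size['characters']:.2e} characters")
print(f"{size['words']:.2e} words")
print(f"{size['books']:.2e} books")
```

Changing `WORDS_PER_BOOK` shifts the book estimate proportionally, so the "millions of books" comparison is best read as an order of magnitude.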

The Chinchilla Scaling Laws

In 2022, DeepMind published research showing that most large models at the time were undertrained relative to their parameter count. The paper, known informally as 'Chinchilla' after the model it introduced, found that for a fixed compute budget, model size and training data should be scaled together, which works out to roughly 20 training tokens per parameter. By that rule, a 70-billion-parameter model should see about 1.4 trillion tokens for optimal compute efficiency. This finding reshaped how labs balance model size against training data volume.
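The 20-tokens-per-parameter rule is easy to apply directly. The sketch below is illustrative rather than a reproduction of the paper's method: the function names are hypothetical, and the compute estimate uses the common approximation of about 6 FLOPs per parameter per token, which is an added assumption not stated in the lesson.

```python
TOKENS_PER_PARAM = 20  # rule of thumb quoted in the lesson text


def optimal_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal training tokens for a given parameter count."""
    return n_params * tokens_per_param


def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, assuming C ≈ 6 * N * D (an assumption, see lead-in)."""
    return 6 * n_params * n_tokens


for n_params in (7e9, 70e9):
    tokens = optimal_tokens(n_params)
    flops = approx_training_flops(n_params, tokens)
    print(f"{n_params:.0e} params -> {tokens:.2e} tokens, ~{flops:.2e} FLOPs")
```

For the 70-billion-parameter case this reproduces the 1.4 trillion tokens quoted above. Because both factors grow together, compute under this approximation rises with the square of model size.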

Data Sources: Where Does the Data Come From?

Frontier training corpora draw from multiple source categories, each with different characteristics.

Web crawls are the largest single source. Organizations like Common Crawl regularly crawl the publicly accessible internet and release snapshots. A single Common Crawl snapshot contains billions of pages, amounting to hundreds of terabytes of raw content. The challenge is that most of the raw internet is low-quality: spam, duplicated content, machine-generated junk, and incoherent text. Turning a raw web crawl into useful training data requires aggressive filtering.

Books and long-form text provide high-quality, coherent language that differs structurally from the short, fragmented writing common on the web. Projects like Books3 and various digitized library collections have been widely used, though their legal status under copyright law is actively litigated.

Code repositories, primarily from GitHub, provide an enormous corpus of source code. Models trained on substantial amounts of code are widely reported to show improved logical reasoning, even on non-coding tasks, an observation that has shaped how frontier labs blend data sources.

Scientific papers, Wikipedia, and curated encyclopedic sources provide factual density. Wikipedia, despite containing far fewer tokens than a large web crawl, is given high weight in many training mixes because of its relatively high factual accuracy and consistent structure.

Synthetic data, meaning text generated by an earlier AI model, is increasingly used to fill gaps in coverage or to provide training examples in specialized domains. This practice raises questions about data provenance and the risk of 'model collapse' if models are trained too heavily on their own outputs.

Match each data source category to its most important characteristic or challenge.

Terms

Web crawls
Code repositories
Wikipedia
Synthetic data
Books and long-form text

Definitions

Fills coverage gaps but risks model collapse if models train heavily on their own outputs
High factual density and consistent structure, given extra weight in many training mixes
Improves logical reasoning even on non-coding tasks
Largest available source but requires aggressive filtering to remove spam and low-quality content
High-quality coherent language but faces active copyright litigation

Cleaning, Filtering, and Deduplication

Raw data from any source is not training-ready. A typical data pipeline applies multiple filtering stages.

Language identification removes text not in the target language(s).

Quality filtering uses heuristics or classifier models to score text for coherence, appropriate length, and absence of spam signals. The underlying question is whether the text could plausibly be genuine human writing, which in practice is approximated with measurable signals: pages with very high or very low words-per-sentence ratios, excessive repetition, or high proportions of special characters are often excluded.

Deduplication removes duplicate and near-duplicate documents. This matters for two reasons. First, memorizing duplicated content is rote storage rather than learning of structure. Second, heavily duplicated content distorts what the model learns to predict, causing it to over-represent whatever happens to appear many times. Exact-match deduplication catches verbatim copies. MinHash or SimHash algorithms catch near-duplicates: documents that share most but not all of their text.

Content filtering removes material the lab judges harmful or inappropriate for training. Hash matching against databases of known child sexual abuse material (CSAM) is standard. Additional filters attempt to reduce the presence of violent, hateful, or privacy-violating content, though these are imperfect.

Data mixing assigns weights to different sources so that the final training corpus reflects a deliberate balance. A lab might decide that one token of high-quality curated text is worth ten tokens of raw web crawl, and express this as a higher sampling probability for the curated source.
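Three of these stages can be sketched in a few dozen lines of standard-library Python. This is a toy illustration, not any lab's actual pipeline: every threshold, the 3-word shingle size, the 64-slot signature, and all function names are assumptions chosen for readability. The mixing function shows one possible reading of the "one token is worth ten" example, where sampling probability is proportional to available tokens times a per-token weight.

```python
import hashlib
import re


def passes_quality_heuristics(text: str) -> bool:
    """Toy quality filter. All thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < 5:  # too short to judge
        return False
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = len(words) / max(len(sentences), 1)
    unique_ratio = len({w.lower() for w in words}) / len(words)  # low means repetitive
    special_ratio = sum(not (c.isalnum() or c.isspace()) for c in text) / len(text)
    return 3 <= words_per_sentence <= 60 and unique_ratio >= 0.3 and special_ratio <= 0.3


def exact_key(text: str) -> str:
    """Hash of case- and whitespace-normalized text, for exact-match deduplication."""
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word windows; near-duplicates share most of these."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots: an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def mixing_probabilities(token_counts: dict, per_token_weight: dict) -> dict:
    """Sampling probability per source, proportional to tokens available times weight."""
    mass = {src: token_counts[src] * per_token_weight[src] for src in token_counts}
    total = sum(mass.values())
    return {src: m / total for src, m in mass.items()}


original = "The quick brown fox jumps over the lazy dog near the quiet river bank today."
near_copy = original.replace("today.", "tonight.")
spam = "buy now " * 8
docs = [original, original.replace(" ", "  "), near_copy, spam]

# Quality filter plus exact-match deduplication.
kept, seen = [], set()
for doc in docs:
    key = exact_key(doc)
    if passes_quality_heuristics(doc) and key not in seen:
        seen.add(key)
        kept.append(doc)

# Near-duplicate check on the two survivors.
similarity = estimated_jaccard(minhash_signature(original), minhash_signature(near_copy))

# Hypothetical corpus: 1B curated tokens weighted 10x, 100B web tokens weighted 1x.
probs = mixing_probabilities({"curated": 1e9, "web": 1e11}, {"curated": 10.0, "web": 1.0})
print(len(kept), round(similarity, 2), {k: round(v, 3) for k, v in probs.items()})
```

The repetitive spam line fails the quality check and the double-spaced copy is caught by exact-match hashing, but the one-word variant survives both. Its true shingle overlap with the original is 12 of 14 (about 0.86), and the MinHash estimate should land near that value, which is why a separate near-duplicate threshold is needed. In the mixing example, the curated source is about 1% of the tokens but is sampled about 9% of the time.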

What the Filter Decides

Every filtering decision embeds a value judgment. Filtering for 'coherent English' underrepresents other languages and non-standard dialects. Filtering for 'factual accuracy' requires someone to decide what counts as a fact. These decisions shape what the resulting model knows, how it speaks, and whose perspectives it reflects. Data curation is a form of editorial power, often exercised without democratic accountability.

According to the Chinchilla scaling laws, if a lab plans to train a 30-billion-parameter model with optimal compute efficiency, approximately how many training tokens should they target?

A frontier lab runs its web crawl data through a quality filter that removes documents with very high repetition. Which training outcome does this most directly prevent?

Audit a Slice of the Web

This activity builds intuition for why raw web data needs aggressive filtering before it can be used for AI training.

  1. Go to commoncrawl.org and read their overview of what a web crawl captures. Note two characteristics of raw web content that would make it problematic as direct training data.
  2. Open any five web pages at random, ideally a mix of a news site, a forum, a product page, a blog, and a Wikipedia article. For each page, rate its quality as AI training data on a 1-5 scale and write one sentence explaining your rating.
  3. Design a simple three-rule filter you would apply to a web crawl to improve quality. For each rule, predict one type of legitimate content it might accidentally remove.
  4. Consider: who decides what counts as 'high quality'? Write a short paragraph on the values embedded in your three rules and whose perspectives might be underrepresented if those rules were applied at scale.