AI Foundations

⏱ About 20 min · 20 XP

What Is a Language Model?

Every time you use autocomplete on your phone, receive a grammar suggestion, or converse with an AI assistant, you are interacting with a language model. But what is one, exactly? Not in marketing terms — in precise, technical terms. The answer is simpler and more powerful than most people expect, and understanding it will anchor everything else you learn in this module about how modern AI systems work.

The Core Idea: Probability Over Sequences

A language model is a system that assigns a probability to any sequence of text. More specifically, it learns to estimate how likely each possible next word (or token) is, given everything that came before it.

Formally: given a sequence of words w1, w2, ..., wn, a language model estimates P(wn | w1, w2, ..., wn-1), the probability of the next word given the preceding context. This is called the conditional probability of a word given its history. Multiplying these conditional probabilities together, one word at a time, gives the probability of the whole sequence.

Example: Given the partial sentence "The cat sat on the," a language model might assign:

P("mat") = 0.18
P("floor") = 0.12
P("roof") = 0.07
P("elephant") = 0.0001

The model does not rank "mat" above "elephant" because it understands cats. It has learned statistical regularities from enormous amounts of text: the word "mat" follows "cat sat on the" far more often than "elephant" does in the text the model was trained on.
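To connect "a probability for any sequence" with "a probability for the next word", here is a minimal Python sketch that scores a sequence by adding up the log of each next-word probability. The names sequence_log_prob, uniform_cond_prob, and VOCAB are hypothetical, and the uniform four-word model is a stand-in chosen only because its answer is easy to check by hand; it is not how a trained model behaves.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Sum log P(token | history) over the sequence (the chain rule)."""
    total = 0.0
    for i, token in enumerate(tokens):
        history = tokens[:i]          # everything before position i
        p = cond_prob(token, history)
        if p == 0.0:
            return float("-inf")      # an impossible token makes the whole sequence impossible
        total += math.log(p)
    return total

# Hypothetical toy model: ignores the history and spreads probability
# evenly over a four-word vocabulary.
VOCAB = ["the", "cat", "sat", "mat"]

def uniform_cond_prob(token, history):
    return 1.0 / len(VOCAB) if token in VOCAB else 0.0

log_p = sequence_log_prob(["the", "cat", "sat"], uniform_cond_prob)
seq_prob = math.exp(log_p)            # (1/4) * (1/4) * (1/4)
print(seq_prob)
```

Working in log space is a common design choice here: multiplying many small probabilities directly tends to underflow for long sequences, while adding logs does not.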

Core Definition

A language model is a probability distribution over sequences of text. At each step it estimates P(next token | all preceding tokens). Generation works by sampling from that distribution repeatedly — which is why outputs vary even for the same prompt.
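A minimal Python sketch of a single sampling step, reusing the illustrative probabilities from the "The cat sat on the" example. The "<other>" entry is an assumption added here to hold the probability mass of every word the example does not list, and sample_next_token is a hypothetical helper name. The random generator is deliberately left unseeded so that repeated runs give different draws.

```python
import random

# Probabilities from the "The cat sat on the" example.
next_token_probs = {"mat": 0.18, "floor": 0.12, "roof": 0.07, "elephant": 0.0001}
# Assumption: lump all unlisted words into one bucket so the total is 1.
next_token_probs["<other>"] = 1.0 - sum(next_token_probs.values())

def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random()  # unseeded: each run can produce different samples
samples = [sample_next_token(next_token_probs, rng) for _ in range(10)]
print(samples)
```

Running this several times should show "mat" more often than "roof", and "elephant" almost never, which is the same mechanism that makes a model's answers vary for the same prompt.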

This probabilistic framing has a crucial implication: language models do not look up facts in a database. They do not retrieve stored sentences. They generate each word by sampling from a learned probability distribution. When a model produces fluent, accurate text, it is because that text was statistically likely given its training data. When it produces plausible-sounding nonsense, the same mechanism is at work — the model is generating what is statistically likely, not what is true.

From Simple to Modern: A Brief History

The idea of modeling language statistically is not new. The field traces back to Claude Shannon's 1948 work on information theory, where he modeled English text as a probabilistic process.

N-gram models (1980s-2000s): The simplest practical approach was to count how often each word follows a given sequence of n-1 words. A bigram model uses only the immediately preceding word; a trigram model uses the two preceding words. For example, if you had seen "the cat sat" a thousand times in training text and "the cat sat on" eight hundred of those times, you would estimate P("on" | "the cat sat") ≈ 0.8. These models worked reasonably well for short predictions, and autocorrect on early phones used them. But they had a fundamental limit: they could not handle long-range dependencies. A word at the beginning of a paragraph lies outside the n-1 word window, so it could not influence predictions at the end.

Neural network language models (2000s-2010s): Replacing the count tables with neural networks allowed models to generalize across similar words and capture somewhat longer context. But training was slow and context windows remained limited.

Transformers and large language models (2017-present): The Transformer architecture, introduced in 2017, removed the fixed-context limitation by using attention mechanisms that can, in principle, relate any word to any other in the input. Trained at scale (billions of parameters, trillions of tokens), these models power the AI assistants you interact with today. You will study the Transformer in Lesson 3.
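The n-gram counting estimate and its context limit described above can be sketched in a few lines of Python. The function names ngram_estimate and ngram_context are hypothetical, and the two example sentences are made up for illustration; only the counts 1000 and 800 come from the text.

```python
def ngram_estimate(context_count, continuation_count):
    """Estimate P(next word | context) as a ratio of observed counts."""
    return continuation_count / context_count

def ngram_context(words, n):
    """Return the last n-1 words, which is all an n-gram model conditions on."""
    return tuple(words[-(n - 1):]) if n > 1 else ()

# "the cat sat" seen 1000 times, followed by "on" in 800 of them.
p_on = ngram_estimate(1000, 800)
print(p_on)

# Two made-up sentences that differ only at the start.
a = "the storm passed and the cat sat".split()
b = "the sun rose and the cat sat".split()

# A trigram model sees the same two-word context in both cases,
# so it must make the same prediction for the next word.
same_context = ngram_context(a, 3) == ngram_context(b, 3)
print(ngram_context(a, 3), same_context)
```

Whatever appears before the last n-1 words is simply discarded, which is why an n-gram model cannot use information from earlier in a paragraph.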

Match each language model concept to its correct description.

Terms

Language model
N-gram model
Conditional probability
Token
Transformer

Definitions

The neural architecture behind modern large language models, using attention mechanisms
A system that assigns probabilities to sequences of text
The unit of text a model processes, often a word fragment rather than a full word
The likelihood of an event given that another event has already occurred
Predicts the next word by counting how often word sequences appeared in training data

Why This Matters for the Rest of the Module

Every later lesson (tokenization, attention, pretraining, hallucination) makes most sense when you keep this core idea in mind: an LLM does one thing, very fast and at enormous scale. It predicts what text comes next. Both the capabilities and the failures follow from this.

A language model is best described as:

Why does an n-gram model struggle with long-range dependencies?

Build a Tiny Bigram Model

You will construct a minimal language model by hand to see exactly how statistical prediction works.

Step 1: Take the following short corpus (two sentences):
  'the dog chased the cat'
  'the cat climbed the tree'
Step 2: Count all bigrams (adjacent word pairs) within each sentence. The counts are: (the, dog)=1, (dog, chased)=1, (chased, the)=1, (the, cat)=2, (cat, climbed)=1, (climbed, the)=1, (the, tree)=1.
Step 3: Compute probabilities. After the word 'the', you have seen: dog (1), cat (2), tree (1), for a total of 4. So P(cat|the) = 2/4 = 0.5, P(dog|the) = 1/4 = 0.25, P(tree|the) = 1/4 = 0.25.
Step 4: Generate a sentence. Start with 'the'. Sample the next word from your probabilities. Continue generating word by word until you reach a word with no recorded successor.
Step 5: Discuss: what happens if you add a third sentence to the corpus? How would the probabilities change? What kinds of sentences can your model never generate?
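If you want to check your hand-built model, the steps above can be sketched in Python roughly as follows. This is one possible implementation, not the only one: the names counts, next_word_probs, and generate are hypothetical, and the max_words cap is an assumption added so that generation always stops even if sampling keeps looping through 'the'.

```python
import random
from collections import Counter, defaultdict

# Step 1: the two-sentence corpus from the exercise.
corpus = ["the dog chased the cat", "the cat climbed the tree"]

# Step 2: count bigrams within each sentence.
counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

# Step 3: turn the counts after a word into probabilities.
def next_word_probs(prev):
    successors = counts.get(prev)
    if not successors:
        return {}  # no recorded successor
    total = sum(successors.values())
    return {word: c / total for word, c in successors.items()}

# Step 4: sample word by word until a word has no recorded successor.
def generate(start, rng, max_words=20):
    words = [start]
    while len(words) < max_words:  # assumed safety cap, not part of the exercise
        probs = next_word_probs(words[-1])
        if not probs:
            break
        candidates = list(probs)
        weights = [probs[w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)

probs_after_the = next_word_probs("the")
generated = generate("the", random.Random())
print(probs_after_the)
print(generated)
```

For Step 5, try appending a third sentence to corpus and rerunning. Note that every adjacent word pair in the generated text already appears somewhere in the corpus, so the model can never produce a pair it has not seen.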