Language and Reasoning
Language is the capability that defines the current era of AI. Frontier language models generate fluent prose, follow intricate instructions, argue positions, write poetry, summarize long documents, and carry on coherent dialogue across hundreds of turns. But language ability and reasoning ability are not the same thing — and understanding precisely where one ends and the other begins is one of the most active and contested questions in AI research. This lesson draws the distinction carefully, examines what the evidence says, and prepares you to evaluate claims about AI reasoning with rigor.
The Transformer and Next-Token Prediction
Modern large language models (LLMs) are built on the transformer architecture, introduced by Vaswani et al. in 2017. The transformer's core innovation is the self-attention mechanism, which allows each position in a sequence to attend to, and be influenced by, other positions in the sequence. In the decoder-style models used for text generation, each position attends to itself and every earlier position. This lets the model capture long-range dependencies in text: the word 'it' in a sentence can be resolved to an antecedent fifty words earlier.
During training, the model learns to predict the next token given all preceding tokens in a sequence. A token is roughly a word or word fragment: the word 'unbelievable' might be tokenized as 'un', 'believ', 'able'. This objective, called causal language modeling or next-token prediction, is self-supervised: no human labels are required, because the training signal comes from the text itself. A model trained this way on trillions of tokens from the web, books, and code implicitly learns grammar, facts, reasoning patterns, argument structures, and stylistic conventions. None of these were explicitly taught; they are the regularities in the data that help predict the next token accurately.
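To make the attention idea concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python. It is an illustration under simplifying assumptions, not how any production model is implemented: real transformers use learned projection matrices, many heads, and tensor libraries, and the function names and toy input below are invented for this example.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may only attend to positions 0..i, which matches the
    next-token prediction setup where later tokens are not yet visible.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the current query against every visible key (earlier or same position).
        scores = [
            sum(qx * kx for qx, kx in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three positions, two dimensions each (values are made up).
# Assumption: queries, keys, and values are the raw inputs, with no learned projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

Each row of `weights` shows how much one position draws on the positions it can see. The first row has a single entry because the first token has nothing earlier to attend to.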
One way to think about language model training: the model is learning to compress the statistical patterns of human-produced text into billions of floating-point parameters. To predict text well, it must capture not just word co-occurrence but the underlying structure of ideas, arguments, and facts. Prediction, at sufficient scale and depth, may require something resembling understanding.
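The link between prediction and compression can be sketched with a toy model. The example below trains a bigram counter (a far simpler predictor than a transformer, swapped in here purely for illustration) on a made-up sentence and measures the average number of bits needed to encode each next token. The corpus, function names, and add-one smoothing are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1  # the "label" is simply the next token in the text
    return counts

def bits_per_token(counts, tokens, vocab_size):
    """Average bits needed to encode each next token (cross-entropy in base 2)."""
    pairs = list(zip(tokens, tokens[1:]))
    total_bits = 0.0
    for prev, nxt in pairs:
        seen = counts[prev]
        # Add-one smoothing so unseen continuations keep a nonzero probability.
        prob = (seen[nxt] + 1) / (sum(seen.values()) + vocab_size)
        total_bits += -math.log2(prob)
    return total_bits / len(pairs)

# Made-up toy corpus; the model is scored on the same text it was trained on.
text = "the cat sat on the mat and the cat sat on the rug".split()
vocab = sorted(set(text))
model = train_bigram(text)

uniform_bits = math.log2(len(vocab))                     # cost with no learned pattern
learned_bits = bits_per_token(model, text, len(vocab))   # cost after learning bigrams
```

Here `learned_bits` comes out lower than `uniform_bits`: capturing even simple regularities lets the model encode the text in fewer bits. Language model training pushes the same quantity down at vastly larger scale.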
What Frontier Language Models Can Do
The range of language tasks frontier models handle is broad. They translate between over 100 language pairs, often approaching human quality on widely spoken languages and performing less reliably on low-resource ones. They summarize documents ranging from legal contracts to scientific papers, preserving key claims and discarding filler. They follow complex, multi-part instructions, such as 'write a persuasive essay arguing position X, in the style of Y, addressed to audience Z, under 500 words, with three concrete examples', with notable fidelity. They carry on multi-turn dialogue in which the model tracks context across many exchanges, remembers commitments made early in the conversation, and maintains a consistent persona.
Frontier models also show strong performance on formal language tasks: translating natural-language questions into SQL queries, converting requirements documents into structured data schemas, and interpreting regular expressions and explaining what patterns they match. These tasks require precision, since small errors produce wrong answers, and frontier models perform them with accuracy that would have seemed remarkable five years ago.
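The point about precision can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module and an invented `orders` table to show how two nearly identical SQL translations of the same question return different answers. The table, the data, and both queries are hypothetical and were written by hand, not produced by a model.

```python
import sqlite3

# In-memory database with a made-up table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (total) VALUES (?)",
    [(50.0,), (100.0,), (150.0,), (200.0,)],
)

# Question: "How many orders have a total of at least 100?"
correct_sql = "SELECT COUNT(*) FROM orders WHERE total >= 100"
near_miss_sql = "SELECT COUNT(*) FROM orders WHERE total > 100"  # drops the boundary case

correct_count = conn.execute(correct_sql).fetchone()[0]      # counts 100, 150, 200
near_miss_count = conn.execute(near_miss_sql).fetchone()[0]  # counts only 150, 200
conn.close()
```

The two queries differ by a single character, yet only one answers the question as asked. That is why generated queries are worth checking against known cases before they are trusted.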
The Reasoning Question
The question of whether frontier models genuinely reason, or whether they are very sophisticated pattern-matchers, is both philosophically deep and practically important. Here is what the evidence shows. On formal reasoning benchmarks such as GSM8K (grade-school math word problems), MATH (competition-level mathematics), and BIG-Bench Hard (a set of tasks selected to be difficult for language models), frontier models have improved dramatically. GPT-4 achieved over 90% on GSM8K, whose problems require multi-step arithmetic reasoning stated in natural language. On MATH, the best models scored in the single digits when the benchmark was released in 2021; frontier models now score above 70%.
Chain-of-thought prompting significantly boosts performance: when models are asked to write out their reasoning step by step before giving an answer, accuracy on complex problems improves substantially. This matters because it suggests the model is doing something more than surface pattern-matching: the intermediate steps appear to contribute, at least in part, to the final answer.
At the same time, honest accounting reveals limits. Models make systematic errors on problems that require strict logical rigor. They can be tripped up by subtle scope shifts, by negation, and by problems that superficially resemble training examples but require genuinely novel reasoning. They also hallucinate confidently: they produce plausible-sounding but false factual claims, misattribute quotes, and fill gaps in their knowledge with invented details. These failures suggest that language fluency and reliable reasoning are not the same thing, even when they frequently co-occur.
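Benchmark scores like those above usually come down to checking each response's final answer against a reference. The sketch below shows one crude way to do that for numeric answers. The answer-extraction rule (take the last number in the response) is an assumption made for this example, and the responses are hand-written illustrations, not real model outputs or benchmark data.

```python
import re

def extract_final_number(response):
    """Return the last number mentioned in a response, or None if there is none.

    Crude by design: commas are stripped so '1,000' parses as 1000, and any
    number appearing after the intended answer would be picked up instead.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None

def accuracy(responses, expected):
    """Fraction of responses whose extracted final number equals the reference."""
    correct = sum(
        1 for r, e in zip(responses, expected)
        if extract_final_number(r) == e
    )
    return correct / len(expected)

# Hand-written, hypothetical responses to two made-up word problems.
direct = ["18", "The answer is 40."]
step_by_step = [
    "She starts with 16 eggs and uses 3 + 4 = 7, leaving 9. 9 * 2 = 18. Answer: 18",
    "Half of 60 is 30. 30 + 15 = 45. Answer: 45",
]
expected = [18.0, 45.0]

direct_accuracy = accuracy(direct, expected)
step_by_step_accuracy = accuracy(step_by_step, expected)
```

A scorer like this checks only the final number, so a response can reach the right answer through flawed steps and still be marked correct. That is one reason benchmark accuracy alone does not settle the reasoning question.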
A model can write a grammatically perfect, stylistically polished, logically structured paragraph that contains multiple factual errors. Fluency — smooth, coherent, confident-sounding language — gives no guarantee of accuracy. This is one of the most practically dangerous properties of frontier language models, because human readers instinctively trust fluent text more than halting text, regardless of its truth value.
A student claims: 'Language models cannot reason — they just predict the next word, so any apparent reasoning is fake.' What is the most accurate assessment of this claim?
Which of the following best explains why chain-of-thought prompting improves language model performance on multi-step math problems?
Probe the Limits of Language Model Reasoning
- This activity requires access to any publicly available frontier language model (GPT-4o, Claude, Gemini, or similar).
- Round 1 — Basic reasoning: Ask the model a multi-step math word problem. Record its answer and whether it shows its work.
- Round 2 — Chain-of-thought: Ask the same problem again, but add the instruction 'think step by step before answering.' Compare the result.
- Round 3 — Adversarial probe: Design a problem that superficially resembles a common type (e.g., a rate problem) but has a subtle twist that changes the correct approach. Does the model catch the twist?
- Round 4 — Hallucination test: Ask the model a specific factual question in your area of expertise — something you know the precise answer to. Does the model answer correctly, or does it confidently provide a wrong answer?
- Write up your findings: In what ways did the model behave as a reasoner? In what ways did it behave as a pattern-matcher? What does this suggest about how to use these tools responsibly?
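If you prefer to keep your Round 1 and Round 2 results organized in code, the following sketch builds both prompts from one problem and records the replies side by side. Everything here is hypothetical scaffolding: `ask_model` stands for whatever way you reach the model (an API client or manual copy and paste), and `placeholder_model` exists only so the example runs offline. The sample problem is invented.

```python
COT_SUFFIX = "Think step by step before answering."

def build_prompts(problem):
    """Return the Round 1 and Round 2 prompts for the same problem."""
    return {
        "round_1_direct": problem,
        "round_2_chain_of_thought": f"{problem}\n\n{COT_SUFFIX}",
    }

def run_rounds(problem, ask_model):
    """Send both prompts through ask_model and collect a comparison log.

    ask_model is a hypothetical callable (prompt string in, reply string out);
    replace it with however you reach the model.
    """
    log = []
    for label, prompt in build_prompts(problem).items():
        log.append({
            "round": label,
            "prompt": prompt,
            "response": ask_model(prompt),
            "correct": None,      # fill in by hand after checking the answer
            "shows_work": None,   # fill in by hand after reading the reply
        })
    return log

def placeholder_model(prompt):
    """Stand-in so the sketch runs without network access."""
    return "(paste the model's reply here)"

# Made-up sample problem for illustration.
problem = "A train leaves at 2:15 pm and travels 150 km at 60 km/h. When does it arrive?"
log = run_rounds(problem, placeholder_model)
```

The `correct` and `shows_work` fields are left empty on purpose: judging them yourself is the point of the activity.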