Frontier & Future AI

⏱ About 20 min · 20 XP

Reasoning Models and Test-Time Compute

For most of the history of machine learning, compute was something you spent mainly during training. Once a model was trained, inference (generating an answer) was comparatively fast and cheap. The model would receive a prompt and produce a response token by token, running one forward pass through its billions of parameters for each generated token, typically finishing in seconds. The cost depended on the length of the response, not on how hard the question was: a simple greeting and a PhD-level mathematics problem cost roughly the same compute per token to answer.

In 2024, a new class of models changed this picture. Reasoning models, exemplified by OpenAI's o1 and o3 and by Anthropic's Claude models with extended thinking, deliberately spend more compute at inference time on harder problems. They think before they answer, and the harder the problem, the longer they tend to think. This approach, called test-time compute scaling, is one of the most significant shifts in frontier AI in recent years.

What Test-Time Compute Means

Test time refers to the moment when a trained model is used: you give it a prompt and it generates a response. Training time is when the model learns from data; test time is when it applies what it learned. For a standard model, the computation at test time is one straightforward generation: the prompt goes in and the answer comes out directly, with no additional deliberation.

Test-time compute scaling means allocating additional computational resources at inference. There are several mechanisms for doing this. The simplest is generating multiple candidate answers and selecting the best one, a technique called best-of-N sampling. If you generate 64 answers to a math problem and pick the one that passes the most consistency checks, answer quality improves even though the underlying model is unchanged.

More powerful is the approach used by reasoning models: a trained internal deliberation process. The model generates an extended chain of thought, sometimes hundreds or thousands of tokens of working, before producing a final answer. This working is often not shown to the user; it is the model's internal scratchpad. The key insight is that the model was trained to use this scratchpad effectively, via reinforcement learning on problems where the correct answer can be verified. By rewarding the model when its deliberation leads to correct answers, training teaches it to use the scratchpad for genuinely useful computation rather than filler text.
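Best-of-N sampling as described above can be sketched in a few lines. This is a toy illustration, not how any production system works: `noisy_solver` is a hypothetical stand-in for sampling one answer from a model, and `consistency_score` is a hypothetical stand-in for whatever verifier or scoring criterion is available.

```python
import random
from typing import List, Tuple


def noisy_solver(a: int, b: int, rng: random.Random) -> int:
    """Toy stand-in for one sampled model answer to 'a * b' (assumed error pattern)."""
    error = rng.choice([0, 0, 0, -1, 1, 10])  # right about half the time
    return a * b + error


def consistency_score(a: int, b: int, answer: int) -> int:
    """Count how many cheap arithmetic checks a candidate answer passes."""
    checks = [
        answer % 10 == ((a % 10) * (b % 10)) % 10,  # last digit agrees
        answer % 9 == ((a % 9) * (b % 9)) % 9,      # casting out nines
        answer % a == 0,                            # divisible by the first factor
    ]
    return sum(checks)


def best_of_n(a: int, b: int, n: int, rng: random.Random) -> Tuple[int, List[int]]:
    """Sample n candidates and return the one passing the most checks."""
    candidates = [noisy_solver(a, b, rng) for _ in range(n)]
    best = max(candidates, key=lambda ans: consistency_score(a, b, ans))
    return best, candidates


if __name__ == "__main__":
    rng = random.Random(0)
    best, candidates = best_of_n(37, 43, 64, rng)
    print("distinct candidates:", sorted(set(candidates)))
    print("selected answer:", best)
```

The sampler itself never changes; all of the improvement comes from drawing more candidates and filtering them. That is also the limitation: the result is only as good as the scoring criterion.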

Why Deliberation Works

Human experts do not answer hard problems the same way they answer easy ones — they slow down, check their work, consider alternatives, and backtrack when something seems wrong. Reasoning models have learned to do something analogous: allocate more internal processing to problems that require it. The result is that test-time compute scales performance on hard problems in a way that simply making the model larger does not always achieve.

The o1 and Extended Thinking Paradigm

OpenAI's o1 model, released in September 2024, demonstrated that training a model to think before answering, via reinforcement learning on verifiable problems, produced dramatic improvements on hard reasoning benchmarks. On the American Invitational Mathematics Examination (AIME), a competition-level exam where GPT-4o scored around 12%, o1 scored 74%. On a PhD-level science benchmark called GPQA Diamond, o1 exceeded the average performance of domain experts. These are not incremental improvements; they represent qualitatively different performance on tasks that were previously considered very difficult for AI.

The key to o1's design is that its extended reasoning is trained, not just prompted. Earlier chain-of-thought prompting asked the model to show its work; o1 was reinforcement-trained to generate reasoning traces that actually lead to correct answers on problems where ground truth can be verified, such as math, coding, and formal logic. This is a subtle but crucial distinction: the model learned that certain patterns of internal deliberation improve its accuracy, because it was explicitly rewarded for correct answers rather than for producing plausible-sounding reasoning.

Anthropic's extended thinking mode works on a similar principle: Claude is given a reasoning budget and uses it to deliberate (exploring solution paths, checking consistency, and revising conclusions) before delivering its final answer. The visible thinking output is generated before the answer and shapes it, although how faithfully it reflects the model's underlying computation remains an open question.
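The difference between rewarding correct answers and rewarding plausible-sounding reasoning can be made concrete with an outcome-based reward function. The sketch below is an assumption-laden illustration, not a description of any lab's actual training setup: the `ANSWER:` output convention and the function names are hypothetical.

```python
import re
from typing import Optional

# Hypothetical convention: the completion ends with "ANSWER: <integer>".
ANSWER_PATTERN = re.compile(r"ANSWER:\s*(-?\d+)\s*$")


def extract_final_answer(completion: str) -> Optional[int]:
    """Pull the final integer answer out of a completion, if one is present."""
    match = ANSWER_PATTERN.search(completion.strip())
    return int(match.group(1)) if match else None


def verifiable_reward(completion: str, ground_truth: int) -> float:
    """Reward 1.0 only when the final answer matches the known ground truth.

    The reasoning text is deliberately ignored: only the checkable outcome counts.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer == ground_truth else 0.0


if __name__ == "__main__":
    careful = "12 * 12 is 12 * 10 + 12 * 2 = 120 + 24. ANSWER: 144"
    confident_but_wrong = "This is obviously a well-known square. ANSWER: 150"
    print(verifiable_reward(careful, 144))
    print(verifiable_reward(confident_but_wrong, 144))
```

Because the reward depends only on the checked outcome, fluent but wrong reasoning earns nothing. This is also why the approach fits math and code better than tasks with no ground truth to check against.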

Fill in the blanks to complete the key description of reasoning model design.

Test-time compute scaling means spending more _____ during _____, rather than only during training. Reasoning models are trained using _____ learning on problems with verifiable answers, so the model learns to generate deliberation that genuinely improves _____.

Trade-offs and Limitations

Test-time compute scaling is not a free lunch. The most obvious cost is latency and expense: a model that thinks for thirty seconds before answering uses substantially more compute than one that answers in two seconds. For interactive applications where users expect fast responses, this is a real constraint. For high-stakes tasks, such as medical diagnosis support, legal reasoning, and safety-critical engineering calculations, the latency cost is often worth paying.

A subtler concern is faithfulness: does the visible chain of thought actually represent what the model is doing, or is it a post-hoc narrative? Research suggests the chain of thought is often causally connected to the output, since perturbing it can change the answer, but the model may also produce plausible-sounding reasoning that obscures the true computational path. Interpretability in reasoning models remains an open problem.

Finally, test-time compute helps most on problems with verifiable ground truth (math, code, logic puzzles), where reinforcement learning can assign clear rewards. On problems requiring subjective judgment, creative synthesis, or ethical reasoning, the benefit is harder to measure and the training signal is noisier.
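The latency and expense point can be put in rough numbers. The sketch below uses the common rule of thumb that a dense transformer spends about 2 FLOPs per parameter for each generated token; it ignores prompt processing and attention over long contexts. The parameter count and token counts are hypothetical values chosen only for illustration.

```python
def generation_flops(num_params: float, tokens_generated: int) -> float:
    """Approximate FLOPs to generate tokens: ~2 FLOPs per parameter per token."""
    return 2.0 * num_params * tokens_generated


if __name__ == "__main__":
    num_params = 70e9          # hypothetical model size
    answer_tokens = 200        # hypothetical length of the final answer
    reasoning_tokens = 5000    # hypothetical length of the hidden scratchpad

    direct = generation_flops(num_params, answer_tokens)
    with_thinking = generation_flops(num_params, answer_tokens + reasoning_tokens)

    print(f"direct answer:  {direct:.2e} FLOPs")
    print(f"with reasoning: {with_thinking:.2e} FLOPs")
    print(f"ratio:          {with_thinking / direct:.1f}x")
```

Under these assumptions the cost ratio is simply the ratio of tokens generated, so a scratchpad 25 times longer than the answer makes the response about 26 times more expensive.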

Reasoning Is Not Omniscience

Even reasoning models that spend many seconds deliberating make mistakes, especially outside the domains they were reinforcement-trained on. A model that scores 74% on competition mathematics still misses 26% of problems. Extended thinking makes models better, not infallible. Always verify high-stakes outputs independently.

Match each concept to the accurate description of it in the context of reasoning models.

Terms

Test-time compute
Best-of-N sampling
Reinforcement learning on verifiable tasks
Chain-of-thought prompting
Faithfulness of reasoning traces

Definitions

Training the model by rewarding it when its reasoning leads to verifiably correct answers
Generating multiple candidate answers and selecting the best according to a scoring criterion
Instructing a model to write out reasoning steps before its final answer, without specialized training
Computation spent during inference, after training, to improve answer quality
Whether the visible reasoning output actually reflects the model's internal computation path


On a competition mathematics exam, o1 scored 74% while GPT-4o scored 12%. What is the most accurate explanation for this difference?

Why does test-time compute scaling have greater impact on mathematics and coding than on open-ended creative writing?

Test-Time Compute in Action

  1. This activity compares standard and extended-reasoning model responses on the same hard problem.
  2. If you have access to a reasoning model (Claude with extended thinking, o1, or similar), complete all three parts. Otherwise, complete Part A only.
  3. Part A: Design the experiment. Choose a problem that requires multi-step reasoning: a competition math problem, a logic puzzle with several constraints, or a complex coding challenge. Write down what you expect a standard model to get wrong and why.
  4. Part B: Run the comparison. Give the problem to a standard model and a reasoning model. For each, record the answer and whether intermediate steps are visible. Then compare: did the reasoning model's deliberation reveal any steps the standard model skipped? Did extended deliberation lead to a different, and better, answer?
  5. Part C: Analyze trade-offs. How long did the reasoning model take compared to the standard model? For what kinds of tasks would you accept that latency cost, and for what kinds would you not? Write two paragraphs defending your position.
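If you want to record Parts B and C systematically, a small timing harness is enough. The sketch below is one possible way to do it: `standard_model` and `reasoning_model` are hypothetical placeholders that return canned strings, and you would replace their bodies with calls to whichever models you actually have access to.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialResult:
    model_name: str
    answer: str
    seconds: float
    correct: bool


def run_trial(model_name: str, ask: Callable[[str], str],
              problem: str, expected: str) -> TrialResult:
    """Ask one model the problem, timing the call and checking the answer."""
    start = time.perf_counter()
    answer = ask(problem)
    elapsed = time.perf_counter() - start
    return TrialResult(model_name, answer, elapsed, answer.strip() == expected)


def standard_model(problem: str) -> str:
    """Placeholder: replace with a real call to a standard model."""
    return "150"


def reasoning_model(problem: str) -> str:
    """Placeholder: replace with a real call to a reasoning model."""
    return "144"


if __name__ == "__main__":
    problem = "What is 12 * 12? Reply with the number only."
    for name, ask in [("standard", standard_model), ("reasoning", reasoning_model)]:
        result = run_trial(name, ask, problem, expected="144")
        print(f"{result.model_name}: answer={result.answer!r} "
              f"correct={result.correct} time={result.seconds:.3f}s")
```

Exact string matching only works for problems with a single short answer, which is one more reason to choose a problem with verifiable ground truth in Part A.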