Compute and Hardware
If you want to understand why frontier AI develops the way it does — why certain organizations lead, why progress follows the trajectory it does, why geopolitics has entered AI policy — you must understand compute. Compute is not a technical detail. It is the fundamental scarce resource around which the entire frontier AI ecosystem is organized.
Why GPUs Dominate AI Training
A central processing unit (CPU) is designed for sequential, general-purpose computation. It has a small number of powerful cores, typically 8 to 64, optimized for executing complex instructions one after another quickly. A graphics processing unit (GPU), by contrast, has thousands of simpler cores designed to perform the same operation on many data points simultaneously. This style of execution is called SIMD: single instruction, multiple data. (NVIDIA describes its own variant as SIMT: single instruction, multiple threads.)
The core operation in training a neural network is matrix multiplication: multiplying together enormous rectangular arrays of numbers. This operation is embarrassingly parallel. Each element of the output depends only on one row of the first input and one column of the second, so millions of multiply-and-add operations can be performed simultaneously. GPUs are extraordinarily good at this. A modern AI-accelerator GPU like NVIDIA's H100 can perform on the order of 1,000 to 2,000 trillion floating-point operations per second (1 to 2 petaFLOPS) at the low-precision data types common in AI training, with the exact figure depending on the precision used.
NVIDIA dominates the AI accelerator market as of the mid-2020s, but competitors are emerging. Google has developed custom chips called Tensor Processing Units (TPUs), used extensively in Google's own training runs. AMD produces GPU alternatives. A wave of AI-specific chip startups, including Cerebras, Graphcore, and Groq, has designed chips with different architectural tradeoffs. Despite this competition, NVIDIA's CUDA software ecosystem gives it enormous staying power: the vast majority of AI software is written for CUDA and runs on NVIDIA hardware.
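To make the "embarrassingly parallel" claim concrete, here is a minimal Python sketch that computes every output element of a matrix product as its own independent task. The function names `output_cell` and `matmul_parallel` are illustrative, not part of any library. Python threads do not provide a real speedup here; the point is only that no output cell needs the result of any other cell, which is the property a GPU exploits in hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def output_cell(a, b, i, j):
    """Compute one output element: the dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul_parallel(a, b, workers=4):
    """Multiply a (m x k) by b (k x n), treating every output cell as an independent task."""
    m, n = len(a), len(b[0])
    cells = [(i, j) for i in range(m) for j in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `cells`.
        values = list(pool.map(lambda ij: output_cell(a, b, ij[0], ij[1]), cells))
    # Reassemble the flat list of results into m rows of n values.
    return [values[r * n:(r + 1) * n] for r in range(m)]


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_parallel(a, b))  # [[58, 64], [139, 154]]
```

Because the tasks share no intermediate state, they can be scheduled in any order or all at once without changing the result.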
Raw FLOPS (floating-point operations per second) capture only part of what makes an AI chip fast. Memory bandwidth, the rate at which data can be moved between the chip's memory and its compute cores, is equally critical. Many operations in LLM training are bottlenecked by memory bandwidth rather than raw arithmetic speed, because the compute cores sit idle while they wait for data. This is why chip designers spend enormous effort on stacked high-bandwidth memory (HBM).
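One common way to reason about this tradeoff is a roofline-style estimate: compare how many operations a kernel performs per byte it moves against the ratio of the chip's peak arithmetic rate to its memory bandwidth. The sketch below is a simplified illustration. `PEAK_FLOPS` and `MEM_BANDWIDTH` are assumed round numbers rather than figures for any specific chip, and the model assumes each matrix is read or written exactly once.

```python
PEAK_FLOPS = 1.0e15      # assumed peak arithmetic rate, FLOP/s (illustrative)
MEM_BANDWIDTH = 3.0e12   # assumed memory bandwidth, bytes/s (illustrative)


def matmul_intensity(m, k, n, bytes_per_value=2):
    """FLOPs per byte moved for an (m x k) @ (k x n) product.

    Assumes each of the three matrices crosses the memory bus exactly once
    and that values are stored in 2 bytes (a 16-bit format) by default.
    """
    flops = 2 * m * k * n  # one multiply and one add per term
    bytes_moved = bytes_per_value * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Roofline bound: limited by arithmetic or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * mem_bandwidth)


for shape in [(8192, 8192, 8192), (1, 8192, 8192)]:
    intensity = matmul_intensity(*shape)
    bound = attainable_flops(intensity)
    kind = "compute-bound" if bound == PEAK_FLOPS else "memory-bound"
    print(shape, f"{intensity:.1f} FLOPs/byte", kind)
```

Under these assumptions, a large square matrix product reuses each value many times and hits the arithmetic ceiling, while a matrix-vector product performs about one operation per byte and is limited by memory bandwidth.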
From a Single GPU to a Training Cluster
A single GPU, no matter how powerful, cannot hold a frontier model's parameters in memory or train it in a reasonable time. Frontier training runs use thousands to tens of thousands of GPUs working together. Coordinating them introduces its own engineering challenges. The GPUs must be connected with extremely high-bandwidth, low-latency interconnects so that gradients (the error signals used to update model parameters) can be synchronized across all of them. Within a single server, NVIDIA's NVLink provides fast GPU-to-GPU communication. Between servers in a data center, InfiniBand networking typically provides the inter-node bandwidth. It is far faster than standard Ethernet but also far more expensive and complex to configure.
Three forms of parallelism are used to distribute a training run across thousands of GPUs. Data parallelism splits the training batch across GPUs: each GPU processes a different slice, and the resulting gradients are averaged across all of them. Tensor parallelism splits individual matrix operations across GPUs, which requires very fast communication between them. Pipeline parallelism splits the layers of the model across different groups of GPUs, passing activations forward and gradients backward through the pipeline. Real frontier training runs combine all three, a technique called 3D parallelism.
Orchestrating this requires specialized software (such as NVIDIA's Megatron-LM framework or PyTorch's Fully Sharded Data Parallel, FSDP, developed at Meta) and engineers who deeply understand both the hardware and the distributed-systems problems. A misconfiguration that causes one GPU to sit idle while others wait for it can reduce overall efficiency by 30% or more, a catastrophic waste at frontier scale.
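As a small illustration of data parallelism, the sketch below simulates several workers on a toy one-parameter model. Each simulated worker computes a gradient on its own shard of the batch, and the shard gradients are then averaged, which stands in for the all-reduce step a real cluster performs over its interconnect. The model, the data, and the function names are all hypothetical, and equal-sized shards are assumed so that the averaged result matches the full-batch gradient.

```python
def gradient(w, batch):
    """Mean-squared-error gradient for the toy model y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers):
    """Split the batch into equal shards, compute per-shard gradients, then average."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # Each entry is what one simulated GPU would compute locally.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging the local gradients stands in for an all-reduce across GPUs.
    return sum(local_grads) / num_workers


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 1.5
full = gradient(w, batch)
sharded = data_parallel_gradient(w, batch, num_workers=2)
print(abs(full - sharded) < 1e-9)  # the two gradients agree up to rounding
```

Tensor and pipeline parallelism split the model itself rather than the batch, so they do not reduce to a simple average like this one.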
Data Centers and Power
A frontier training cluster does not live in a server room. It lives in a purpose-built hyperscale data center. A cluster of 10,000 H100 GPUs draws on the order of 15 to 20 megawatts once host servers, networking, and cooling are counted (each H100 alone is rated at up to 700 watts), comparable to the demand of a small town. The heat generated must be removed by industrial cooling systems. The power must be delivered reliably; a two-second power interruption can corrupt an ongoing training run.
Data center location is therefore a strategic decision. Labs seek locations with access to cheap, reliable electricity (often from hydroelectric or nuclear sources), cool climates (to reduce cooling costs), and fiber connectivity. Microsoft's investment in nuclear power for AI data centers and the construction of large GPU clusters in regions with cheap renewable electricity reflect this logic.
The physical scarcity of compute became geopolitically visible in 2022 and 2023, when the United States government imposed controls on exports of advanced AI chips to China. Because NVIDIA's H100 and its successors are manufactured using equipment from a small number of specialized suppliers, primarily ASML in the Netherlands, the supply chain for frontier compute is highly concentrated. Export controls can therefore meaningfully slow a nation's ability to build frontier AI systems.
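The facility-level power draw of a training cluster, discussed earlier in this section, can be approximated with simple arithmetic. The sketch below is a back-of-envelope model, not a measurement. The 700-watt figure is the rated maximum for an H100 SXM module, while `server_overhead` (CPUs, memory, fans, and networking per GPU) and `pue` (power usage effectiveness, covering cooling and power delivery) are assumed values that vary widely between facilities.

```python
def cluster_power_mw(num_gpus, gpu_watts=700.0, server_overhead=1.8, pue=1.3):
    """Estimate total facility power in megawatts for a GPU cluster.

    gpu_watts:       rated power per GPU (700 W for an H100 SXM module)
    server_overhead: assumed multiplier for the rest of the server and network
    pue:             assumed power usage effectiveness of the data center
    """
    it_watts = num_gpus * gpu_watts * server_overhead  # power drawn by IT equipment
    facility_watts = it_watts * pue                    # add cooling and delivery losses
    return facility_watts / 1e6


for gpus in (10_000, 100_000):
    print(f"{gpus:>7,} GPUs: about {cluster_power_mw(gpus):.1f} MW")
```

Changing either assumed multiplier moves the result substantially, which is one reason published power estimates for the same cluster size differ.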
Because advanced AI chips depend on extreme-ultraviolet lithography machines made by a near-monopoly supplier (ASML), governments can influence the global distribution of AI capability by controlling chip exports. This makes semiconductor policy one of the most consequential AI governance levers available to policymakers today.
Why are GPUs preferred over CPUs for training large neural networks?
A frontier training run achieves only 65% of the theoretical peak FLOPS of its GPU cluster. Which explanation is most technically plausible?
Estimate the Cost of a Frontier Training Run
- Use publicly available figures to build a rough cost estimate for a frontier training run. You will make assumptions — document each one.
- Step 1: Assume a training run uses 10,000 NVIDIA H100 GPUs for 90 days. Cloud providers charge roughly $2 to $4 per GPU-hour for H100s. Calculate the total GPU-hours and the range of total compute cost.
- Step 2: Add a 20% overhead for networking, storage, power, and cooling infrastructure — these are real costs not captured in raw GPU pricing.
- Step 3: Research the publicly reported training costs for GPT-4, Gemini Ultra, or Claude 3 Opus (use any credible public source). How does your estimate compare?
- Step 4: Calculate how many years of a median software engineer's salary ($150,000/year) your estimated compute cost equals. What does this comparison reveal about where the money goes in a frontier lab?
- Step 5: Write two sentences on what this cost structure implies for which organizations can realistically build frontier models.
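One possible way to set up the arithmetic for Steps 1, 2, and 4 is sketched below. Every number comes from the exercise's stated assumptions (10,000 GPUs, 90 days, $2 to $4 per GPU-hour, 20% overhead, $150,000 salary). The function names are illustrative, and the results are rough estimates under those assumptions rather than reported costs for any real model.

```python
def training_cost(num_gpus, days, price_per_gpu_hour, overhead=0.20):
    """Return (gpu_hours, compute_cost, total_cost) in dollars for a training run."""
    gpu_hours = num_gpus * days * 24                 # Step 1: total GPU-hours
    compute_cost = gpu_hours * price_per_gpu_hour    # Step 1: raw compute cost
    total_cost = compute_cost * (1 + overhead)       # Step 2: add infrastructure overhead
    return gpu_hours, compute_cost, total_cost


def engineer_years(total_cost, salary=150_000):
    """Step 4: express a cost as years of one engineer's salary."""
    return total_cost / salary


for price in (2.0, 4.0):
    hours, compute, total = training_cost(10_000, 90, price)
    print(
        f"${price:.0f}/GPU-hour: {hours:,} GPU-hours, "
        f"compute ${compute / 1e6:.1f}M, with overhead ${total / 1e6:.2f}M, "
        f"about {engineer_years(total):,.0f} engineer-years"
    )
```

Changing any single assumption, such as the run length or the hourly price, scales the result linearly, so documenting each assumption matters more than the final figure.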