Machine Learning & Deep Learning


Attention and Transformers

In 2017, the paper 'Attention Is All You Need' discarded recurrence entirely and built a sequence-to-sequence translation model around a mechanism simple enough to describe in a single paragraph. Within two years, Transformer-based models such as BERT, GPT-2, and T5 had set new state-of-the-art results on most major natural language processing benchmarks. Within four years, the same architecture had spread to vision, protein structure prediction, audio, and video. Understanding attention is now foundational to understanding modern AI.

The Attention Mechanism

Attention answers a question: given a position in a sequence, which other positions are most relevant to understanding it? Researchers first added attention as an extension to the encoder-decoder RNN framework: instead of forcing the decoder to rely on a single context vector, the decoder was allowed to 'look back' at all encoder hidden states and compute a weighted average of them. The weights were produced by a learned scoring function and varied at each decoding step. When producing the English word 'cat' in French-to-English translation, the model would attend most strongly to the encoder position of the source word 'chat' and give little weight to irrelevant words.

The Transformer generalizes this into self-attention: every position in the sequence attends to every position in the same sequence, itself included. For each position, the mechanism applies three learned linear projections to the input embedding to produce three vectors: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I provide?). The attention score between position i and position j is the dot product of Query_i and Key_j, scaled by the square root of the key dimension d_k:

    score(i, j) = (Q_i · K_j) / sqrt(d_k)

The scaling keeps magnitudes stable. Without it, dot products grow with d_k and push the softmax into regions where gradients are very small.

All scores for position i are passed through a softmax, producing a probability distribution over all positions. The output for position i is then a weighted average of all Value vectors, where the weights are those probabilities. In matrix form:

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The Transformer uses multi-head attention: it runs h independent attention operations in parallel (each with its own learned Q, K, V projections), concatenates the results, and projects back to the model dimension. Different heads can specialize: one head may learn syntactic relationships, another coreference, another positional proximity.
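The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the weight matrices W_q, W_k, W_v, W_o, and the toy sizes are all assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., n, n)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Run h attention heads in parallel, concatenate, project back."""
    n, d_model = X.shape
    d_head = d_model // h  # assumes d_model is divisible by h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head): one slice of features per head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)
    heads, weights = attention(Q, K, V)                 # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

# Toy example: 5 positions, model dimension 16, 4 heads (hypothetical sizes).
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5): one n-by-n weight matrix per head
```

Each row of each head's weight matrix is a probability distribution over the 5 positions. Splitting one large projection into h slices is a common way to implement the "independent projections per head" described above; it is used here for brevity.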

What Attention Actually Computes

Attention is a learned, content-based retrieval system. Each position broadcasts a Query asking 'what do I need?'; every position broadcasts a Key saying 'here is what I have'; and the softmax over dot products determines how much of each position's Value flows to the querying position. The entire operation is differentiable and learned end-to-end from data.

A full Transformer encoder layer wraps multi-head attention with two additions. First, a residual connection: the input to the attention sublayer is added directly to its output (output = input + Attention(input)). This allows gradients to flow cleanly through many layers, enabling very deep networks. Second, layer normalization stabilizes training by normalizing activations across the feature dimension within each example. A feedforward sublayer follows, also with a residual connection and normalization.

Self-attention has no inherent notion of position: the score between two tokens depends only on their content, not on whether they are adjacent or far apart. Transformers therefore add positional encodings to the input embeddings. The original paper used fixed sinusoidal functions of position; modern models often use learned positional embeddings or relative position encodings.

The Transformer decoder adds a third sublayer: cross-attention, where the Query comes from the decoder's current position and the Keys and Values come from the encoder's output. This is how the decoder 'reads' the encoded source sequence at each generation step.

Three dominant usage patterns emerged. Encoder-only models (BERT) process the full input bidirectionally, with every position attending to every other, and excel at tasks requiring understanding: classification, question answering, named entity recognition. Decoder-only models (GPT series) use causal masking, so each position can attend only to itself and the positions before it, and excel at generation. Encoder-decoder models (T5, BART) handle translation and summarization.
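The pieces described above (sinusoidal positions, residual connections, layer normalization, the feedforward sublayer, and causal masking) can be put together in a short NumPy sketch. Several simplifications are assumptions of this example and not claims about any particular model: it uses a single attention head, omits biases and the learned gain and bias of layer normalization, applies normalization after each residual addition, and uses random weights. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def sinusoidal_positions(n, d_model):
    # Fixed encodings: sine on even feature indices, cosine on odd ones.
    # Assumes d_model is even.
    pos = np.arange(n)[:, None]
    even = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension within each position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v, causal=False):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Mask strictly-future positions so they get zero weight.
        n = x.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

def transformer_layer(x, p, causal=False):
    attn_out, weights = self_attention(x, p["W_q"], p["W_k"], p["W_v"], causal)
    x = layer_norm(x + attn_out)             # residual, then normalization
    hidden = np.maximum(0.0, x @ p["W_1"])   # feedforward with ReLU
    x = layer_norm(x + hidden @ p["W_2"])    # residual, then normalization
    return x, weights

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, d_ff = 6, 8, 32
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
          "W_v": (d_model, d_model), "W_1": (d_model, d_ff),
          "W_2": (d_ff, d_model)}
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}

x = rng.normal(size=(n, d_model)) + sinusoidal_positions(n, d_model)
y, w_full = transformer_layer(x, params)
_, w_causal = transformer_layer(x, params, causal=True)

print(y.shape)                                   # (6, 8)
print(np.allclose(np.triu(w_causal, k=1), 0.0))  # True
```

With causal=False every position can draw on every other, as in an encoder-only model. With causal=True the weights above the diagonal are zero, so position i sees only positions up to and including i, as in a decoder-only model.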

Fill in each blank with the correct term.

In self-attention, the three learned projections are Query, Key, and _____. The scores are normalized with a _____ function to produce attention weights that sum to one.

Why Transformers Transformed the Field

Three properties explain the Transformer's dominance.

Full parallelism: Self-attention computes all pairwise interactions simultaneously. The entire sequence is processed with a handful of matrix multiplications per layer, not T sequential steps. This maps well to GPU and TPU hardware, enabling training on datasets orders of magnitude larger than were practical for RNNs.

Direct long-range connections: In an RNN, information from position 1 must pass through T-1 hidden states to reach position T. In a Transformer, position T attends directly to position 1 in a single layer, so the path between any two positions has constant length and long-range dependencies no longer have to survive many recurrent steps.

Scalability: Transformers improve predictably as you increase model size, data, and compute, a property empirically documented in scaling laws. GPT-3 (175 billion parameters) showed that scaling up a decoder-only Transformer could produce capabilities, such as few-shot learning from examples in the prompt, that were weak or absent in smaller models. This encouraged the push to ever-larger models.

The cost is quadratic: computing attention between all pairs in a sequence of length n requires O(n^2) operations and O(n^2) memory. For n=512 this is fine; for n=100,000 (long documents, high-resolution images) it becomes prohibitive. Active research (sparse attention, linear attention approximations, state-space models) addresses this.
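To make the quadratic cost concrete, the following back-of-the-envelope sketch counts the memory needed to store one full n-by-n attention score matrix. The assumptions are loud ones: 4 bytes per score (32-bit floats), a single head in a single layer, and a naive implementation that materializes the whole matrix at once. The function name is hypothetical.

```python
def attention_matrix_bytes(n, bytes_per_score=4):
    # One score per (query position, key position) pair.
    return n * n * bytes_per_score

for n in (512, 100_000):
    size = attention_matrix_bytes(n)
    print(f"n={n}: {n * n} scores, {size / 2**30:.4f} GiB")
```

Under these assumptions, n=512 needs 262,144 scores (1 MiB), while n=100,000 needs 10 billion scores (about 37 GiB) for one head in one layer, before counting activations, gradients, or additional heads.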

Transformers Are Not Free

Training a large Transformer from scratch requires thousands of GPU-hours and megawatt-hours of energy. The quadratic attention cost limits context length. And despite their power, Transformers do not inherently understand the world: they learn statistical patterns in data and can generate fluent, confident text that is factually wrong. Capability does not equal reliability.

Why does multi-head attention use several independent attention operations rather than one large one?

A decoder-only Transformer generating text uses causal masking. What does this prevent, and why is it necessary for text generation?

Attention Weight Interpretation

  1. Write this sentence on paper: 'The trophy did not fit in the suitcase because it was too big.'
  2. The word 'it' is ambiguous. Does it refer to 'trophy' or 'suitcase'? Resolve it by reading the sentence.
  3. Imagine you are a self-attention head trying to resolve 'it'. Write a list of the other words you would give high attention weight to, and justify each choice.
  4. Now consider the sentence: 'The trophy did not fit in the suitcase because it was too small.' Which word does 'it' refer to now? What changed in your attention weights?
  5. This task, coreference resolution, is one that BERT-like models solve via attention. Discuss: what does your analysis suggest about what an attention head must 'learn' to resolve pronouns correctly?