Networks That Understand Language
Language is slippery. The word 'bank' means something completely different in 'riverbank' and 'bank account.' The sentence 'I saw the man with the telescope' has two readings: either I used the telescope to see the man, or the man I saw was holding it. Computers had enormous trouble with this kind of ambiguity for decades. Deep learning, and specifically a design called the transformer, gave machines a far more effective way to handle it.
Turning Words Into Numbers
A neural network only works with numbers. Before any language model can process text, every word must become a number — or, more usefully, a vector: a list of dozens or hundreds of numbers that encodes the word's meaning in relation to every other word. Think of it as coordinates in a huge meaning-space. Words with similar meanings end up near each other in that space. 'Dog' and 'puppy' are close. 'Dog' and 'democracy' are far apart. Remarkably, directions in the space carry meaning: the direction from 'king' to 'queen' is roughly the same as the direction from 'man' to 'woman.' These number-lists are called word embeddings. Embeddings feed into the network's layers, which process the whole sequence of words together to figure out what the text means.
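To make the meaning-space idea concrete, here is a minimal sketch in Python. The vectors are invented by hand for illustration (four made-up axes: royalty, maleness, animal-ness, politics); real embeddings are learned from data and have far more dimensions. The helper names are hypothetical, not part of any library.

```python
import math

# Hand-made 4-number vectors, invented for illustration only.
# Made-up axes: royalty, maleness, animal-ness, politics.
embeddings = {
    "king":      [0.9, 0.9, 0.0, 0.0],
    "queen":     [0.9, 0.1, 0.0, 0.0],
    "man":       [0.1, 0.9, 0.0, 0.0],
    "woman":     [0.1, 0.1, 0.0, 0.0],
    "dog":       [0.0, 0.5, 0.9, 0.0],
    "puppy":     [0.0, 0.4, 0.8, 0.0],
    "democracy": [0.1, 0.0, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Return 1.0 when two vectors point the same way, 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def subtract_vectors(a, b):
    return [x - y for x, y in zip(a, b)]

def add_vectors(a, b):
    return [x + y for x, y in zip(a, b)]

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vector."""
    candidates = [word for word in embeddings if word not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], vector))

# Similar meanings sit close together; unrelated meanings sit far apart.
dog_puppy = cosine_similarity(embeddings["dog"], embeddings["puppy"])
dog_democracy = cosine_similarity(embeddings["dog"], embeddings["democracy"])
print("dog vs puppy:", round(dog_puppy, 3))
print("dog vs democracy:", round(dog_democracy, 3))

# Directions carry meaning: king -> queen matches man -> woman.
king_to_queen = subtract_vectors(embeddings["queen"], embeddings["king"])
man_to_woman = subtract_vectors(embeddings["woman"], embeddings["man"])
direction_match = cosine_similarity(king_to_queen, man_to_woman)
print("direction match:", round(direction_match, 3))

# The classic analogy: king - man + woman lands nearest to queen.
analogy = add_vectors(
    subtract_vectors(embeddings["king"], embeddings["man"]),
    embeddings["woman"],
)
analogy_word = nearest(analogy, exclude=("king", "man", "woman"))
print("king - man + woman is nearest to:", analogy_word)
```

Cosine similarity is used here because it compares the direction of two vectors rather than their length, which is a common way to measure closeness between embeddings.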
A transformer is a deep neural network architecture designed for sequences such as text, speech, and DNA. Its key innovation is the attention mechanism: instead of processing words one at a time from left to right, as earlier recurrent networks did, a transformer looks at all the words at once and learns which words to pay attention to when interpreting each word. Because any word can connect directly to any other, the network can capture long-range relationships in a sentence or document.
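The core arithmetic of attention can be sketched in a few lines of Python. This is a simplified, single-word version of scaled dot-product attention, shown for the word 'bank' in the phrase 'the river bank'. The query, key, and value vectors below are hand-picked assumptions; in a real transformer they are computed from the embeddings by learned weight matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query word.

    Scores each key against the query, converts the scores to weights,
    and returns the weights plus the weighted average of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

words = ["the", "river", "bank"]

# Hand-picked toy vectors (assumptions, not learned values).
keys = [[0.1, 0.0], [1.0, 0.2], [0.3, 0.9]]
values = [[0.0, 0.1], [1.0, 0.0], [0.2, 0.8]]
bank_query = [3.0, 0.0]  # what 'bank' is "looking for" in the other words

weights, output = attend(bank_query, keys, values)
for word, weight in zip(words, weights):
    print(f"bank -> {word}: {weight:.2f}")
```

With these toy numbers, 'bank' gives most of its attention to 'river', so the output vector for 'bank' is pulled toward the river-related meaning. A full transformer runs this calculation for every word at the same time, and in many parallel "heads".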
Consider the sentence: 'The trophy did not fit in the suitcase because it was too big.' What does 'it' refer to: the trophy or the suitcase? A human knows immediately that the trophy was too big. A language model using attention can work this out because attention lets the word 'it' look across the whole sentence and assign a high attention score to 'trophy' and a low score to 'suitcase.'
GPT (Generative Pre-trained Transformer), the family of models behind ChatGPT, is trained on hundreds of billions of words from the internet, books, and code. The training task is simple in principle: predict the next word. Do that billions of times, adjust the weights each time the prediction is wrong, and the model gradually builds an internal representation of grammar, facts, reasoning patterns, and style.
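The next-word task can be illustrated with a much simpler stand-in. The sketch below is not a neural network and does not adjust weights: it only counts which word follows which in a tiny made-up corpus, then predicts the most frequent follower. GPT-style models learn far richer patterns, but the goal is the same. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, how often each other word comes right after it."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def predict_next(counts, word):
    """Return the word seen most often after `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# A tiny invented corpus; real models train on hundreds of billions of words.
corpus = "the dog chased the ball . the dog ate the food . the cat chased the dog ."
model = train_bigram_model(corpus)

print("after 'the':", predict_next(model, "the"))
print("after 'chased':", predict_next(model, "chased"))
print("after 'zebra':", predict_next(model, "zebra"))
```

Because this counting model only looks one word back, it cannot resolve something like the 'it' in the trophy sentence. That gap is what attention over the whole sequence is meant to close.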
What Language Models Can and Cannot Do
Language models that emerged from transformer architectures are remarkably capable. They can translate between languages, summarize documents, answer factual questions, write code, explain concepts, and generate fluent prose in any style. But they have real limits rooted in how they work. They predict likely sequences of words based on patterns in training data — they do not look things up in a live database unless given special tools. They can produce confident-sounding text that is factually wrong, a problem called hallucination (covered in depth in Lesson 7). They encode the biases present in their training data. And they have no persistent memory of past conversations unless the conversation text is included in their input. Understanding the mechanism explains the limits. A model that learned by predicting text will sometimes predict plausible-sounding but false text.
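The point about memory can be sketched with a toy stand-in. The function `toy_model` below is a hypothetical placeholder, not a real language model or API: it simply shows that the reply depends only on the text passed in, so earlier messages matter only if they are included in the input.

```python
def toy_model(prompt):
    """Hypothetical stand-in for a language model: a pure function of its input text."""
    marker = "my name is "
    lowered = prompt.lower()
    asks_for_name = "what is my name" in lowered
    if asks_for_name and marker in lowered:
        start = lowered.index(marker) + len(marker)
        name = prompt[start:].split()[0].strip(".,!?")
        return f"Your name is {name}."
    if asks_for_name:
        return "I don't know your name."
    return "Nice to meet you."

def build_prompt(history, new_message):
    """Join earlier messages and the new message into one input text."""
    return "\n".join(history + [new_message])

history = ["User: My name is Ada."]
question = "User: What is my name?"

# Without the earlier message, the model has nothing to go on.
reply_without_history = toy_model(question)

# With the earlier message included in the input, it can answer.
reply_with_history = toy_model(build_prompt(history, question))

print(reply_without_history)
print(reply_with_history)
```

Chat applications typically create the appearance of memory in a similar way, by resending earlier turns of the conversation as part of each new input.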
A language model can write a beautifully grammatical paragraph containing completely wrong information. Fluency and accuracy are separate things. Always verify important facts from a language model against reliable sources.
What is a word embedding?
Why can a transformer understand long-range relationships in a sentence better than older models?
Attention Mapping
- Write this sentence on paper: 'The student passed the exam because she studied hard.'
- For every other word in the sentence, decide how strongly 'she' should connect to it.
- Draw the connections: thick line = strong connection, thin line = weak, no line = irrelevant.
- Compare with a partner: which words did you both agree 'she' should pay most attention to?
- Discuss: why does a machine need to solve this same problem to understand the sentence?