AI Foundations

⏱ About 15 min · 15 XP

How Image Generators Work

Text generators, as you learned in Lesson 2, predict the next token based on patterns in language. But what does prediction even mean for images? A 512-by-512-pixel image has 262,144 pixels, each with its own color value, so the space of possible images is astronomically larger than the vocabulary of any language. Yet image generators produce stunning, coherent visual art from a few words of description. The mechanism is elegant, and understanding it will sharpen the way you use these tools.
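To get a feel for the scale, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes a square 512-by-512 image with three color channels (red, green, blue) and 256 possible levels per channel, details the lesson itself does not specify.

```python
import math

width = height = 512   # square image, matching the 262,144-pixel figure
channels = 3           # assumption: red, green, and blue per pixel
levels = 256           # assumption: 8 bits (256 levels) per channel

pixels = width * height
values = pixels * channels

# The number of distinct images is levels ** values. That integer is far
# too large to be worth building, so count its decimal digits with logs.
digits = math.floor(values * math.log10(levels)) + 1

print(pixels)   # 262144
print(values)   # 786432
print(digits)   # number of digits in the count of possible images
```

Under these assumptions the count of possible images is a number with roughly 1.9 million digits. Almost all of those images look like static, which is why a generator needs training to find the tiny fraction that look like anything.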

Learning from Image-Caption Pairs

Modern image generators are trained on datasets containing hundreds of millions — sometimes billions — of images, each paired with a text caption or alt-text description. The model's job during training is to learn the relationship between visual content and language. It learns that 'a golden retriever on a beach at sunset' involves specific colors, textures, lighting conditions, compositions, and subject arrangements. It learns this not from rules written by artists, but from seeing the pattern play out across millions of examples. This joint training on images and text is what allows the model to accept a text prompt and produce a relevant image. The model has built a shared space where visual concepts and language concepts are tightly linked.
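One way to picture a shared space is to imagine every caption and every image being turned into a list of numbers, with related items ending up close together. The sketch below is a toy under loud assumptions: the three-number vectors and their meanings are invented by hand for illustration, whereas a real model learns much longer vectors from data.

```python
import math

def cosine(a, b):
    """Similarity of two vectors: near 1.0 means same direction, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical vectors, invented for this example only.
# Invented axes: dog-ness, beach-ness, city-ness.
caption_vectors = {
    "a golden retriever on a beach at sunset": [0.9, 0.8, 0.1],
    "a busy city street at night": [0.0, 0.1, 0.9],
}
image_vector = [0.8, 0.9, 0.0]  # pretend this came from a photo of a dog on a beach

scores = {caption: cosine(vec, image_vector)
          for caption, vec in caption_vectors.items()}
best = max(scores, key=scores.get)
print(best)
```

The beach caption scores far higher than the city caption because its vector points in nearly the same direction as the image vector. That closeness is the toy version of "linked" in the paragraph above.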

How Image Generators Learn

Image generators are trained on massive datasets of image-caption pairs. By learning to connect text descriptions with visual content across millions of examples, the model builds a shared conceptual space where language and imagery are linked — enabling it to create images from written descriptions.

The most widely used family of image generators today uses a technique called diffusion. Here is how it works, conceptually. During training, the model repeatedly takes a real image and adds random noise to it in controlled steps, like slowly frosting a window until you can no longer see through it. Then it learns to reverse this process: given a noisy image, predict what the clean image underneath looked like. By training on millions of examples of this noise-then-denoise cycle, the model becomes expert at reconstructing images from static.

At generation time, the model starts with pure random noise (just static) and applies the denoising it learned, one step at a time, guided by your text prompt. Each step makes the image slightly less noisy and slightly more aligned with your description. After a few dozen denoising steps, typically 20 to 50, a coherent image emerges from what was pure randomness. The text prompt acts as a compass pointing the denoising process toward the right visual territory.
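The noise-then-denoise cycle can be sketched as a toy program. This is not how a real image generator is implemented: the "image" is just four numbers, and `toy_denoiser` is a hypothetical placeholder that returns a known target instead of a trained model's prediction. The sketch only shows the shape of the loop, which starts from static and closes the gap a little at every step.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def add_noise(image, amount):
    """Training side: blend an image with random noise.
    amount=0.0 returns the image unchanged; amount=1.0 returns pure noise."""
    return [(1 - amount) * p + amount * random.gauss(0, 1) for p in image]

def toy_denoiser(noisy_image, target):
    """Hypothetical stand-in for the trained model. A real model would
    predict the clean image from noisy_image and the text prompt; this
    placeholder just returns a known target so the loop can run."""
    return target

def mean_error(image, target):
    """Average distance between the current image and the target."""
    return sum(abs(p - t) for p, t in zip(image, target)) / len(target)

def generate(target, steps=30):
    """Generation side: start from static and denoise a little per step."""
    image = [random.gauss(0, 1) for _ in target]  # pure random noise
    errors = [mean_error(image, target)]
    for step in range(steps):
        guess = toy_denoiser(image, target)
        remaining = steps - step
        # Close 1/remaining of the gap, so the final step lands on the guess.
        image = [p + (g - p) / remaining for p, g in zip(image, guess)]
        errors.append(mean_error(image, target))
    return image, errors

clean = [0.0, 0.5, 1.0, 0.5]      # a tiny 4-"pixel" image
frosted = add_noise(clean, 0.8)   # mostly noise, a little image left
result, errors = generate(clean)
print(round(errors[0], 3), round(errors[-1], 3))
```

The printed pair shows the error at the start (pure static) and at the end (matching the target). In a real diffusion model the hard part is everything hidden inside the denoiser, which has to guess the clean image from the noisy pixels and the prompt alone.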

Why the Output Is Never Quite Real

Images produced by diffusion models have a distinctive character. They are often strikingly beautiful and eerily plausible, but they are not photographs. Close inspection typically reveals tells: hands with too many or too few fingers, teeth that blur strangely, text in images that is garbled, or backgrounds that become inconsistent at the edges. These artifacts happen because the model learned statistics, not physics. It knows that human faces have two eyes, a nose, and a mouth in roughly the right places, but it does not have a geometric model of the human hand that guarantees exactly five distinct fingers. When asked to generate something rare or compositionally complex, the statistical patterns can break down in interesting ways. This also explains why a generated image is not a record of a real event. Even when the subject resembles a real person, the model has assembled a plausible likeness from patterns; it has not captured that person in an actual situation. (This creates serious ethical issues, which we will touch on in Lesson 8.)

Flashcards

The Hands Problem

A well-known limitation of current image generators is difficulty rendering human hands correctly. Fingers are often added, removed, fused, or bent in anatomically impossible ways. This is a statistical failure: hands in varied poses, at different sizes and angles, are hard to learn accurately from image data alone. It is a reminder that these models capture patterns, not physical laws.

What does a diffusion model start with when generating a new image?

Why do image generators sometimes produce hands with the wrong number of fingers?

Prompt and Predict

  1. Without using an image generator (just your knowledge from this lesson), write three text prompts you think would be easy for an image generator to handle well, and three you think would be difficult.
  2. For each difficult one, explain WHY you think it would be hard — based on what you now know about how diffusion models learn.
  3. If you have access to an image generator (like Adobe Firefly or a classroom-approved tool), test your predictions. Did the model struggle where you expected?
  4. Share your findings: what makes a subject or scene compositionally hard for a diffusion model?