Deep Learning Investigation
You have spent eight lessons building a vocabulary for how deep learning works, where it came from, what it costs, and where it breaks. Now it is time to use that vocabulary on something real. Investigations are how knowledge becomes skill: reading about cycling does not make you a cyclist; you have to ride. In this lesson, you do the analysis yourself.
The Application: YouTube Recommendations
YouTube recommends videos to over two billion users. The goal is to predict which video each specific user is most likely to watch next and enjoy, chosen from a library of over 800 million videos. This is a deep learning problem at an almost incomprehensible scale.
The architecture is a multi-stage recommendation system. The first stage, called candidate generation, uses a relatively simple network to rapidly narrow 800 million videos down to a few hundred candidates for a given user. It takes features like your watch history, search history, and demographic information as inputs and produces a compact vector representing your current taste.
The second stage, called ranking, uses a deeper, more complex network to score each of those few hundred candidates carefully. It considers much richer features: how long you typically watch a video before stopping, how recently you watched a similar video, how the video has performed with similar users, the time of day, and whether the thumbnail changed recently.
The training data is user behavior (clicks, watch time, likes, skips), logged continuously for hundreds of millions of users. The feedback signal is watch time, not clicks, because YouTube found that optimizing for clicks alone produced sensationalist content that users watched briefly and then rated as unsatisfying.
This single design choice, watch time instead of clicks, had massive consequences for what content the system promoted. It rewarded longer videos. It rewarded content that created tension or emotional engagement. Researchers have documented that optimizing watch time can inadvertently push users toward increasingly extreme content because extreme content is emotionally engaging. YouTube has since added additional signals and human review to counter this.
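The two-stage design described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not YouTube's implementation: the catalog, the vector size, the feature name `avg_watch_fraction`, and both scoring functions are hypothetical stand-ins chosen only to show the shape of the pipeline (a cheap score over everything, then a costlier score over a short list).

```python
import random

random.seed(0)  # deterministic toy data

DIM = 8  # size of the hypothetical taste and video vectors


def random_vector():
    return [random.gauss(0, 1) for _ in range(DIM)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical catalog: each video has an embedding plus one "richer" feature.
catalog = {
    f"video_{i}": {
        "embedding": random_vector(),
        "avg_watch_fraction": random.random(),
    }
    for i in range(10_000)
}


def generate_candidates(user_vector, k=200):
    """Stage 1: cheap similarity score over the whole catalog, keep the top k."""
    by_similarity = sorted(
        catalog,
        key=lambda vid: dot(user_vector, catalog[vid]["embedding"]),
        reverse=True,
    )
    return by_similarity[:k]


def rank(user_vector, candidates, top_n=10):
    """Stage 2: a richer score, applied only to the short candidate list."""

    def score(vid):
        video = catalog[vid]
        # Assumed scoring rule: similarity weighted by how much of the video
        # similar viewers tend to watch.
        return dot(user_vector, video["embedding"]) * video["avg_watch_fraction"]

    return sorted(candidates, key=score, reverse=True)[:top_n]


user_vector = random_vector()  # stands in for the learned "taste" vector
candidates = generate_candidates(user_vector)
recommendations = rank(user_vector, candidates)
print(len(catalog), "->", len(candidates), "->", len(recommendations))
```

The point of the split is cost: the expensive score runs on a few hundred items instead of the full library. A real system would replace the full sort in stage 1 with an approximate nearest-neighbor lookup and both scores with trained networks.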
A recommendation system does what it is trained to do — maximize whatever signal you defined as success. If that signal is imperfectly aligned with user well-being, the system will optimize hard for the imperfect proxy. Choosing the right training objective is one of the most consequential decisions in designing a deep learning system.
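To see how much the choice of objective matters, here is a minimal sketch that ranks the same three videos under two different success signals. The videos and all numbers are invented for illustration, and `expected_watch_minutes` is an assumed simplification (click rate times average minutes watched), not a formula from YouTube.

```python
# Hypothetical logged statistics per video; the numbers are made up.
videos = [
    {"title": "SHOCKING clickbait", "click_rate": 0.30, "avg_minutes_watched": 0.5},
    {"title": "In-depth tutorial", "click_rate": 0.08, "avg_minutes_watched": 12.0},
    {"title": "Short music clip", "click_rate": 0.15, "avg_minutes_watched": 3.0},
]


def expected_clicks(video):
    """Objective A: how often the video gets clicked when shown."""
    return video["click_rate"]


def expected_watch_minutes(video):
    """Objective B: minutes watched per impression (assumed simplification)."""
    return video["click_rate"] * video["avg_minutes_watched"]


def top_by(objective):
    """Return the title the system would promote under the given objective."""
    return max(videos, key=objective)["title"]


best_for_clicks = top_by(expected_clicks)
best_for_watch_time = top_by(expected_watch_minutes)
print("Optimizing clicks promotes:", best_for_clicks)
print("Optimizing watch time promotes:", best_for_watch_time)
```

Nothing about the videos changed between the two calls; only the definition of success did. The same mechanism explains why a watch-time objective favors long or emotionally gripping videos: the system promotes whatever scores highest on the signal it was given.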
Failure Mode Analysis
Let us apply Lesson 7's framework to YouTube recommendations.
Brittleness: The model is trained on historical behavior. A new user has no history, so the model has nothing to work with. This is called the cold-start problem. It often recommends generic popular content to new users regardless of their actual interests.
Bias: If historically certain communities or topics were under-represented in the training data, those communities' content may receive fewer recommendations, making it harder for those creators to build an audience. The system encodes and perpetuates existing inequalities in creator visibility.
Hallucination (less applicable here): Recommendation models do not produce text, so classical hallucination is not the primary risk. But they can surface misleading or false information in recommended videos; the content quality problem is a close analog.
Adversarial examples: Creators have learned to game the recommendation system by using clickbait thumbnails, planting trending keywords, and releasing videos at times that exploit the model's patterns. These are adversarial inputs designed to exploit the system's learned decision boundary.
Oversight in Practice
YouTube's recommendation system sits at an intersection of advertising revenue, creator income, and user time. The company has added human editorial teams to review trending content, policy teams that set rules the model must follow, and transparency reports. Critics argue these measures are insufficient given the system's global reach. This is a real, ongoing debate, not a solved problem. The same question recurs with every deep learning system deployed at scale: who is responsible for what the optimization produces?
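Returning to the cold-start problem from the failure mode analysis, the fallback behavior can be sketched as follows. The video names, topics, and the "favorite topic" rule are all hypothetical; a real system would use a trained model, but the branch on an empty history is the part that matters here.

```python
# Hypothetical data: a global popularity list and a few videos per topic.
popular_videos = ["top_hit_1", "top_hit_2", "top_hit_3"]
videos_by_topic = {
    "cooking": ["knife_skills", "bread_basics"],
    "chess": ["opening_traps", "endgame_study"],
}


def recommend(watch_history, n=3):
    """Personalize when history exists; otherwise fall back to popular videos."""
    if not watch_history:
        # Cold start: nothing to learn from, so every new user gets the same list.
        return popular_videos[:n]

    # Toy personalization: favor the topic the user watched most often.
    topic_counts = {}
    for topic in watch_history:
        topic_counts[topic] = topic_counts.get(topic, 0) + 1
    favorite = max(topic_counts, key=topic_counts.get)
    return videos_by_topic[favorite][:n]


new_user = recommend([])
returning_user = recommend(["chess", "cooking", "chess"])
print("New user:", new_user)
print("Returning user:", returning_user)
```

The new user receives generic popular content regardless of actual interests, which is exactly the brittleness described above.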
Investigation Framework
For any deep learning application you encounter, these four questions form a useful investigation framework.
1. What is the input and what is the output? What data does the system consume and what decision or content does it produce?
2. What was the training objective? What signal did the system optimize for? Is that signal a perfect proxy for the outcome you actually want?
3. Which failure modes are most likely and most harmful? Of brittleness, bias, hallucination, and adversarial exploitation, which applies here and with what severity?
4. Who has oversight? Is there a human in the loop for consequential decisions? Who is accountable when it fails?
These questions do not require access to the model's code or weights. They require only understanding the mechanism, which you now have.
You will rarely have complete information about a deployed AI system. A good investigation states what you know, what you inferred, and what you could not determine. Intellectual honesty about the limits of your analysis is part of the skill.
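One way to keep an investigation honest is to label every statement as known, inferred, or undetermined as you write it down. The sketch below is one possible note-taking structure, not a standard tool: the class names, the question keys, and the status labels are all hypothetical, and the example statements are taken from this lesson's YouTube discussion.

```python
from dataclasses import dataclass, field

# The four framework questions, as short keys (names chosen for this sketch).
QUESTIONS = ("input_and_output", "training_objective", "failure_modes", "oversight")
STATUSES = ("known", "inferred", "undetermined")


@dataclass
class Finding:
    question: str
    statement: str
    status: str


@dataclass
class Investigation:
    system: str
    findings: list = field(default_factory=list)

    def add(self, question, statement, status):
        """Record one statement, rejecting unknown questions or status labels."""
        if question not in QUESTIONS:
            raise ValueError(f"unknown question: {question}")
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.findings.append(Finding(question, statement, status))

    def unanswered(self):
        """Framework questions with no recorded statement yet."""
        answered = {f.question for f in self.findings}
        return [q for q in QUESTIONS if q not in answered]

    def open_gaps(self):
        """Statements explicitly marked as not yet determined."""
        return [f.statement for f in self.findings if f.status == "undetermined"]


notes = Investigation("YouTube recommendations")
notes.add("input_and_output", "Consumes watch and search history; outputs ranked videos.", "known")
notes.add("training_objective", "Optimizes watch time rather than clicks.", "known")
notes.add("training_objective", "Exact weighting of newer signals.", "undetermined")

print("Still to answer:", notes.unanswered())
print("Open gaps:", notes.open_gaps())
```

Listing the unanswered questions and open gaps at the end gives you the raw material for the closing sentence about what you do not know.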
Flashcards
Why did YouTube switch from optimizing for clicks to optimizing for watch time?
Which investigation question is most directly about alignment between the training objective and actual user well-being?
Investigate an AI System
- Choose an AI system from your own life: a music streaming recommendation, a search engine, an AI writing assistant, a content feed, or another.
- Write a one-page investigation using the four-question framework from this lesson.
- For each question, write two to three sentences that are as specific as possible about your chosen system.
- At the end, write one sentence stating the most important thing you do not know about this system and how you might find out.
- Share investigations in groups of three. Which system generated the most concern? Which had the best oversight?