The Cutting Edge
Research in AI moves fast enough that textbooks are obsolete before they are printed. But beneath the pace of announcements and benchmarks, there are durable ideas that define where the field is heading. This lesson examines three research directions that were reshaping AI capabilities as of 2026: multimodal models, AI agents, and tool use. The goal is not to cover every new system — the names will change — but to give you the conceptual vocabulary to understand what is new and why it matters.
Multimodal Models
Early deep learning models were specialized by modality: one model for images, another for text, another for audio. Multimodal models process several modalities within a single architecture and learn representations that bridge them. The key insight is that different modalities often encode the same underlying concepts: a photograph of a dog and the word 'dog' refer to the same entity. If you train a model to align text and image representations, so that the embedding of a dog photo lies close to the embedding of the text 'dog' in a shared space, the model develops knowledge that transfers across tasks.
CLIP (Contrastive Language-Image Pretraining, OpenAI, 2021) demonstrated this. Trained on 400 million image-text pairs from the internet, CLIP could classify images into categories it had never been explicitly trained on, simply by comparing the image embedding to text embeddings of candidate category names and picking the closest match. This zero-shot capability emerged from alignment training, not from labeled examples of each category.
More recent multimodal models go further: they accept interleaved sequences of text, images, and in some cases audio as input and generate text (and sometimes images) as output. OpenAI's GPT-4V (the V stands for vision) accepts images embedded in a conversation and reasons about them in text. Google's Gemini was trained from the ground up to be natively multimodal across text, code, images, and audio. These models can describe images, transcribe speech, read documents, and answer questions that require synthesizing information across modalities.
The limitations are real. Multimodal models can hallucinate descriptions of images (confidently describing content that is not present), misread text in images, and fail on spatial reasoning tasks that humans find trivial. Their multimodal capability does not imply general world understanding; it implies statistical associations between modalities, learned at very large scale.
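The zero-shot classification step described for CLIP can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the three-dimensional embeddings and the prompt strings are invented for the example, and in a real system both would come from trained image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {
        label: cosine(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, made up for illustration (assumption: a shared 3-d space).
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]  # hypothetical encoder output for a dog photo

label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(label)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Note that no classifier was trained on these three labels. Adding a new category only requires embedding one more text prompt, which is what makes the approach zero-shot.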
Contrastive training aligns representations by pulling together embeddings of matched pairs (an image and its caption) and pushing apart embeddings of unmatched pairs. When done at scale across hundreds of millions of pairs, the resulting embedding space captures semantic similarity across modalities — enabling zero-shot transfer and cross-modal retrieval.
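To make the pull-together, push-apart idea concrete, here is a minimal sketch of a symmetric contrastive loss over a small batch. It assumes that row i of the image embeddings is matched with row i of the text embeddings; the function names, the toy vectors, and the temperature value are illustrative choices, not taken from any particular implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy losses.

    Assumption: image_embs[i] and text_embs[i] form a matched pair, and
    every other combination in the batch is treated as unmatched.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy batch: two images and two captions (values invented for illustration).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # caption i matches image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

loss_aligned = contrastive_loss(image_embs, aligned_texts)
loss_shuffled = contrastive_loss(image_embs, shuffled_texts)
print(loss_aligned < loss_shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, so minimizing it moves matched embeddings together and unmatched ones apart.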
AI Agents and Tool Use
AI agents represent a different kind of research direction. An agent is a system that acts over time to achieve a goal, taking a sequence of actions rather than producing a single output. The term has a precise meaning in reinforcement learning: an agent observes state, selects actions, and receives reward. In more recent usage, 'AI agent' typically refers to a language model that is given tools (the ability to search the web, run code, read and write files, or call APIs) and tasked with completing a multi-step goal by deciding which tools to call and in what order.
Tool use is the mechanism that enables agents. A language model, on its own, can only output text. Give it access to a Python interpreter and it can write and execute code, see the result, and iterate. Give it access to a web search API and it can look up current information. Give it access to a calendar API and it can schedule meetings. The model decides when to call a tool and with what arguments, receives the output, and continues reasoning. This pattern is sometimes called ReAct (Reasoning and Acting), reflecting that the model interleaves reasoning steps with action steps.
What does this enable? An agent tasked with 'summarize the five most recent papers on CRISPR delivery mechanisms' can decompose that into: search for recent papers, retrieve PDFs, extract key findings from each, synthesize across them, and produce a structured summary. A human researcher might take hours; an agent might take minutes, assuming the tools work and the model reasons correctly about what to do with them.
The failure modes are significant. Current agents hallucinate tool calls (invoking APIs that don't exist, or passing malformed arguments), lose track of their goal over long sequences of steps, fail on tasks that require precise state tracking, and can take irreversible actions (deleting files, sending emails) based on misunderstandings. Evaluating agent reliability is also hard: unlike a classifier, which can be scored against a fixed test set with one correct label per input, an agent faces an enormous space of possible multi-step tasks, and benchmarks that cover them are difficult to construct.
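The tool-calling loop described above can be sketched as follows. This is a toy under stated assumptions: `scripted_model` is a hypothetical stand-in for a language model that returns hard-coded steps, `search` returns canned text rather than querying the web, and the step format (a dict with `type`, `name`, and `args`) is invented for the example. The first scripted step deliberately calls a tool that does not exist, to show one way a loop might handle a hallucinated tool call.

```python
def search(query):
    """Stand-in for a web search tool; returns canned text (assumption)."""
    corpus = {"boiling point of water": "100 degrees Celsius at sea level"}
    return corpus.get(query.lower(), "no results")

def scripted_model(goal, history):
    """Hypothetical stand-in for a language model choosing the next step."""
    if not history:
        # Simulated hallucination: this tool is not registered.
        return {"type": "tool", "name": "lookup_weather", "args": {}}
    if len(history) == 1:
        return {"type": "tool", "name": "search", "args": {"query": goal}}
    return {"type": "finish", "answer": history[-1]["observation"]}

def run_agent(model, tools, goal, max_steps=5):
    """Alternate model decisions and tool calls until the model finishes."""
    history = []
    for _ in range(max_steps):
        step = model(goal, history)
        if step["type"] == "finish":
            return step["answer"], history
        tool = tools.get(step["name"])
        if tool is None:
            # Feed the error back so the model can try something else.
            observation = f"error: unknown tool {step['name']!r}"
        else:
            try:
                observation = tool(**step["args"])
            except TypeError as exc:
                observation = f"error: bad arguments ({exc})"
        history.append({"call": step, "observation": observation})
    return None, history  # step budget exhausted without an answer

answer, history = run_agent(
    scripted_model, {"search": search}, "boiling point of water"
)
print(answer)
for entry in history:
    print(entry["call"]["name"], "->", entry["observation"])
```

The `max_steps` budget is one simple guard against an agent that loses track of its goal; returning errors as observations, rather than crashing, lets the model recover from a bad call.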
A language model that produces wrong text can be corrected before the output is used. An agent that deletes files, sends emails, or purchases items based on a misunderstanding has already altered the world. This irreversibility makes reliability requirements for agents much stricter than for models used only to generate text for human review. Deploying agents in consequential settings without robust human-in-the-loop checks is an active safety concern.
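One simple form of human-in-the-loop check is an approval gate in front of irreversible tools. The sketch below is illustrative only: the tool names, the `IRREVERSIBLE` set, and the `approve` callback are hypothetical, and `delete_file` merely records the path instead of touching the filesystem.

```python
# Tools whose effects cannot be undone (hypothetical names for illustration).
IRREVERSIBLE = {"delete_file", "send_email", "purchase"}

deleted = []  # records simulated deletions; nothing is removed from disk

def read_file(path):
    """Simulated read-only tool."""
    return f"contents of {path}"

def delete_file(path):
    """Simulated destructive tool: only logs the request."""
    deleted.append(path)
    return f"deleted {path}"

def guarded_call(name, args, tools, approve):
    """Run a tool, but require human approval first if it is irreversible."""
    if name in IRREVERSIBLE and not approve(name, args):
        return "blocked: human approval required"
    return tools[name](**args)

tools = {"read_file": read_file, "delete_file": delete_file}

def deny_all(name, args):
    """Stand-in for a human reviewer who rejects every request."""
    return False

print(guarded_call("read_file", {"path": "notes.txt"}, tools, deny_all))
print(guarded_call("delete_file", {"path": "notes.txt"}, tools, deny_all))
print(deleted)
```

Reversible actions pass straight through, so the gate adds friction only where a mistake would be costly. In practice `approve` might prompt a person and show the exact arguments the agent intends to use.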
Flashcards
Zero-shot image classification using CLIP works by:
Why is evaluating AI agent reliability harder than evaluating a standard classifier?
Design an Agent Task
- Design a multi-step task you would want an AI agent to complete for you — something genuinely useful, not trivial.
- List every tool the agent would need access to (web search, code execution, file reading, email, calendar, etc.).
- Trace the sequence of steps the agent should take to complete the task, including what it should do if a step fails.
- Now identify: which step is most likely to go wrong? What is the worst thing that could happen if the agent makes a mistake at that step?
- Is that mistake reversible? If not, how would you redesign the task or the agent's permissions to make it safer?
- Write your design as a brief specification (goal, tools, steps, failure modes, safeguards) and compare with a classmate.