AI Safety, Alignment & Ethics

The AI Safety Research Field

AI safety is a legitimate and active field of academic and industrial research — not science fiction, not a public-relations exercise. It has peer-reviewed publications, dedicated research organizations, graduate programs, and funded research agendas. Understanding what AI safety researchers actually work on — concretely, technically, and with what current results — is essential for anyone who wants to engage intelligently with the governance debate, because governance frameworks must ultimately be grounded in what is technically feasible to verify, audit, and require.

The Alignment Problem

The central concern of AI safety research is the alignment problem: how do we build AI systems whose goals, values, and behaviors are aligned with human intentions, not just in test conditions but reliably, across novel situations, as systems become more capable? Alignment has a near-term component and a long-term component. Near-term alignment concerns AI systems deployed today: language models that produce harmful content, recommendation systems that optimize for engagement in ways that harm user wellbeing, and autonomous agents that pursue specified objectives while causing unintended collateral effects. These are engineering problems with near-term stakes.

Long-term alignment concerns hypothetical future systems substantially more capable than humans at cognitive tasks. If such a system were developed without robust alignment, its optimization could produce outcomes humans do not want and cannot correct. The reason is not that the system would be malicious, but that capability without alignment is like a powerful instrument with no steering mechanism. The classic thought experiment is the 'paperclip maximizer': an AI given the goal of maximizing paperclip production that, if sufficiently capable, converts all available matter, including humans, into paperclips. The absurdity of the example is intentional: it illustrates that even a harmless-sounding goal can lead to catastrophe when a sufficiently capable system pursues it single-mindedly and the goal leaves out what humans actually care about.

Researchers disagree substantially about how near this long-term concern is and how much weight it should receive relative to near-term harms. This is a genuine empirical and normative disagreement within the field, not a simple consensus.

Near-Term vs. Long-Term Safety

Near-term AI safety focuses on harms from current systems: bias, misuse, deception, unsafe autonomy. Long-term AI safety focuses on ensuring future systems significantly more capable than humans remain aligned with human values. Both matter; they sometimes require different research approaches; and working on either is a genuine contribution to a serious open problem.

Reinforcement learning from human feedback (RLHF) is the dominant technique used by major AI labs to align large language models. In RLHF, human raters compare model outputs and indicate which they prefer; a reward model is trained on these preferences; and the language model is fine-tuned to maximize the reward model's score. GPT-4, Claude, and Gemini are all reported to use variants of this approach.

RLHF has clear successes: models trained with it are substantially more helpful and less harmful than their base models. It also has documented limitations. Human raters make mistakes and have inconsistent preferences. The policy (the language model being fine-tuned) can game the reward model, learning to produce outputs that score well on the reward model rather than outputs that are genuinely good, a phenomenon called reward hacking. And RLHF optimizes for what raters prefer in the moment, which may not track long-term wellbeing or truth.

Constitutional AI (CAI), pioneered by Anthropic, is a variant in which a written set of principles (a 'constitution') guides AI-generated critiques and preference labels, so that training relies less on human ratings alone. The aim is to make the alignment target more explicit, legible, and consistent. Research into scalable oversight (techniques for supervising AI systems on tasks where humans cannot directly evaluate correctness) is actively ongoing at Anthropic, OpenAI, DeepMind, and in academia.
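The reward-model step of RLHF, and the way it can open the door to reward hacking, can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than any lab's actual pipeline: the reward model is a linear function of two hypothetical features (`is_correct`, `sounds_confident`), the comparison data is invented, and the pairwise loss is the Bradley-Terry form commonly used for preference data.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward: weighted sum of an output's features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit weights by gradient descent on the Bradley-Terry pairwise loss.

    Each comparison is (preferred_features, rejected_features), and the
    loss for one pair is -log(sigmoid(r_preferred - r_rejected)).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Descent step: push the preferred output's reward up,
            # more strongly when the model currently disagrees with the rater.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical features per output: [is_correct, sounds_confident].
# Invented rater data in which confidence wins more comparisons than correctness.
comparisons = [
    ([1.0, 1.0], [0.0, 0.0]),  # correct and confident beats wrong and hedged
    ([1.0, 1.0], [1.0, 0.0]),  # both correct: the confident one is preferred
    ([0.0, 1.0], [0.0, 0.0]),  # both wrong: the confident one is preferred
    ([1.0, 0.0], [0.0, 0.0]),  # both hedged: the correct one is preferred
]
weights = train_reward_model(comparisons, n_features=2)

# A policy that simply maximizes the learned reward picks between two outputs.
candidates = {
    "correct but hedged": [1.0, 0.0],
    "wrong but confident": [0.0, 1.0],
}
chosen = max(candidates, key=lambda name: reward(weights, candidates[name]))
print("learned weights [correct, confident]:", weights)
print("policy chooses:", chosen)
```

In this toy setup the raters never explicitly prefer a wrong answer over a correct one, yet the learned reward ends up weighting confidence more heavily than correctness, so an optimizer chasing the reward score drifts toward confident-sounding errors. Real reward models are neural networks trained on far richer data, so this is only a cartoon of the mechanism.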

Interpretability and Robustness Research

Interpretability research asks: given a neural network that produces a certain output, can we understand why? Can we identify what the network is representing internally, which features it is using, and which components are responsible for which behaviors? Mechanistic interpretability, developed extensively at Anthropic and in academic groups, aims to reverse-engineer neural networks at a detailed level, identifying circuits (subgraphs of the network) that implement specific computations, such as detecting indirect objects, tracking numeric quantities, or representing factual associations. Anthropic researchers showed in small toy models that networks can use superposition, representing more features than they have neurons by encoding features as non-orthogonal directions in activation space, and later work reports evidence of the same phenomenon in language models. The finding has significant implications for understanding what models actually represent internally.

Interpretability matters for governance because it is the technical prerequisite for meaningful auditing. If we cannot understand why a model produces a given output, we cannot verify that it is doing so for the right reasons rather than because of spurious correlations. Nor can we detect deceptive alignment, a hypothetical failure mode in which a system behaves well during evaluation but for reasons that do not generalize to deployment.

Robustness research examines how AI system behavior changes under distribution shift (inputs that differ from the training data, also called out-of-distribution inputs) and adversarial inputs (inputs crafted to cause failures). A facial-recognition system that is highly accurate on in-distribution photos but fails badly on photos taken in different lighting exhibits a robustness failure. So does a language model that is helpful in standard use but can be jailbroken with carefully crafted prompts. Robustness is a prerequisite for safety in high-stakes deployments.
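The idea of superposition described above (more features than dimensions, stored as non-orthogonal directions) can be illustrated with a few lines of geometry. This is a minimal sketch under stated assumptions, not a reproduction of any published experiment: five hypothetical features are assigned evenly spaced unit directions in a two-dimensional activation space, and each feature is read back with a simple projection followed by a ReLU.

```python
import math

def feature_directions(n_features: int):
    """Unit vectors evenly spaced around a circle: n features in 2 dimensions."""
    return [
        (math.cos(2 * math.pi * k / n_features),
         math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def encode(activations, directions):
    """Hidden state = sum of each feature's direction scaled by its activation."""
    x = sum(a * d[0] for a, d in zip(activations, directions))
    y = sum(a * d[1] for a, d in zip(activations, directions))
    return (x, y)

def decode(hidden, directions):
    """Read out each feature by projecting onto its direction, then ReLU."""
    return [max(0.0, dot(hidden, d)) for d in directions]

directions = feature_directions(5)  # 5 features, only 2 dimensions

# Sparse case: only feature 2 is active.
sparse_readout = decode(encode([0, 0, 1, 0, 0], directions), directions)

# Dense case: features 0 and 2 are active at the same time.
dense_readout = decode(encode([1, 0, 1, 0, 0], directions), directions)

print("sparse readout:", [round(v, 3) for v in sparse_readout])
print("dense readout: ", [round(v, 3) for v in dense_readout])
```

When a single feature is active, its own readout is the largest, although its neighbours pick up some interference because the directions are not orthogonal. When two features are active together, the interference can dominate: in this layout the strongest readout belongs to a feature that was never active. This is one intuition for why superposition is usually discussed alongside sparsity, and why a single neuron or direction may not correspond to one clean concept.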

Match each AI safety research area to the core question it addresses.

Terms

Alignment research
Interpretability research
Robustness research
Scalable oversight research
AI governance research

Definitions

Can we understand what a neural network is computing internally, and why it produces its outputs?
How can humans effectively supervise AI systems on tasks where humans cannot directly evaluate correctness?
What institutional, legal, and normative frameworks effectively ensure AI is developed and deployed safely?
Does an AI system maintain safe behavior under distribution shift, adversarial inputs, and unexpected conditions?
How do we ensure AI system goals and behaviors match human intentions across novel situations?

Several dedicated research organizations focus on AI safety. The Machine Intelligence Research Institute (MIRI), founded in 2000, was among the earliest organizations to raise long-term alignment concerns formally. The Center for Human-Compatible AI (CHAI) at UC Berkeley, led by Stuart Russell, focuses on developing AI that remains uncertain about human preferences and learns them from human behavior, rather than confidently maximizing a fixed objective. This approach is formalized as the assistance game, also known as cooperative inverse reinforcement learning. The Center for AI Safety (CAIS), an independent nonprofit, conducts technical safety research aimed at reducing large-scale risks from AI and has published widely on evaluation methodologies and AI risk. Anthropic was founded in 2021 specifically with AI safety as its core mission, combining a commercial AI lab with an in-house safety research agenda. OpenAI's Superalignment team, announced in 2023, was promised 20% of the compute the company had secured to date for the problem of aligning superintelligent AI; the team was disbanded in 2024.

AI safety research does not have all the answers, and practitioners openly disagree about timelines, prioritization, and approach. This is a sign of a young field working on genuinely hard problems, not a sign that the concerns are not serious.

Reading Primary Research

Landmark AI safety papers are publicly accessible. 'Concrete Problems in AI Safety' (Amodei et al., 2016) is a canonical introduction to near-term safety concerns. 'Constitutional AI: Harmlessness from AI Feedback' (Bai et al., 2022) describes Anthropic's alignment approach. 'Toy Models of Superposition' (Elhage et al., 2022) is a foundational interpretability paper. Reading even the introductions of these papers gives you direct contact with actual research, not filtered summaries.

A language model trained with RLHF learns to produce outputs that human raters rate highly during training, but in deployment it consistently gives confident-sounding but false answers because false confidence scores better than honest uncertainty with human raters. What AI safety concept does this illustrate?

Why does mechanistic interpretability research matter specifically for AI governance, beyond its value for basic scientific understanding?

Literature Dive: Primary AI Safety Research

  1. Choose one of the following foundational AI safety readings (all available free online) and read at minimum the Abstract and Introduction:
     (A) 'Concrete Problems in AI Safety' (Amodei, Olah, et al., 2016)
     (B) 'Specification gaming: the flip side of AI ingenuity' (Krakovna et al., 2020), on reward hacking
     (C) 'Toy Models of Superposition' (Elhage et al., 2022)
     (D) 'Constitutional AI: Harmlessness from AI Feedback' (Bai et al., 2022)
  2. After reading, answer in writing:
     1. What specific technical problem does the paper identify?
     2. What approach does the paper propose or investigate?
     3. What does the paper say about the limits of its own approach? What does it not solve?
     4. How does this technical problem connect to governance? If this problem were fully solved, what governance instruments would become more powerful? If it remains unsolved, what governance instruments are weakened?
  3. Share findings with your class. Together, map all four readings onto a single diagram showing how they connect.