Reliability and Hallucination
In 2023, a New York attorney filed a legal brief that cited six cases in support of a motion. The opposing counsel could not locate any of them. The attorney had used ChatGPT to help research the brief; the model had generated plausible-sounding case names, citations, and summaries for cases that did not exist. The attorney was sanctioned. This is hallucination: the generation of confident, fluent, internally consistent information that is factually wrong. Hallucination is not a quirk or an early-stage limitation. It is a structural property of the dominant class of AI language model, arising directly from how these systems are trained and how they generate output. Understanding it precisely — where it comes from, how it is measured, what mitigations exist and what their limits are — is essential knowledge for anyone deploying AI in a context where factual reliability matters.
Why Hallucination Is Structural
A large language model is trained to predict the next token in a sequence given all preceding tokens. At inference time, it extends a prompt token by token, each time sampling from a probability distribution over the vocabulary. The model is not retrieving information from a verified database; it is completing a pattern.
This has a critical consequence: the model's output is shaped by which patterns were common in the training data, not by whether those patterns correspond to true propositions about the world. A legal citation follows a pattern the model has seen many times: [Party v. Party, volume Reporter page (year)]. The model can generate that pattern fluently. It cannot check whether the generated instance corresponds to a real case.
This is why hallucination is called structural. It is not caused by a bug in the training pipeline or by an insufficient amount of data. It follows directly from the training objective: assign high probability to the token that actually comes next in the training text. Under that objective, a fluent, confident false statement can be just as likely as a fluent, confident true one, unless the training signal explicitly penalizes falsehood, which is very difficult to do at scale.
Instruction tuning and reinforcement learning from human feedback (RLHF) improve factual accuracy somewhat: human raters tend to prefer accurate outputs, so the model learns a preference for accuracy. But the underlying generation mechanism remains probabilistic token prediction. RLHF reduces the frequency of hallucination; it does not remove its structural source.
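The sampling step described above can be illustrated with a toy example. This is a minimal sketch, not how any production model is implemented: the scores in `toy_logits` are invented for illustration, and the function names are hypothetical. The point is only that the sampler sees scores, and nothing in those scores records which continuation is true.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def sample_next_token(logits, rng):
    """Draw one token in proportion to its probability."""
    probs = softmax(logits)
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for the token following "The company was founded in".
# Two years receive identical scores; the sampler has no notion of which
# one, if either, is correct.
toy_logits = {"1987": 2.0, "1992": 2.0, "1975": 0.5}

probs = softmax(toy_logits)
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_next_token(toy_logits, rng) for _ in range(1000)]

print(probs)
print({tok: draws.count(tok) for tok in toy_logits})
```

Because "1987" and "1992" have equal scores, they are drawn at roughly equal rates. A real model's scores come from training, so the true answer is often favored, but the mechanism that picks the token is the same.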
Language models predict what text is likely to come next. Likelihood is learned from statistical patterns in training data, not from a correspondence to ground truth. A well-formed false sentence can be just as statistically likely as a well-formed true one.
Taxonomy of Hallucination Types
Researchers distinguish several varieties of hallucination, each with different causes and mitigations.
Intrinsic hallucination occurs when the model generates output that contradicts its own context. A model given a document that says 'The company was founded in 1987' and then asked 'When was the company founded?' might answer '1992.' The error is entirely internal: the correct answer was supplied in the context.
Extrinsic hallucination occurs when the model generates information that the provided context neither supports nor contradicts; it simply invents. Fabricated citations are the canonical example: nothing in the context constrains the claim, so the model invents freely.
Factual hallucination involves claims about the world that are false regardless of context. The model might state that Marie Curie won both of her Nobel Prizes in Chemistry; in fact she won one in Physics (1903) and one in Chemistry (1911). Conflating the two is a factual error.
Faithfulness hallucination occurs in tasks like summarization: the summary introduces claims not present in the source document. Studies of model-generated news summaries found faithfulness errors in 5-30% of outputs depending on the model and the source domain.
The rate and type of hallucination vary by domain. Models hallucinate more frequently on obscure factual claims with few training examples, queries requiring precise numerical information, events after the training cutoff, and highly specialized professional knowledge (legal citations, medical dosages, mathematical proofs).
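For the 1987/1992 example above, a very crude automated check is possible: compare the years in the answer against the years in the supplied context. The sketch below is a heuristic under stated assumptions, not an established detection method. The function name `unsupported_years` and the year pattern are hypothetical choices. It flags any year in the answer that the context never mentions, so it cannot tell a contradiction (intrinsic) from an unsupported addition (extrinsic), and it misses errors that involve no year at all.

```python
import re

# Four-digit years from 1500 to 2099; an arbitrary range chosen for this sketch.
YEAR_PATTERN = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")


def unsupported_years(context, answer):
    """Return the years that appear in the answer but nowhere in the context."""
    context_years = set(YEAR_PATTERN.findall(context))
    answer_years = set(YEAR_PATTERN.findall(answer))
    return sorted(answer_years - context_years)


context = "The company was founded in 1987."

# An answer that contradicts the context is flagged.
print(unsupported_years(context, "The company was founded in 1992."))

# An answer consistent with the context produces no flags.
print(unsupported_years(context, "It was founded in 1987."))
```

A flagged year is a reason for a human to look, not proof of an error. An empty result is weaker still: it says only that no unfamiliar year appeared.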
Measuring and Mitigating Hallucination
Measuring hallucination is itself hard. The most rigorous method, human factual verification of every output claim, does not scale. Automated metrics like BERTScore compare outputs to reference texts semantically, but a model can hallucinate while staying semantically close to a reference. Newer approaches use one model to evaluate another's factual accuracy (LLM-as-judge), which introduces the risk that the evaluating model itself hallucinates.
The most widely deployed mitigation is Retrieval-Augmented Generation (RAG): at inference time, the system retrieves relevant documents from a verified corpus and includes them in the prompt. The model is then instructed to ground its answer in those documents. RAG substantially reduces hallucination on knowledge-intensive tasks because the answer is partially pre-specified: the model needs to extract and synthesize, not invent. However, RAG introduces its own failure modes: retrieval errors (the wrong document is fetched), faithfulness failures (the model ignores the retrieved document), and the fundamental gap between what is retrievable and what is true.
Another approach is uncertainty quantification: asking models to express confidence levels alongside their answers, or using ensemble methods to detect when models disagree. When a model says 'I am not certain, but...' and backs it up with genuine calibration, users can modulate their trust appropriately. Current models are poorly calibrated: their stated confidence often does not track their actual accuracy.
Chain-of-thought prompting, which asks the model to show its reasoning step by step, reduces certain categories of hallucination by forcing the model to commit to intermediate claims that can be checked. But models can also hallucinate coherent-looking reasoning chains that lead to wrong conclusions.
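The gap between stated confidence and actual accuracy mentioned above can be put into a number. One common summary is expected calibration error (ECE): group answers into confidence bins, then average the gap between mean confidence and observed accuracy in each bin, weighted by bin size. The sketch below is a minimal version assuming equal-width bins. The confidence values and correctness labels are invented for illustration, and obtaining the correctness labels still requires checking each answer against ground truth.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    # Assign each answer to an equal-width confidence bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model claims 90% confidence on four answers,
# but only two of them turn out to be correct.
stated = [0.9, 0.9, 0.9, 0.9]
was_correct = [True, False, False, True]
print(expected_calibration_error(stated, was_correct))
```

In this invented case the model states 90% confidence but is right half the time, so the error comes out at about 0.4. A well-calibrated model would score near zero, though a score near zero does not by itself mean the model is accurate.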
A model is given a document stating 'Project Alpha launched in Q3 2022' and then asked 'When did Project Alpha launch?' The model answers 'Q1 2023.' This is best classified as:
Retrieval-Augmented Generation (RAG) reduces hallucination. Which of the following is a limitation RAG does NOT solve?
Design a Hallucination Detection Protocol
- You are deploying an AI assistant to help paralegals research case law. Hallucination of legal citations could result in sanctions, lost cases, and harm to clients.
- Step 1: Identify three types of hallucination (from the taxonomy in this lesson) that are most likely to occur in this legal research context. For each, give a concrete example of what a hallucinated output would look like.
- Step 2: Design a detection workflow. For each hallucination type you identified, specify: (a) the detection method you would use, (b) who or what would perform the check, and (c) at what point in the workflow the check occurs.
- Step 3: Identify one hallucination type that your detection protocol cannot reliably catch. Explain why.
- Step 4: Write a one-paragraph policy for the law firm describing what the AI assistant is and is not permitted to do without mandatory human verification.
- Discuss: Is there any use of AI in legal research that could be permitted without case-by-case hallucination checking? What would that use look like?