AI Foundations

⏱ About 20 min · 20 XP

Evaluating LLM Output

The previous eight lessons have given you a rigorous foundation: you know what language models are, how they are built, what they can genuinely do, and where they reliably fail. This lesson brings all of that together into a practical discipline: how do you evaluate an LLM's output? How do you decide what to trust, what to verify, and when not to use an LLM at all? These are professional skills — the same ones that distinguish a sophisticated AI user from one who is vulnerable to the technology's failure modes.

A Framework for Evaluation

Evaluating LLM output is not about finding errors after the fact; it is a systematic practice that begins before you even write the prompt. The framework has three stages: before, during, and after.

Before: Assess the task. For each task you are considering delegating to an LLM, ask:
- Is this task well-matched to LLM capabilities? (Text generation, transformation, synthesis, brainstorming: generally yes. Exact arithmetic, real-time data, verified citations: generally no.)
- How serious are errors? (Drafting a blog post: errors are easy to catch and fix. Summarizing a medical document for a clinical decision: errors could be dangerous.)
- Do I have the background to verify the output? (If you cannot evaluate whether an LLM's legal analysis is correct, you should not rely on it without independent expert review.)

During: Monitor as you work. Read each LLM response critically:
- Does it actually answer the question you asked, or does it answer an adjacent, easier question?
- Are there specific factual claims (numbers, names, dates, citations) that need verification?
- Does the reasoning chain hold together, or does a step slip in that does not follow from the previous one?
- Is the model expressing appropriate uncertainty, or is it stating uncertain things with false confidence?

After: Verify before acting. Any LLM output you intend to act on (publish, submit, use to make a decision) should go through explicit verification of its specific claims.
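The three "before" questions can be sketched as a small decision helper. This is only an illustration of the reasoning, not an official rubric: the names `TaskAssessment` and `recommend`, and the wording of the four outcomes, are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class TaskAssessment:
    """Hypothetical record of the three 'before' answers for one task."""
    well_matched: bool  # suits LLM strengths (drafting, transformation, synthesis)
    high_stakes: bool   # an error would be serious (e.g. a clinical decision)
    can_verify: bool    # you have the background to check the output


def recommend(task: TaskAssessment) -> str:
    """Turn the three answers into a rough recommendation (illustrative only)."""
    if not task.well_matched:
        return "use another tool"
    if task.high_stakes and not task.can_verify:
        return "get independent expert review"
    if task.high_stakes:
        return "use with rigorous verification"
    return "use as a draft and review"


# Drafting a blog post: well-matched, low stakes, easy to check.
blog_post = TaskAssessment(well_matched=True, high_stakes=False, can_verify=True)
# Legal analysis read by a non-lawyer: high stakes, cannot be checked.
legal_memo = TaskAssessment(well_matched=True, high_stakes=True, can_verify=False)

print(recommend(blog_post))
print(recommend(legal_memo))
```

The order of the checks mirrors the order of the questions: capability match first, then the combination of stakes and your ability to verify.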

The Capable-but-Unreliable-Collaborator Model

The best mental model for an LLM is a capable but unreliable collaborator: brilliant at drafting and brainstorming, often correct on general knowledge, but prone to confident errors on specifics, terrible at arithmetic without tools, and unable to access information beyond its training. You would not blindly accept everything such a collaborator said — you would treat their work as a strong starting point that requires your critical review.

Specific verification strategies:
- Fact-checking specific claims: Any specific factual claim (a statistic, a historical date, a scientific finding, a legal rule, an attribution) should be verified against a primary or reliable secondary source. Do not use a different LLM query as verification; that LLM may hallucinate the same false fact. Use authoritative external sources.
- Code verification: LLM-generated code should be tested, not just read. Code that looks correct may have edge-case bugs, security vulnerabilities, or deprecated API calls. Run it. Test it with inputs it might fail on. Read it line by line for logic errors the tests might miss.
- Mathematical verification: Do not trust LLM arithmetic. Verify every computation with a calculator or by hand. For more complex mathematical reasoning, check the logic of each step independently.
- Cross-checking with multiple sources: For general factual claims, verifying against two or three independent authoritative sources provides much stronger confidence than accepting LLM output alone. If multiple sources agree and the LLM's claim matches, the risk of hallucination is lower. If sources conflict or the LLM's claim is not corroborated, treat it as unverified.
- Knowing your own domain: The most powerful verification tool is your own expertise. An expert reading an LLM's medical explanation can spot implausibilities that a non-expert would miss entirely. This is why domain knowledge remains essential even in an age of powerful LLMs: the LLM's output needs a knowledgeable human in the loop.
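The code and arithmetic strategies can be illustrated with a short sketch. Everything here is hypothetical: `average` stands in for a function an LLM might produce, `run_edge_cases` and `matches_claim` are helper names invented for this example, and the sales figures are made up.

```python
import math


def average(values):
    """Hypothetical LLM-generated function: looks correct on a quick read."""
    return sum(values) / len(values)


def run_edge_cases(func):
    """Call func on inputs it might fail on and record what happens."""
    cases = {
        "typical": [1, 2, 3],
        "single value": [5],
        "empty list": [],          # a likely edge-case failure
        "mixed int/float": [1, 2.5],
    }
    results = {}
    for name, data in cases.items():
        try:
            results[name] = func(data)
        except Exception as exc:   # record the failure instead of stopping
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def matches_claim(claimed, recomputed, tolerance=0.01):
    """Compare a number quoted by the LLM with one you computed yourself."""
    return math.isclose(claimed, recomputed, abs_tol=tolerance)


results = run_edge_cases(average)
for name, outcome in results.items():
    print(f"{name}: {outcome}")

# Made-up claim: "sales rose from 240 to 300, an increase of 20%."
recomputed = (300 - 240) / 240 * 100
print("claim holds:", matches_claim(20, recomputed))
```

Running this shows that `average` works on ordinary input but raises an error on an empty list, and that the quoted 20% does not survive recomputation (the actual increase is 25%). Reading the code alone would likely have missed the first problem.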

When Not to Use an LLM

Not every task is appropriate for LLM assistance. Recognizing when not to use one is as important as knowing how to use one well.
- High-stakes factual claims without verification infrastructure: If you need accurate, current, specific factual information for a decision where errors have serious consequences (medical, legal, financial), an LLM alone is not an appropriate tool. Use it only if you have the expertise and resources to verify its output rigorously.
- Private or sensitive information: Many LLM services send your prompts to external servers. Pasting confidential business information, personal medical records, or private communications into a commercial LLM may violate privacy obligations or expose sensitive data.
- Tasks where you cannot evaluate the output: If you do not have the domain knowledge to assess whether the LLM's output is correct, and the stakes of being wrong are meaningful, do not use the LLM as your primary source. You cannot catch what you cannot recognize.
- Tasks requiring exact reproducibility: LLM outputs are stochastic, so the same prompt may produce different outputs on different runs. For tasks where you need a guaranteed identical output every time (like a legal template that must be verbatim), LLMs are the wrong tool.
- When attribution is required: LLM output is not citable as a source. If your work requires citing sources (academic papers, journalism, professional reports), the LLM can help you draft, but the underlying claims must be grounded in citable human sources.
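For the reproducibility case, a deterministic template is one possible alternative to an LLM. The sketch below uses Python's standard `string.Template`; the notice wording, the field names, and the `render_notice` helper are assumptions made up for illustration.

```python
from string import Template

# Hypothetical fixed wording that must appear verbatim every time.
NOTICE = Template(
    "This agreement is between $party_a and $party_b, effective $date."
)


def render_notice(party_a: str, party_b: str, date: str) -> str:
    """Fill in the blanks; the surrounding wording never changes."""
    return NOTICE.substitute(party_a=party_a, party_b=party_b, date=date)


first = render_notice("Acme Ltd", "Jane Doe", "1 March 2025")
second = render_notice("Acme Ltd", "Jane Doe", "1 March 2025")

print(first)
print("identical on every run:", first == second)
```

Unlike a prompt sent to an LLM, the same inputs here always yield the same text, character for character.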

You Are the Expert in the Loop

LLM assistance does not transfer responsibility. If you use an LLM to help write a report, draft an email, or analyze a document, you are responsible for the accuracy and appropriateness of the final output. Saying 'the AI wrote it' is not an acceptable defense when the output contains errors. The LLM is a tool; the judgment is yours.

Match each evaluation situation to the correct action.

Terms

LLM provides a specific statistic with a percentage figure
LLM generates Python code for a data processing task
LLM produces a multi-step argument with a logical conclusion
LLM drafts a first paragraph of an essay
LLM is asked for today's stock price

Definitions

Do not use LLM for real-time data; use a live financial data source
Test the code on real inputs, including edge cases, before deploying
Use it as a starting point and revise for accuracy and voice
Check each step of the reasoning independently, not just the conclusion
Verify against an authoritative primary source before using it


You ask an LLM to summarize a 20-page research paper and it produces a clear, well-organized summary. What is the appropriate next step before using this summary in your own work?

A classmate argues: 'I can tell when the LLM is confident because it uses specific numbers and detailed examples.' Based on this module, why is this reasoning flawed?

Build a Verification Protocol

You are advising a student newspaper that wants to use LLMs to help with research and drafting.

  1. Identify four specific tasks the newspaper staff might use an LLM for. Think across the production process: research, writing, editing, headline generation, photo captioning, etc.
  2. For each task, assess: (a) how well-matched is it to LLM capabilities? (b) what are the risks if the LLM makes an error? (c) what verification steps are needed before the output is used?
  3. Write a one-page 'LLM Use Policy' for the newspaper. It should specify: which tasks LLMs may assist with, what verification is required for each, and what tasks LLMs may not be used for without additional oversight. Write it in plain, actionable language that a high-school journalist could follow.
  4. Exchange your policy with a classmate. Identify one gap or ambiguity in each other's policies: a scenario the policy does not clearly handle. Revise accordingly.