Managing Hallucinations

A language model is a plausibility engine. It predicts the next word, optimizing for what sounds right rather than what’s true — the hard landing of the first pillar. Most of the time those two overlap and you never notice. This pillar is about the times they come apart.

When a model states something false in the same fluent, even voice it uses for everything else, we call it a hallucination. The word makes it sound like a glitch. It is closer to a side effect: the model is doing exactly what it was built to do, producing the most probable-looking continuation of your text. When the most probable-looking words happen to be false, you get a hallucination — not because anything broke, but because probable and true were never the same target.

That reframing changes the job in front of you. There is no line of code that holds the lie, no patch that removes it, because the thing producing it is the core mechanism, not a defect bolted onto the side. You do not fix a hallucination. You manage it. The rest of this pillar is about how — and about why a one-percent error rate can be perfectly fine in one place and a lawsuit in another.

Why a confident machine invents things

Two forces drive hallucinations: prediction weakness and context collapse. Take them in turn.

Prediction weakness. On a topic with deep, consistent coverage in the training data, the next-word prediction is strong and the answer is usually right. Push toward the edges — a niche field, a person with a thin paper trail, a fact more recent than the training cut-off — and the patterns thin out. The model still has to emit a token. With nothing solid to lean on, it produces whatever best fits the shape of an answer: a citation in the exact format of a real one, a number with a plausible count of digits. (Pillar 1’s lawyers learned this the hard way, with a federal brief full of cases ChatGPT had invented.) The quality of the answer rises and falls with the strength of the patterns underneath it, and the interface gives you no way to see which regime you’re in.

No internal truth signal. This is what makes hallucination dangerous rather than merely annoying. The model’s tone does not change when it crosses from fact into fiction. There is no flicker of doubt, no hedge, no internal meter reading "low confidence, verify this." It states an invented policy with the same calm certainty it uses to quote a real one. Confidence, in these systems, is a property of the writing, not a measure of the facts.

When the context collapses

The second force shows up in long sessions. A model works inside a finite context window and a finite ability to keep the thread, and as the input grows (a sprawling conversation, a stack of retrieved passages), earlier details get crowded out and blurred. Researchers have a name for one version of this: models get lost in the middle, reading the start and end of a long context far more reliably than everything buried in between (Liu et al., 2023). The model starts answering from a smeared version of what you actually gave it, conflating two sources or dropping the constraint you set fifteen messages ago.

The longer the session runs, the more the ground shifts. A model that was accurate in the first exchange can drift into confident nonsense by the twentieth — not because anything broke, but because it is reconstructing your context from a fading impression instead of reading it fresh each time.

And none of this hits every model equally. Bigger, newer models hallucinate noticeably less on well-covered facts than smaller, older ones — the patterns they learned are denser and stronger. But less is not none, and context collapse barely improves with scale. If anything it gets more tempting: a larger context window invites you to stuff more in, and the model still rots over a long, crowded context no matter how good it is. Two practical consequences. Don’t assume a hallucination rate you measured on one model carries over to another. And don’t design around next year’s model quietly retiring the problem — it won’t.

The four shapes a hallucination takes

It is not one failure but several. These are the ones worth recognizing on sight.

Fabricated facts & citations

Invents a source, a quote, a statistic, or a legal case with the exact shape of a real one. The format is perfect; the thing it points to doesn't exist.

Faulty reasoning chains

Lays out steps that sound logical and land on a wrong answer. The conclusion inherits the confidence of the steps, so the error hides in plain sight.

Confident incorrectness

States the wrong thing in exactly the voice it uses for the right thing. No hedge, no flag — which is what makes it so easy to accept.

Fabricated lists

Produces an authoritative-looking list and pads it with items that don't belong, because a fuller list is the more plausible-looking shape.

It stops being abstract the moment one of these costs someone money.

In late 2022, a man arranging travel for his grandmother’s funeral asked Air Canada’s website chatbot about bereavement fares. The bot told him he could claim the discount retroactively, after the flight — and linked, in the same breath, to a policy page that said the opposite. He booked on the bot’s advice, the airline refused the refund, and a tribunal sided with the customer, holding Air Canada liable for what its chatbot had said.

That case — Moffatt v. Air Canada, 2024 BCCRT 149 — is the whole pillar in one receipt. The tribunal rejected Air Canada’s argument that the chatbot was a separate entity answerable for its own words, which is the line every leader should underline: you are liable for what your model says. The bot wasn’t hacked and nothing malfunctioned. It produced a plausible-sounding policy because a plausible-sounding policy was the most probable thing to say, and the company owned the gap between plausible and true.

You contain it, you don’t cure it

If you can’t patch out the mechanism, you build around it. Four layers do most of the work, and the more of them you stack, the lower the risk drops — though never to zero.

Stack the layers — each one lowers the rate, none removes it

Layer 1

Ground it

Give the model verified text to predict over — retrieval, a knowledge base, an ontology. It shrinks the gap where invention happens.

Layer 2

Constrain it

Limit what it can even say — structured outputs, schemas, allowed-value lists. A model that can only return a real SKU can't invent a fake one.

Layer 3

Check it

Put a verifier at the handoff — a rule, a second model, or a human where the stakes are high. This is the Action step of the CIDA lens.

Layer 4

Measure it

You can't manage what you don't measure. Track the rate against a fixed test set and treat a regression like any other broken test.
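As a rough illustration of Layer 2, the sketch below accepts a model reply only if it parses as JSON and names a SKU from an allow-list. The SKU values, the `sku` field, and the function name are hypothetical, invented for this example. Structured-output features that constrain generation itself are a stronger form of the same idea; this sketch shows only the simpler after-the-fact rejection.

```python
import json

# Hypothetical catalog; in a real system this would come from your product database.
ALLOWED_SKUS = {"TD-1001", "TD-1002", "TD-2001"}


def parse_sku_reply(raw_reply: str) -> str:
    """Return the SKU from a model reply, or raise ValueError if it is not a known one."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply is not valid JSON") from exc
    # Anything other than a JSON object with a string "sku" field is rejected.
    sku = data.get("sku") if isinstance(data, dict) else None
    if not isinstance(sku, str) or sku not in ALLOWED_SKUS:
        raise ValueError(f"model reply does not name a known SKU: {raw_reply!r}")
    return sku


print(parse_sku_reply('{"sku": "TD-1002"}'))
```

An invented SKU, free text, or a differently shaped reply all fail loudly at this boundary instead of reaching the customer.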

None of these are prompt tricks. They are engineering — the Decision-procedure step from the first pillar, the part you actually build. Containing hallucination well is one of the highest-leverage things a data team can do with these tools, and it is most of what separates a demo that impresses from a system you can put in front of a customer.
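To make the Measure layer concrete, here is a minimal sketch of scoring a model against a fixed test set and failing when the rate regresses. The test cases, the canned answers standing in for a real model call, and the exact-match scoring are all assumptions for illustration; real evaluations usually need a more forgiving comparison than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    question: str
    expected: str


def error_rate(cases: list[Case], answer_fn: Callable[[str], str]) -> float:
    """Fraction of cases where the answer does not match the expected string."""
    if not cases:
        raise ValueError("test set is empty")
    wrong = sum(
        1
        for case in cases
        if answer_fn(case.question).strip().lower() != case.expected.strip().lower()
    )
    return wrong / len(cases)


def assert_no_regression(rate: float, baseline: float) -> None:
    """Fail like any other broken test if the rate rose above the recorded baseline."""
    if rate > baseline:
        raise AssertionError(f"error rate {rate:.2%} exceeds baseline {baseline:.2%}")


# Hypothetical fixed test set, plus canned answers standing in for the model.
TEST_SET = [
    Case("Capital of France?", "Paris"),
    Case("Capital of Japan?", "Tokyo"),
    Case("Capital of Canada?", "Ottawa"),
    Case("Capital of Australia?", "Canberra"),
]
CANNED = {
    "Capital of France?": "Paris",
    "Capital of Japan?": "Tokyo",
    "Capital of Canada?": "Ottawa",
    "Capital of Australia?": "Sydney",  # deliberately wrong answer
}

rate = error_rate(TEST_SET, lambda question: CANNED[question])
print(rate)
assert_no_regression(rate, baseline=0.25)
```

Keeping the test set fixed is what makes the number comparable from one model or prompt version to the next.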

This is the work we do. The containment layer — context retrieval that grounds a model in your own verified data, the prompt and guardrail engineering that keeps it on task, the evaluation that proves the rate is actually falling — is the core of what TwiceData builds for clients. We draw on both the proven industry standards and the techniques still coming off the bleeding edge, and tune the mix to what your stakes genuinely require. A model you hope is right and a model you can defend are separated almost entirely by this layer.

Step back to the lens. All of that containment converges on one bucket: the Action step, the handoff where the output meets the world and something has to check it before it acts. That handoff is where this whole pillar lives.

Context

Where it's used — which sets how much a wrong answer costs.

Inputs

What data grounds the model.

Decision procedure

The model and the guardrails around it.

Action

What happens to the output. The handoff is where a hallucination is caught — or isn't.

This pillar lives here

Code is a special case

Hallucination in code behaves differently from hallucination in prose, and it’s worth understanding why — because more and more of the generative AI a business actually ships is code.

The good news first. In code, a lot of hallucinations announce themselves. A made-up function, a wrong API signature, a field that violates a schema, a stray bit of invalid syntax: these don’t slip out silently the way an invented refund policy does. They throw an error. A strictly typed language or a schema is, in effect, a free verifier: the compiler or validator catches much of what the model invented. It is also what makes the agentic loop work. The model writes code, runs it, reads the error, and tries to patch its own mistake without a human in the seat.
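A small sketch of that free verifier, assuming the generated code is Python and calls the standard `math` module: the snippet is parsed without being run, and any `math.<name>` reference that the real module does not define is reported. The generated snippet and the helper name are made up for this example, and the check covers only this one module.

```python
import ast
import math


def unknown_math_calls(source: str) -> list[str]:
    """Return attribute names used as math.<name> that the math module lacks."""
    tree = ast.parse(source)  # raises SyntaxError if the code is not valid Python
    missing = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "math"
            and not hasattr(math, node.attr)
        ):
            missing.append(node.attr)
    return missing


# Hypothetical model output: math.sqrt is real, math.cuberoot is invented.
generated = "import math\nresult = math.sqrt(16) + math.cuberoot(27)\n"
print(unknown_math_calls(generated))  # ['cuberoot']
```

Invalid syntax surfaces even earlier, as a `SyntaxError` from `ast.parse`, before any of the code is executed.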

The catch lives in that loop. When a model hallucinates, fails, re-hallucinates, and tries again, it can spin, burning tokens and compute chasing its own tail. A hallucination in a chat costs you a wrong sentence. A hallucination inside an autonomous coding agent can cost you a whole run: minutes or hours of compute, a stack of API calls, and a fix that may still be subtly wrong. Writing code with generative AI feels easier than ever on the surface; the cost just moved — out of the typing and into the debugging loop, where a confident wrong turn quietly runs up the bill. That meter is Pillar 5’s subject, and we’ll get to it.
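One common way to keep that loop from spinning is a hard cap on attempts. The sketch below is a minimal version under stated assumptions: `generate` stands in for a model call that receives the previous error as feedback, the check is only a syntax check, and the cap of three is arbitrary. All names are hypothetical.

```python
from typing import Callable, Optional


def syntax_error_in(code: str) -> Optional[str]:
    """Return a short error message if the code does not compile, else None."""
    try:
        compile(code, "<generated>", "exec")  # compiles only; nothing is executed
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"
    return None


def run_with_budget(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Retry generate-and-check, feeding errors back, until success or the cap."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate, attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts; last error: {feedback}")


# Stand-in for a model: a broken first draft, then a corrected one.
drafts = iter(["x = (1 + 1", "x = 1 + 1"])
code, attempts = run_with_budget(lambda feedback: next(drafts), syntax_error_in)
print(code, attempts)
```

When the cap is hit the run stops with an error a human can see, rather than continuing to spend compute on the same wrong turn.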

When the answer is a number

Here is a danger that hides in plain sight, because numbers look like facts. A model predicts the next token even when that token is a digit. Ask it for 2 plus 2 and it will tell you 4 — not because it computed anything, but because 4 is overwhelmingly the most probable token to follow 2 + 2 = in everything it ever read. The arithmetic was never run. The answer was predicted.

That holds until it doesn’t. Multiply two large numbers, total a column of figures, or work out a statistic, and the model is still predicting a plausible-looking number rather than calculating the right one — and it will hand you a confident, wrong figure in the same even voice as everything else. The risk is sharpened by how numbers read: a figure in a report looks computed, authoritative, settled. Nobody re-derives it.

There is a second, quieter failure that even a correct number doesn’t escape. Once a figure is in the conversation, the model doesn’t store it. Every time it restates that number — in a summary, a follow-up, a paragraph further down — it is predicting the digits again, not copying them. So a value that was right the first time can drift the second: $1.24M in the input quietly becomes $1.42M in the recap, with no error and nothing to mark that it moved. The number was never held. It was re-guessed. This is context collapse wearing its most expensive disguise, because the output still looks precise. Formatting is the same trap by another door: ask the model to reformat a date, round a figure to two decimals, or switch a number from European to American notation, and each of those is one more prediction that can land wrong while looking perfectly tidy.
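A drift like that can be caught mechanically. The sketch below is a rough check: it pulls every number out of the model's output and flags any that never appeared in the source. The example sentences are invented around the $1.24M figure above, and the comparison is on strings, so a legitimately reformatted number (1.240 versus 1.24, say) would also be flagged.

```python
import re

# Matches digit runs with optional thousands commas and a decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text: str) -> set[str]:
    """Extract numeric tokens as strings, with thousands commas removed."""
    return {match.group().replace(",", "") for match in NUMBER.finditer(text)}


def unsupported_numbers(source: str, output: str) -> list[str]:
    """Numbers that appear in the output but nowhere in the source."""
    return sorted(numbers_in(output) - numbers_in(source))


source = "Revenue for Q3 was $1.24M, up 8% from Q2."
recap = "Q3 revenue came in at $1.42M, an 8% rise."
print(unsupported_numbers(source, recap))  # ['1.42']
```

A non-empty result does not prove the recap is wrong, but it marks exactly the figures a reviewer should re-check.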

The fix is to take the math away from the model — and the memory of the number with it. Tool-calling and code-execution models are far better here precisely because they don’t try: they hand the calculation to actual code (a Python interpreter, a SQL query, a calculator) and predict only the prose around the result. For any number that has to be right — a financial figure, an average, a statistic a decision rides on — generate it with real code, keep the canonical value in that code or your database, and drop it into the output verbatim. Let the model explain the number. Never let it compute the number, reformat it, or be the thing that remembers it.
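In code, that division of labor can look like the sketch below: the total is computed and formatted by Python, and the model contributes only prose with a placeholder. The line items, the `{total}` placeholder convention, and the sentence are hypothetical; a production system would also need to handle prose that contains stray braces.

```python
from decimal import Decimal

# Hypothetical line items, held as Decimal so the sum is exact rather than floating point.
line_items = [Decimal("412300.50"), Decimal("388120.25"), Decimal("439579.25")]

total = sum(line_items, Decimal("0"))  # computed by code, never by the model
formatted_total = f"${total:,.2f}"     # formatting is done in code as well

# Hypothetical model output: prose with a placeholder where the figure belongs.
model_prose = "Quarterly revenue came to {total}, spread across three regions."

report = model_prose.format(total=formatted_total)
print(report)
# Quarterly revenue came to $1,240,000.00, spread across three regions.
```

The model never sees a digit it is expected to reproduce, so there is nothing for it to re-guess.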

In practice that means a single, validated source of truth for your figures — a table, an API, a metrics layer — that the system looks up every time it needs to cite one, rather than letting the model recall it from the conversation. Where you can’t enforce that yet, the floor is awareness: everyone touching the output should know that any number the model restates was re-predicted, not retrieved, and treat it that way.
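A minimal sketch of such a lookup, using an in-memory SQLite table as a stand-in for the validated source of truth. The table layout, metric names, and values are assumptions for illustration; the point is only that every citation is a fresh read and that an unknown metric fails instead of being improvised.

```python
import sqlite3


def build_metrics_store() -> sqlite3.Connection:
    """Create a tiny in-memory table of validated figures (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metrics (name TEXT PRIMARY KEY, display_value TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO metrics (name, display_value) VALUES (?, ?)",
        [("q3_revenue", "$1.24M"), ("q3_growth", "8%")],
    )
    return conn


def cite(conn: sqlite3.Connection, name: str) -> str:
    """Look the figure up every time; raise rather than guess if it is missing."""
    row = conn.execute(
        "SELECT display_value FROM metrics WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no validated metric named {name!r}")
    return row[0]


store = build_metrics_store()
print(cite(store, "q3_revenue"))  # $1.24M
```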

It is one of the most under-audited risks in a live generative AI system: teams scrutinize the wording and trust the numbers, when it should be the other way around.

Questions a skeptic asks

The honest pushback, answered straight.

Won't the next model just stop hallucinating?

Each generation hallucinates less on common ground and still fails at the edges, because the mechanism is prediction itself, not a bug a new version patches out. Build for it now; don't design around a model that makes the problem disappear.

Can't I just tell it not to make things up?

Those ultimatums — "do not make a mistake," "a life depends on this," even bribes and threats — are prompt engineering at its most naive: an emotional plea trying to motivate the model into being right. Real prompt engineering is a craft, not a plea — a deliberate system prompt that fixes the model's role and the shape of what it returns. It is genuinely useful, and it still doesn't install a sense of truth. The best-built prompt shapes how the model answers, never whether the answer is true — so it trims sloppiness but can't replace verification.

If it cites a source, isn't the answer safe?

Models invent citations too, and even a real, retrieved source gets misquoted. A citation is a place to check, not proof that the check passed.
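One inexpensive version of that check, sketched under assumptions: when the model quotes a retrieved passage, confirm the quoted words actually occur in it. The passage and quotes below are invented, and this catches only misquotation, not a real quote used to support the wrong conclusion.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not cause false alarms."""
    return " ".join(text.split()).casefold()


def quote_appears_in(quote: str, source_text: str) -> bool:
    """True if the quoted words occur verbatim (modulo spacing and case) in the source."""
    return _normalize(quote) in _normalize(source_text)


# Hypothetical retrieved passage and two quotes attributed to it.
source_text = """Refund requests must be submitted
before travel is completed."""

print(quote_appears_in("must be submitted before travel is completed", source_text))  # True
print(quote_appears_in("may be submitted after travel is completed", source_text))    # False
```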

Isn't a low hallucination rate good enough?

It depends entirely on the cost of a single one. One percent is fine for first-draft copy and unacceptable for a dosage, a contract clause, or a refund policy. The acceptable rate is set by the Context, not by the model.

We're using it internally, not customer-facing — does it matter?

Internal hallucinations don't make the news, but they still poison decisions: a wrong number in a board deck, an invented clause in a policy memo, a fabricated stat that quietly becomes someone's target. The blast radius is just harder to see.

Aren't the reasoning models more trustworthy because they show their work?

The visible steps help on hard problems, and they give you something to audit, which is genuinely useful. But the steps are still predicted, and a confident, well-formatted chain of reasoning can walk you politely to a wrong answer. Read the steps; don't just trust that they exist.

What we will not claim

Hallucination is basically a solved problem now.

It is measurably better than it was, and still fundamental to how these models work. Every serious deployment budgets for it. Treating it as solved is how the expensive surprises happen.

A confident answer is a reliable answer.

Confidence is a feature of the writing, not a signal of truth — the tone is identical whether the model is right or inventing. The fluent, certain answers deserve more scrutiny, not less.

Where this goes next

The most reliable way to hold hallucination down is to ground the model in your own verified data — which raises the obvious next question. Before you put that data in front of a generative AI, is it safe to? Part 3, Is my data safe?, is about exactly that: privacy, security, and where your data is allowed to live and be processed in the first place.

Whether you already have generative AI in production and want to know how much it’s actually inventing, or you’re scoping a first build and want to design the containment in from the start, measuring and lowering the hallucination rate is concrete, engineerable work, and a free 60-minute call is a good place to start. For how this fits the data and AI work we deliver, see our approach to AI engagements.
