Two pillars ago, the cost of a hallucinating coding agent quietly running up a compute bill got a name: this pillar. A pillar later, the price of putting humans at the right handoffs got the same deferral. Here is where the bill comes due.
When a team budgets for generative AI, they budget for the thing with a price tag: the per-token API fee. It is the number on the pricing page — the easiest cost to see, and the easiest to forecast. It is also, on almost every real system, the most visible cost rather than the largest one. The expensive parts don’t come off a pricing page. They show up as the work you do around the model, and they are the reason a project that looked cheap in the demo turns out to cost several times the quote in production.
This pillar opens the whole bill. We start with the token fee itself — because even the visible number is bigger than the pricing page admits — then widen out to everything stacked behind it. The last word goes to the one decision, rent or buy, that moves the total more than anything else.
Pillar 1’s lens has a property nobody wrote into the acronym: every bucket costs money. Unlike the other pillars, this one doesn’t live in a single square — it taxes all four.
The work of deciding whether the use case is even worth the spend.
Data prep, the retrieval pipeline, the knowledge base.
The model, the prompt and guardrail engineering, evaluation, and rework.
Human oversight at every consequential handoff.
The token bill, line by line
Start with the visible number, because even it has a floor most people miss. A token is roughly three-quarters of a word. You pay for two streams at two different rates: the tokens you send in — your prompt, plus everything you load into its context — and the tokens the model sends back. Output is the expensive direction, usually four to six times the input rate, so a one-line question that pulls back a long answer costs far more than its length suggests.
Here is the spread you are actually choosing between.
| Model class | Input · $/M tokens | Output · $/M tokens | What you reach for it for |
|---|---|---|---|
| Small / fast | $0.10 – $1 | $0.40 – $5 | High-volume, simple calls — classification, extraction, routing |
| Mid / workhorse | ~$3 | ~$15 | Most production work |
| Frontier reasoning | $5 – $30 | $25 – $180 | The genuinely hard problems |
Market rates, June 2026. "$/M" is US dollars per million tokens; a million tokens is roughly 750,000 words. The spread is the headline: the model you reach for on the hard problems can cost more than two orders of magnitude beyond the one you reach for on the easy ones. Write the date down — by the time you read this, every number above will have moved.
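The two-stream billing described above comes down to one line of arithmetic. Here is a minimal sketch in Python, assuming the mid-tier rates from the table; the function name `call_cost` and the example token counts are illustrative and not taken from any provider's SDK.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Dollar cost of one call; both rates are US dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed mid-tier rates from the table above: ~$3 in, ~$15 out.
MID_IN, MID_OUT = 3.0, 15.0

# Illustrative sizes: a 50-token question that pulls back a 1,500-token answer.
prompt_cost = call_cost(50, 0, MID_IN, MID_OUT)
answer_cost = call_cost(0, 1_500, MID_IN, MID_OUT)

print(f"prompt ${prompt_cost:.5f}, answer ${answer_cost:.5f}")
print(f"the answer costs about {answer_cost / prompt_cost:.0f}x the prompt")
```

Swap in the rates for the model class you are actually considering; the shape of the result, with output dominating, holds across the table.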
The tokens you never see
A reasoning model "thinks" before it answers, and that thinking is made of tokens — billed at the output rate, and never shown to you. A question that returns 500 visible tokens can quietly burn several thousand behind the scenes. In practice the all-in cost of a reasoning model often lands three to nine times its headline output price (mid-2026). The meter runs on words you will never read.
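To see how hidden reasoning inflates the visible line, here is a small sketch. The 500 visible tokens come from the paragraph above; the 3,000 hidden tokens and the $25 output rate are assumptions picked for illustration, since real counts vary by model and by question.

```python
def output_cost(visible_tokens: int, hidden_tokens: int, output_rate: float) -> float:
    """Output-side cost in dollars; hidden reasoning tokens bill at the output rate."""
    return (visible_tokens + hidden_tokens) * output_rate / 1_000_000


OUTPUT_RATE = 25.0   # assumed: low end of the frontier output range, $/M tokens
VISIBLE = 500        # tokens you actually read
HIDDEN = 3_000       # hypothetical count of reasoning tokens you never see

headline = output_cost(VISIBLE, 0, OUTPUT_RATE)
all_in = output_cost(VISIBLE, HIDDEN, OUTPUT_RATE)
multiplier = all_in / headline

print(f"headline ${headline:.4f}, all-in ${all_in:.4f}, {multiplier:.0f}x the sticker")
```

With these assumed counts the multiplier is 7, inside the three-to-nine range quoted above. The only way to know your own multiplier is to read the token usage your provider reports per call.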
And there is a comforting story worth puncturing: that tokens only ever get cheaper. It is half true. Last year’s frontier model does get cheaper as it ages into the mid-tier — sometimes dramatically. But each new flagship lands at the top of the price range, not the bottom, because the capability you actually want is always the newest, and the newest is always the most expensive to serve. Two things happen at once: the floor keeps dropping while the ceiling you keep reaching for keeps rising. If your use case needs the best model on the day, your unit cost trends up with each release — not down.
What’s actually on the bill
Now zoom out. Strip the token fee away entirely and here is what remains — usually the bigger half of the budget.
Cleaning, structuring, and labeling the data the model needs. Routinely the most labor-intensive line, and the easiest to underestimate.
The test sets and harnesses that prove the system is right and stays right. You can't manage a quality or hallucination rate you don't measure.
The checks at every consequential handoff from Pillar 4. People are the most reliable verifier — and the most expensive one.
The agentic loop from Pillar 2 that burns compute chasing its own tail, plus the human time to fix what it got subtly wrong.
The retrieval, knowledge base, and pipelines that keep the model honest — built and maintained, not bought once.
The premium salaries to build it, and the re-testing every time a model updates underneath you. The bill recurs.
Rent vs buy: the GPU math
The single biggest lever on the total is a decision you met in Pillar 3 wearing a different hat — cloud or local. There it was about where your data is allowed to live. Here it is about money, and the shape is rent-versus-buy.
Pay per token. Near-zero to start, then it scales linearly with use — generous at low volume, punishing at high, where heavy production traffic can run into six or seven figures a year. Pure operating expense. You own nothing, and the price is set by someone else's roadmap.
Stand up your own GPUs and the marginal cost per call falls toward the electricity bill. At sustained volume the math can break even in months. You own the hardware and the data path. The catch is everything it takes to actually run it — which is where the sticker price stops telling the truth.
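The break-even claim is easy to test against your own numbers. The sketch below is a simple payback calculation; every dollar figure in it is hypothetical, chosen only to show the shape of the math, and `payback_months` is an illustrative name.

```python
from typing import Optional


def payback_months(hardware_cost: float, monthly_api_bill: float,
                   monthly_running_cost: float) -> Optional[float]:
    """Months until owned hardware pays for itself; None if it never does."""
    monthly_saving = monthly_api_bill - monthly_running_cost
    if monthly_saving <= 0:
        return None  # the metered bill is already the cheaper option
    return hardware_cost / monthly_saving


# Hypothetical figures, not quotes from any vendor.
HARDWARE = 300_000   # purchase price of the GPUs
RUNNING = 20_000     # power, hosting, and staff per month

heavy = payback_months(HARDWARE, monthly_api_bill=60_000, monthly_running_cost=RUNNING)
light = payback_months(HARDWARE, monthly_api_bill=5_000, monthly_running_cost=RUNNING)

print(f"heavy traffic: pays back in {heavy:.1f} months")
print(f"light traffic: {'never pays back' if light is None else light}")
```

The running-cost term carries most of the weight in this sketch: leave it out and almost any volume appears to break even.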
What a GPU in AWS actually costs
“Buy” rarely means buying. For most teams it means renting GPUs by the hour from a cloud, and AWS is the default. The hourly rate is easy to look up. It is also the smallest part of what you’ll pay.
| Instance | GPUs | On-demand | Running nonstop |
|---|---|---|---|
| g6.xlarge | 1× L4 | ~$0.80 / hr | ~$580 / mo |
| p4d.24xlarge | 8× A100 | ~$32.77 / hr | ~$23,600 / mo |
| p5.48xlarge | 8× H100 | ~$55 / hr | ~$39,600 / mo |
On-demand list prices, AWS us-east-1, June 2026. AWS cut P5 rates roughly 44% in mid-2025, and has since been raising the price of reserved GPU capacity as demand climbs — so these will move again, which is exactly why the date is on them. The monthly figures assume the box runs 24/7. That assumption is the trap.
The hourly number is the part everyone quotes. Here is the part nobody does.
A GPU bills the full hourly rate at zero percent utilization, and the meter only stops when you stop it. A box kept warm for low latency, or left running over a forgotten weekend, bills the same as one under full load. Forget even the single-L4 box above for a month and the "cheap" option is a surprise of roughly $580; forget anything larger and it runs to four or five figures.
Stop the instance to dodge the hourly rate and the storage keeps charging. The root volume holding the OS image, the CUDA stack, the model weights is EBS you rent by the month whether the box runs or not — tens of gigabytes of it, billing in the background.
Nobody wants to rebuild a GPU image from scratch, so you snapshot it. That saved machine image is more storage, billed quietly every month, indefinitely — a standing charge just to keep your setup on the shelf.
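A reserved static IP address no longer comes free. Since February 2024 AWS has billed every public IPv4 address by the hour, about $0.005, whether it is attached to a running box or sitting idle. That is a few dollars a month per address, pure waste once the box is stopped, and exactly the kind of line nobody remembers to clean up.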
Top-end GPUs are scarce. AWS rations them through Capacity Blocks — reserved GPU time you book in advance — and has been raising those prices as demand climbs (2026). "Buy" can quietly mean "wait," or "commit to a year to jump the queue."
Someone keeps the drivers, the CUDA versions, the orchestration, the monitoring, and the failover alive. That is a specialist salary, recurring — and it never shows up on the EC2 invoice at all.
The trap: even if you do everything right and stop the box after every run, the disk and the saved image keep billing — a few dollars a month, every month, whether you generate anything or not. And one instance somebody forgot to stop dwarfs months of careful usage. Model your own workload before deciding; the only real mistake is never running the math.
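The hidden lines above can be put into one small model. This is a sketch under stated assumptions: the ~$0.80 hourly rate is the g6.xlarge figure from the table, while the 50 GB sizes and the storage rates ($0.08 and $0.05 per GB-month) are assumed values to check against current pricing before you rely on them.

```python
HOURS_PER_MONTH = 720  # 30 days, roughly matching the "Running nonstop" column


def monthly_gpu_bill(hourly_rate: float, hours_running: float,
                     disk_gb: float, snapshot_gb: float,
                     disk_rate: float = 0.08, snapshot_rate: float = 0.05) -> float:
    """Compute plus storage, in dollars per month.

    Storage bills even at zero running hours. disk_rate and snapshot_rate
    are assumed $/GB-month figures, not quoted prices.
    """
    compute = hourly_rate * hours_running
    storage = disk_gb * disk_rate + snapshot_gb * snapshot_rate
    return compute + storage


# One g6.xlarge with an assumed 50 GB root volume and a 50 GB saved image.
stopped = monthly_gpu_bill(0.80, hours_running=0, disk_gb=50, snapshot_gb=50)
careful = monthly_gpu_bill(0.80, hours_running=20, disk_gb=50, snapshot_gb=50)
forgotten = monthly_gpu_bill(0.80, hours_running=HOURS_PER_MONTH, disk_gb=50, snapshot_gb=50)

print(f"never started: ${stopped:.2f}")
print(f"20 careful hours: ${careful:.2f}")
print(f"left running all month: ${forgotten:.2f}")
```

Under these assumptions one forgotten month costs more than two years of the careful pattern. The model leaves out the IP charge, capacity reservations, and the operator's salary, so treat it as a floor.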
And these examples under-tell it
The instances above can understate self-hosting, not overstate it. A small model runs happily on a single cheap card. But the open models teams actually want to bring in-house are enormous — a frontier open model like DeepSeek runs to hundreds of gigabytes of weights, enough to need a whole cluster of top-end GPUs just to load it, before it answers a single request. And model weights are still growing. If your plan is to self-host a big open model, the real number is a multiple of the single-instance figures here.
When cheap turns expensive
Here is the trap that catches good engineers. As a use case succeeds, per-call cost and total cost move in opposite directions.
Generative AI for your own code is about the cheapest serious use there is. A developer leaning on a coding model all day might run a few dollars to a few tens of dollars in tokens — bounded by the size of your team, paid back many times over in the hours it saves, and entirely internal. Easy to approve. Easy to forecast.
Now take that same capability and publish it — wrap it in a product and put it in front of users. The economics invert. Usage is no longer bounded by your team; it’s bounded by your user base, which you are spending money to grow. Every free trial is your token bill. A feature that goes viral is an invoice, not only a win. The cheaper the per-call looked, the more tempting it was to expose generously — and generous exposure is exactly where a tiny unit cost detonates into a real one.
One "summarize this" button, priced out.
Per call: ~4k input + ~1k output tokens
Cost / call: ~$0.03 (mid-tier model, June 2026)
10 users × 5 calls/day -> ~$45 / month
10,000 users × 5 calls/day -> ~$45,000 / month
one viral week (10× traffic) -> ~$105,000, about 2.3× the monthly budget, in 7 days
Nothing on this card is expensive. Volume is. A price low enough to ignore per call is a price high enough to bankrupt at scale, which is why a published feature needs a budget designed in, not bolted on after the first big week.
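The card's arithmetic can be reproduced in a few lines, which makes it easy to plug in your own user counts. The sketch uses the card's rounded ~$0.03 per call and assumes a 30-day month; `bill` is an illustrative helper name.

```python
COST_PER_CALL = 0.03          # rounded per-call figure from the card
CALLS_PER_USER_PER_DAY = 5


def bill(users: int, days: int, traffic_multiplier: float = 1.0) -> float:
    """Token spend in dollars for a user base over a number of days."""
    calls = users * CALLS_PER_USER_PER_DAY * days * traffic_multiplier
    return calls * COST_PER_CALL


team = bill(10, days=30)
product = bill(10_000, days=30)
viral_week = bill(10_000, days=7, traffic_multiplier=10)

print(f"10 users, one month: ${team:,.0f}")
print(f"10,000 users, one month: ${product:,.0f}")
print(f"one viral week: ${viral_week:,.0f} ({viral_week / product:.1f}x a normal month)")
```

Nothing in the function changes between the three calls except the volume.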
This is the moment cost stops being a line item and becomes a design constraint. Rate limits. Usage caps and paid tiers. Caching the answers you’ve already paid for once. Routing the easy calls to a cheap model and saving the expensive one for when it earns its keep. Passing real cost through to the user instead of silently eating it. None of that is premature optimization on a published AI product — it’s the line between a feature and a liability.
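Three of those controls (caching, a usage cap, and routing) fit in one small wrapper. What follows is a hypothetical sketch, not a production design: `CostGuard`, its fields, and the stub models are all invented for illustration, the cache is an in-memory dict, and the daily counter reset is left out.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CostGuard:
    """Hypothetical wrapper: check the cache, then the cap, then route by difficulty."""
    models: Dict[str, Callable[[str], str]]
    daily_cap: int = 20
    cache: Dict[str, str] = field(default_factory=dict)
    calls_today: Dict[str, int] = field(default_factory=dict)  # daily reset not shown

    def answer(self, user: str, prompt: str, hard: bool = False) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]                   # already paid for once
        used = self.calls_today.get(user, 0)
        if used >= self.daily_cap:
            raise RuntimeError("daily cap reached")  # stop before the bill does
        self.calls_today[user] = used + 1
        model = self.models["frontier" if hard else "small"]
        result = model(prompt)
        self.cache[key] = result
        return result


# Stub models standing in for real API clients; each records that it was billed.
billed = []
guard = CostGuard(
    models={
        "small": lambda p: billed.append("small") or f"[small] {p[:20]}",
        "frontier": lambda p: billed.append("frontier") or f"[frontier] {p[:20]}",
    },
    daily_cap=2,
)

guard.answer("ana", "summarize this memo")
guard.answer("ana", "summarize this memo")             # cache hit, not billed
guard.answer("ana", "prove this invariant", hard=True)
print(billed)
```

The order of the checks is the design choice: a cached answer costs nothing, so it is served before the cap is consulted, and only a call that will actually be billed counts against the user.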
Budget for the whole system
The teams that get burned are the ones who budgeted for the API and discovered the system. The teams that don’t are the ones who priced the whole thing up front — the visible token line, the reasoning tokens hiding behind it, the hidden line items from data prep through oversight, and the rent-versus-buy math for their real volume — and then decided, eyes open, whether the use case was worth it.
That last part is the Context bucket doing its job. Not every use case earns its total cost, and finding that out before you build is the cheapest money you will ever spend.
This is the accounting we do. We model the true total cost of a generative AI system before it's built — the hidden line items, the rent-versus-buy math for your real volume, the right model at each stage — so the number you commit to is the real one, not the sticker. What you get is a build plan, not a verdict: what's buildable and how we'd build it, with the expensive compute placed where it earns its keep and a cheaper model carrying the rest.
Questions a skeptic asks
The honest pushback, answered straight.
The API is so cheap now — isn't generative AI basically free?
The call is cheap; the system isn't. Tokens are often the smallest line on a real generative AI budget, dwarfed by data prep, evaluation, oversight, and rework. "The model is cheap" and "the project is cheap" are different sentences.
Reasoning models are smarter — shouldn't we just always use the best one?
Only where the problem earns it. The strongest reasoning models — the kind powerful enough to be turned loose hunting security holes in critical systems — bill their hidden thinking at the output rate, so every call pays for deliberation whether it needed it or not. Pointing one of those at "what's a good tiramisu recipe?" is like calling in the bomb squad to open a jar of pickles: you'll get an answer, with a bill and a blast radius to match. The skill is routing — the heavyweight on the genuinely hard tenth of your traffic, a cheap model on the other ninety percent.
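The payoff from routing can be estimated before building anything. This sketch assumes the low end of each price range in the pricing table, a call of 4k input and 1k output tokens, and the ten-percent hard share named above; it ignores hidden reasoning tokens, which would widen the gap further.

```python
def call_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one call; rates are US dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed rates: low end of the small and frontier ranges, $/M tokens.
small = call_cost(4_000, 1_000, 0.10, 0.40)
frontier = call_cost(4_000, 1_000, 5.00, 25.00)

HARD_SHARE = 0.10  # the genuinely hard tenth of traffic
routed = HARD_SHARE * frontier + (1 - HARD_SHARE) * small

print(f"all-frontier ${frontier:.4f} per call, routed ${routed:.4f} per call")
print(f"routing cuts the average call cost about {frontier / routed:.1f}x")
```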
Should the model you build with be the model you ship with?
Not usually — and conflating them is how budgets blow up. The heavy, expensive compute often belongs in the build: spend a frontier model on the research and the evaluation work before anything reaches a stakeholder. That front-loaded spend is what lets the released tool run on a cheap model day to day. The expensive model earns its keep building the thing; the cheap one serves it.
We'll just self-host to escape the per-token meter.
Sometimes right, often a trap. The hourly GPU rate is the smallest line — the idle time, the scarce capacity, the storage and egress, and the people to run it are the rest. Self-hosting swaps a metered bill you understand for a fixed bill you have to operate. Worth it at real, sustained volume; expensive theater below it.
Can't we just use a smaller, cheaper model?
Often, and you should, where it holds quality. But a cheaper model that needs more retries, more oversight, or more grounding can cost more all-in than the pricier one that gets it right. Optimize the total, not the per-token line.
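"Optimize the total" can be made concrete with an expected-cost calculation. The sketch assumes independent attempts retried until one succeeds, and a fixed human cost to catch each failure; the success rates and dollar figures are hypothetical.

```python
def all_in_cost(cost_per_attempt: float, success_rate: float,
                review_cost_per_failure: float) -> float:
    """Expected dollars per successful result.

    With independent retries, expected attempts are 1 / success_rate and
    expected failures are one fewer than that.
    """
    attempts = 1 / success_rate
    failures = attempts - 1
    return attempts * cost_per_attempt + failures * review_cost_per_failure


# Hypothetical: each failure costs $0.50 of someone's time to catch and redo.
cheap_total = all_in_cost(0.001, success_rate=0.70, review_cost_per_failure=0.50)
pricey_total = all_in_cost(0.03, success_rate=0.98, review_cost_per_failure=0.50)

print(f"cheap model, all-in: ${cheap_total:.3f} per good answer")
print(f"pricier model, all-in: ${pricey_total:.3f} per good answer")
```

With these inputs the model that is thirty times cheaper per call ends up several times more expensive per good answer. Measure your own success rates before trusting either direction.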
Generative AI keeps getting cheaper — won't this sort itself out?
Old models get cheaper, and that's real. But the frontier model you actually reach for lands at the top of the price range every release, and the costs that dominate the bill — your data, your evaluation, your people — don't fall on the token curve at all. Cheaper tokens just make the smallest line smaller.
What we will not claim (anti-fabrication)
The API price is the cost.
It's the visible cost, and usually the smallest one. Data prep, evaluation, oversight, and rework routinely outweigh it — and the reasoning tokens you never see inflate even the visible line. The real bill is the system, not the call.
Self-hosting is always cheaper than the API.
Only past a volume threshold most teams never reach. Below it, an idle GPU you rent by the hour and pay a specialist to babysit costs far more than a metered call you only pay for when you make it.
Generative AI keeps getting cheaper, so cost isn't a concern.
Tokens get cheaper; the work around them doesn't, or at least not on the same curve. Treating a falling sticker price as a falling total is how budgets blow up.
Where this goes next
Knowing what a system costs is one kind of accountability. The last pillar is about the other kind — being able to show, after the fact, why the generative AI did what it did, and prove it to an auditor, a regulator, or a customer. Part 6, the final one, is about reproducibility, explainability, and traceability: making generative AI you can account for.
Whether you’ve already deployed generative AI and the bill is bigger than the quote, or you’re scoping a build and want the true number before you commit, modeling the full cost (and the rent-versus-buy decision behind it) is concrete, doable work, and a good place for a free 60-minute call to start. For how this fits the data and AI work we deliver, see our approach to AI engagements.