Cutting Your Token Bill Without Cutting Quality
When an AI feature costs more than it should, the answer is rarely a cheaper model. Here are the techniques we reach for first, in the order we reach for them, and the savings they actually deliver.
When a client asks us to bring down the cost of an AI feature, the request usually comes wrapped in an assumption: that the fix is a cheaper, weaker model, and that quality is the price you pay. It almost never is. Most overspending lives in how the feature uses the model, not in which model it uses. You can take a large bite out of the bill before you touch quality at all.
We work through them in the order below, because the early steps are cheap to do and the savings are real.
You Probably Do Not Need a Cheaper Model Yet
Swapping the model is the most visible lever and usually the wrong one to pull first, because it is the one most likely to degrade the output your users see. Everything below this line saves money without changing what the user gets. We exhaust those first, and frequently never need to downgrade the model at all.
Prompt Caching Is the Closest Thing to Free Money
Most features resend the same large block of context on every call: the system prompt, the instructions, the reference material, the examples. Prompt caching lets the provider store that stable prefix and charge a fraction to reuse it, rather than billing full price to reprocess the same paragraphs thousands of times a day. On features with a heavy fixed preamble, this alone has cut spend by half in our experience. It is the first thing we check, and it is astonishing how often it is simply switched off.
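To make the arithmetic concrete, here is a minimal sketch of how a cached prefix changes daily input spend. Every number in it is an assumption for illustration: the token counts, the call volume, the price per million tokens, and the `cached_read_multiplier` (the fraction of full price charged for reusing a cached prefix). It also ignores any surcharge for writing the cache and any cache expiry, both of which vary by provider, so treat it as a back-of-envelope estimate rather than a billing model.

```python
def daily_input_cost(prefix_tokens, variable_tokens, calls_per_day,
                     price_per_mtok, cached_read_multiplier=None):
    """Estimate daily input spend in dollars.

    prefix_tokens: the stable preamble resent on every call.
    variable_tokens: the part that changes per call.
    cached_read_multiplier: hypothetical fraction of full price paid for
        a cached prefix (None means caching is switched off).
    """
    if cached_read_multiplier is None:
        prefix_rate = price_per_mtok
    else:
        prefix_rate = price_per_mtok * cached_read_multiplier
    prefix_cost = prefix_tokens * calls_per_day * prefix_rate / 1_000_000
    variable_cost = variable_tokens * calls_per_day * price_per_mtok / 1_000_000
    return prefix_cost + variable_cost


# Made-up example: 8,000-token preamble, 500 variable tokens, 10,000 calls a day,
# $3 per million input tokens, cached reads billed at 10% of full price.
without_cache = daily_input_cost(8_000, 500, 10_000, 3.0)
with_cache = daily_input_cost(8_000, 500, 10_000, 3.0, cached_read_multiplier=0.1)
print(round(without_cache, 2), round(with_cache, 2))  # 255.0 39.0
```

The heavier the fixed preamble relative to the variable part, the larger the gap between the two figures, which is why features with long system prompts and reference material are the first place to look.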
Stop Sending the Whole Context Every Time
The reflex to include everything "so the model has what it needs" is expensive and rarely necessary. Send the relevant chunk, not the whole document. Summarise long conversation history into a short running state instead of resending every message. Trim examples down to the few that actually shape the output. By being disciplined about context, we have halved input cost on features with no measurable drop in answer quality, because the model never needed most of what it was being handed.
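A rough sketch of what that discipline can look like in code. The function names and the word-overlap scoring are assumptions chosen to keep the example self-contained; a real feature would more likely rank chunks with embeddings or a search index, and the running summary would be produced by a separate summarisation step that is not shown here.

```python
import re


def _words(text):
    """Lowercased set of alphanumeric words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant_chunks(question, chunks, k=2):
    """Keep only the k chunks sharing the most words with the question.

    Naive word overlap stands in for a real relevance ranking.
    """
    question_words = _words(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & _words(c)),
                    reverse=True)
    return ranked[:k]


def build_context(question, chunks, running_summary, recent_messages,
                  keep_last=2, k=2):
    """Assemble a trimmed prompt: summary, a few recent turns, top chunks."""
    recent = recent_messages[-keep_last:] if keep_last > 0 else []
    parts = ["Summary of conversation so far: " + running_summary]
    parts.extend(recent)
    parts.extend(select_relevant_chunks(question, chunks, k))
    parts.append(question)
    return "\n".join(parts)
```

The point of the structure is that the prompt size is bounded by `keep_last` and `k` rather than growing with the length of the document or the conversation.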
Batch What Does Not Need to Be Live
A lot of AI work does not need an instant answer: overnight enrichment, scheduled summaries, classification of a backlog. Where a task can tolerate a delay, batch processing is materially cheaper than firing live requests one at a time, and it smooths your spend instead of spiking it with traffic. The question to ask of every AI call is simple: does a human need this answer in the next second, or just by tomorrow morning?
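One way to apply that question systematically is to tag every task with how long it can wait and split the work accordingly. The sketch below is illustrative only: the `Task` type, the task names, and the five-second threshold are all assumptions, and the actual submission to a provider's batch endpoint is deliberately left out because it differs between providers.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    max_delay_seconds: float  # how long the answer can wait


# Hypothetical cut-off: anything that can wait longer than this goes to batch.
LIVE_THRESHOLD_SECONDS = 5.0


def split_live_and_batch(tasks, threshold=LIVE_THRESHOLD_SECONDS):
    """Separate tasks a human is waiting on from tasks that can run later."""
    live = [t for t in tasks if t.max_delay_seconds <= threshold]
    batch = [t for t in tasks if t.max_delay_seconds > threshold]
    return live, batch


tasks = [
    Task("chat_reply", 1),
    Task("overnight_enrichment", 12 * 3600),
    Task("backlog_classification", 24 * 3600),
]
live, batch = split_live_and_batch(tasks)
print([t.name for t in live])   # ['chat_reply']
print([t.name for t in batch])  # ['overnight_enrichment', 'backlog_classification']
```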
Use a Small Model for the Small Jobs
Once the structural waste is gone, this is where selective model choice belongs. Not downgrading the whole feature, but handing the genuinely simple steps to a cheap model: the classification, the extraction, the does-this-look-relevant check. The hard step keeps the capable model. You pay the premium only where it changes the result, which is usually a minority of the calls.
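In code, this kind of selective routing can be as small as a lookup. The model names below are placeholders rather than real model identifiers, and the set of "simple" steps is an assumption taken from the examples in the paragraph above; which steps genuinely qualify is something to confirm against your own quality checks.

```python
# Placeholder names, not real model identifiers.
SMALL_MODEL = "small-model"
CAPABLE_MODEL = "capable-model"

# Steps assumed simple enough for the cheap model.
SIMPLE_STEPS = {"classification", "extraction", "relevance_check"}


def pick_model(step):
    """Route simple steps to the cheap model, everything else to the capable one."""
    return SMALL_MODEL if step in SIMPLE_STEPS else CAPABLE_MODEL


def premium_share(steps):
    """Fraction of calls that still go to the capable model."""
    if not steps:
        return 0.0
    premium = sum(1 for s in steps if pick_model(s) == CAPABLE_MODEL)
    return premium / len(steps)


pipeline = ["classification", "extraction", "relevance_check", "answer_generation"]
print({step: pick_model(step) for step in pipeline})
print(premium_share(pipeline))  # 0.25
```

Defaulting unknown steps to the capable model is a deliberate choice in this sketch: a new step should have to earn its way onto the cheap model, not land there by accident.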
Measure Before You Optimise
None of this should be done blind. Before changing anything, we break the bill down by step, by call type, and by user action, so we know where the money actually goes. The expensive part is often not the part anyone suspected. Optimising the wrong step feels productive and saves nothing. A day spent on measurement routinely pays for itself many times over, because it points the effort at the calls that dominate the invoice rather than the ones that merely look costly.
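A minimal sketch of that breakdown, assuming each model call has already been logged as a record with a cost attached. The field names (`step`, `call_type`, `cost_usd`) and the sample figures are hypothetical; the only idea being shown is grouping spend by a chosen dimension and sorting so the dominant item comes first.

```python
from collections import defaultdict


def cost_by_key(records, key):
    """Total cost per value of `key`, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up usage log for illustration.
records = [
    {"step": "retrieve", "call_type": "live", "cost_usd": 0.5},
    {"step": "answer", "call_type": "live", "cost_usd": 0.25},
    {"step": "retrieve", "call_type": "batch", "cost_usd": 2.0},
    {"step": "classify", "call_type": "live", "cost_usd": 0.125},
]

print(cost_by_key(records, "step"))
# [('retrieve', 2.5), ('answer', 0.25), ('classify', 0.125)]
print(cost_by_key(records, "call_type"))
# [('batch', 2.0), ('live', 0.875)]
```

Running the same function over each dimension in turn (step, call type, user action) is usually enough to show which calls dominate the invoice before any optimisation starts.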
The pattern across all of this is the same. Cost in AI features is mostly an engineering property, not a model property. Treat it as something you design and measure, and you can usually get the bill where it needs to be while the output your users see stays exactly as good as it was.
Looking at an AI bill that feels too high for what the feature does? We can take it apart and tell you where the money is really going.