Small Models, Big Savings: When You Don't Need the Frontier
Small and open models have quietly become good enough for a large share of real work. Here is where we reach for them on client projects, what they buy beyond a lower bill, and where they still fall short.
For a while the smart default was simple: use the biggest model you can afford and accept the bill. That default is ageing badly. Small models, including open ones you can run yourself, have got good enough that for a meaningful slice of real work the frontier is no longer the obvious choice. On several recent projects the better engineering decision was the smaller model, and not only because it cost less.
It is worth being clear about where that holds and where it does not, because the failure mode in both directions is expensive.
The Frontier Model Is a Default, Not a Requirement
The frontier is where you start when you do not yet know how hard your problem is, which is reasonable for a prototype. The mistake is leaving it there once you do know. A great deal of production work turns out to be narrow and well-defined, and narrow well-defined tasks are exactly what smaller models handle competently. The capability gap that matters on a hard reasoning problem often vanishes on the routine jobs that make up most of a product.
What Small Models Are Genuinely Good At
The sweet spot is high-volume, well-scoped work. Classification into a known set of categories. Extracting structured fields from messy text. Short rewrites and tone adjustments. Routing a request to the right place. Simple, bounded question answering over a small context. For tasks like these, a small model can produce output that users cannot distinguish from a frontier model's, while running faster and costing a fraction as much. When the work is repetitive and the rules are clear, size buys you very little.
Cost, Latency, and the Data Question
The savings are the headline, and they are real, often an order of magnitude on the right task. But two other benefits matter as much to certain clients. Latency is the first: smaller models respond faster, which is the difference between a feature that feels instant and one that makes users wait. Data residency is the second: an open model you host yourself means sensitive data never leaves your environment, which for clients in regulated sectors is not a nice-to-have but the thing that decides whether the feature is allowed to exist at all. For some engagements that single property outweighs every other consideration.
Where They Fall Down
Small models are not a free win, and pretending otherwise leads to the opposite mistake. They struggle with genuine multi-step reasoning, with tasks that need broad world knowledge, and with the long, messy inputs where a model has to hold a lot together and stay coherent. They are easier to knock off course with an awkwardly phrased request. Push a small model past its range to save money and you will pay it back in wrong answers, retries, and the human time spent cleaning up, which usually costs more than the frontier model would have. Hosting an open model yourself also brings real operational work that the savings have to justify.
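The retry-and-cleanup arithmetic can be made concrete with a rough sketch. It assumes each attempt succeeds independently at a fixed rate, that a failed attempt is retried up to a limit, and that a task which never succeeds goes to a person at a fixed cleanup cost. The function name, the parameters, and every number below are hypothetical and chosen for illustration, not measured prices or success rates.

```python
def expected_cost_per_task(price_per_call, success_rate, cleanup_cost, max_attempts=3):
    """Expected cost of one task, counting retries and human cleanup.

    Simplifying assumptions (ours, for illustration): attempts are
    independent, every attempt costs the same, and a task that fails
    all attempts is handed to a person at a fixed cleanup cost.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    failure_rate = 1.0 - success_rate
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_calls = sum(failure_rate ** k for k in range(max_attempts))
    # Probability that every attempt fails and a human has to step in.
    p_all_fail = failure_rate ** max_attempts
    return price_per_call * expected_calls + p_all_fail * cleanup_cost


# Hypothetical inputs: a small model at a tenth of the per-call price,
# pushed onto a task where it succeeds only 80% of the time.
small = expected_cost_per_task(price_per_call=0.001, success_rate=0.80, cleanup_cost=5.00)
frontier = expected_cost_per_task(price_per_call=0.010, success_rate=0.99, cleanup_cost=5.00)
print(f"small: {small:.4f}  frontier: {frontier:.4f}")
```

With these made-up inputs the small model comes out more expensive per task despite the lower call price, because the human cleanup term dominates. Lower the cleanup cost or raise the small model's success rate and the comparison flips, which is why the real rates have to be measured on the task in question.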
How We Choose
We decide per task, with evidence, not per project by reputation. Take the specific job, run a small model against real examples, and look at where it breaks. If it holds up, the savings, the speed, and sometimes the data control make it an easy call. If it breaks on cases that matter, we move up without hesitation, because a cheaper model that gets it wrong is not cheaper. The frontier and the small model are tools, and the skill is matching each one to the work it suits rather than committing to either as a policy.
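As a minimal sketch of that per-task check, the code below scores each candidate against labelled examples, keeps the failures for inspection, and returns the cheapest candidate that clears a threshold. Everything in it is an assumption for illustration: the names Candidate, evaluate, and pick_cheapest_passing are hypothetical, the two "models" are keyword stand-ins rather than real model calls, and the exact-match comparison only suits tasks with a fixed answer set such as classification or routing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected answer)


@dataclass
class Candidate:
    name: str
    predict: Callable[[str], str]  # stand-in for a real model call
    relative_cost: float           # any consistent cost unit


@dataclass
class EvalResult:
    name: str
    accuracy: float
    failures: List[Tuple[str, str, str]]  # (input, expected, got)


def evaluate(candidate: Candidate, examples: Sequence[Example]) -> EvalResult:
    """Run one candidate over the examples and record where it breaks."""
    if not examples:
        raise ValueError("need at least one example")
    failures = []
    for text, expected in examples:
        got = candidate.predict(text)
        if got.strip().lower() != expected.strip().lower():
            failures.append((text, expected, got))
    accuracy = 1 - len(failures) / len(examples)
    return EvalResult(candidate.name, accuracy, failures)


def pick_cheapest_passing(candidates: Sequence[Candidate],
                          examples: Sequence[Example],
                          min_accuracy: float) -> Optional[EvalResult]:
    """Try candidates from cheapest to dearest; return the first that passes."""
    for candidate in sorted(candidates, key=lambda c: c.relative_cost):
        result = evaluate(candidate, examples)
        if result.accuracy >= min_accuracy:
            return result
    return None  # nothing passed: rethink the task or the threshold


# Toy stand-ins so the sketch runs without any model or network access.
def toy_small(text: str) -> str:
    return "billing" if "refund" in text else "bug"


def toy_large(text: str) -> str:
    return "billing" if ("refund" in text or "charged" in text) else "bug"


examples = [
    ("refund please", "billing"),
    ("app crashes on login", "bug"),
    ("I was charged twice", "billing"),
]
candidates = [
    Candidate("toy-large", toy_large, relative_cost=10.0),
    Candidate("toy-small", toy_small, relative_cost=1.0),
]
chosen = pick_cheapest_passing(candidates, examples, min_accuracy=0.9)
print(chosen.name if chosen else "no candidate passed")
```

The failures list matters as much as the accuracy figure: a single overall score hides whether the misses fall on the cases that matter. In practice the examples would be real inputs from the feature, and the threshold would be set per task rather than reused across a project.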
The interesting shift is how much work now sits comfortably in small-model territory, and how fast that share is growing. A lot of features built on the frontier today could run on something far smaller, and the gap that justified the bigger model is closing on exactly the routine work most products are made of.
Curious whether part of your AI feature could run on a smaller or self-hosted model? We can test it and tell you what you would gain and what you would give up.