AI Readiness Audits Are Quietly Becoming Most of Our Consultancy Work
A year ago clients hired us to ship features. Now they hire us to tell them whether their codebase can survive the AI feature their CEO already announced. The findings are starting to repeat.
The shape of our consultancy enquiries has shifted in a way that took us a few months to notice. Not the volume — that has been steady. The nature of what people are asking for. Twelve months ago we were being hired to ship things. Today, more often than not, we are being hired to assess whether the thing the company has already promised to ship is actually buildable on the foundation they have.
The label that has stuck for it, in our heads at least, is AI readiness audit. The label that some clients have written into the engagement brief is closer to please tell our CTO we cannot ship this in six weeks. Both are versions of the same job.
Either way, it is interesting enough now to write down what we keep finding.
What clients actually ask for
Almost nobody calls and asks for an "AI readiness audit". That is not a category that exists in their head. The conversations start somewhere else. Something like:
Our CEO told the board we are going to add an AI assistant to the product by Q3. The team has been prototyping but we are not sure the codebase is ready. Can you take a look?
Or:
We have a Copilot-style feature working in a demo branch. Marketing wants to launch it. Engineering wants another quarter. We need a third party to tell us who is right.
Or, the version that has been growing fastest:
We bought a vector database six months ago and built a RAG prototype. We cannot get the answer quality past 70%. Where do we even look?
The common thread is that the AI feature is not the question. The thing underneath the AI feature — the data, the codebase, the observability, the way the team currently ships — is the question. The AI feature is the deadline.
What an audit actually covers, in practice
We spent the first few of these engagements inventing a methodology from scratch each time. The shape eventually settled into roughly four pillars, in the order we look at them:
The data, before the model. Where does the data live, who owns it, how clean is it, can it leave the building, and is it allowed to be sent to a foreign LLM provider in the first place. This is the question that derails the largest number of feature launches. We have lost count of how many "we will ship this in six weeks" prototypes turned out to depend on shipping customer PII to OpenAI without a DPA in place. The honest answer is usually that the legal review will take longer than the build.
The integration surface. Where in the existing product are you actually injecting this AI feature, and is the surrounding code ready to host it. A frequent finding: the codebase has zero feature flags, zero way to A/B test, zero way to roll back a prompt change without a redeploy. Adding an AI feature to a system that has no way to experiment is the most expensive way to do AI integration.
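To make that concrete, the smallest version of the missing seam looks something like the sketch below: the flag and the active prompt version live in config deployed alongside (not inside) the code, so a bad prompt change is a config revert rather than a redeploy. Every name in it, the config file and keys included, is invented for illustration, not a prescription.

```python
# Minimal sketch: AI feature behind a percentage rollout flag, prompt behind a
# version pointer, both read from config rather than hardcoded. Names invented.
import hashlib
import json
from pathlib import Path

# Stands in for whatever is actually deployed separately from the code.
DEFAULT_CONFIG = {
    "enabled": True,
    "rollout_percent": 10,
    "active_prompt_version": "v2",
    "prompts": {"v1": "Summarise the ticket.", "v2": "Summarise the ticket, citing sources."},
}

def load_ai_config(path: str = "ai_feature.json") -> dict:
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else DEFAULT_CONFIG

def rollout_bucket(user_id: str) -> int:
    # Stable bucketing: Python's built-in hash() is salted per process.
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

def should_serve_ai(user_id: str, config: dict) -> bool:
    # Deterministic percentage rollout: a given user always lands in the same arm.
    return config["enabled"] and rollout_bucket(user_id) < config["rollout_percent"]

config = load_ai_config()
if should_serve_ai("user-123", config):
    # Rolling back a bad prompt is now a config edit, not a redeploy.
    prompt = config["prompts"][config["active_prompt_version"]]
```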
The eval and observability layer. This is where almost every client we have audited has nothing. No test fixtures of representative inputs. No regression suite for prompt changes. No way to attribute a quality drop to a specific change. No telemetry on what the model actually returned. We have written before about why most AI integrations fail in production — the absence of an eval layer is the one nearly all of those failures share.
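The smallest eval layer that counts is genuinely small, which is what makes its absence so striking. A sketch follows, assuming a placeholder call_model and an invented fixture format; the substance is the fixed input set and the threshold that fails the build on regression.

```python
# Minimal regression eval: fixed fixtures, a cheap grader, a hard threshold.
# `call_model` and the fixture format are placeholders, not a real client setup.
import json

def call_model(prompt: str, user_input: str) -> str:
    raise NotImplementedError("wire this to the actual provider client")

def passes(answer: str, must_contain: list[str]) -> bool:
    # String-level grading is crude; teams usually graduate to rubric or
    # model-graded checks. Crude is still enough to catch a regression.
    return all(term.lower() in answer.lower() for term in must_contain)

def run_eval(prompt: str, fixtures_path: str = "eval_fixtures.json",
             threshold: float = 0.9) -> float:
    with open(fixtures_path) as f:
        cases = json.load(f)  # [{"input": "...", "must_contain": ["..."]}]
    score = sum(
        passes(call_model(prompt, c["input"]), c["must_contain"]) for c in cases
    ) / len(cases)
    assert score >= threshold, f"eval regression: {score:.0%} < {threshold:.0%}"
    return score
```

Run something like this in CI on every prompt edit and every model swap, and the leap of faith becomes a diff.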
The cost model. The slide deck says the AI feature will cost £X per user. The actual API spend, multiplied by the actual call frequency under realistic usage, is usually three to ten times higher. Sometimes the unit economics genuinely do not work and the kind thing is to say so before the launch.
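The arithmetic behind that multiplier is worth doing on one slide's worth of numbers. Everything in the sketch below is illustrative; the per-token prices in particular are placeholders for whatever the provider's current sheet says.

```python
# Back-of-envelope unit economics. Every number here is made up for illustration;
# the deck's version of this usually assumes far fewer calls per user.
input_tokens = 3_000           # prompt plus retrieved context, per call
output_tokens = 500
calls_per_user_per_day = 12    # observed under realistic usage, not the 2 assumed
price_in = 2.50 / 1_000_000    # £ per input token (placeholder)
price_out = 10.00 / 1_000_000  # £ per output token (placeholder)

cost_per_call = input_tokens * price_in + output_tokens * price_out
monthly_per_user = cost_per_call * calls_per_user_per_day * 30
print(f"£{monthly_per_user:.2f} per user per month")  # £4.50 with these numbers
```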
A typical engagement is two to three weeks. The deliverable is not a slide deck. It is a written report with code-level recommendations, a prioritised punch list, and an honest answer to whether the original deadline can survive.
The patterns we keep finding
We have done enough of these now that the same handful of issues come up across very different companies and stacks. They are worth listing because, if you recognise three or more, your codebase is probably not as ready as the planning document suggests.
Prompts hardcoded in source. No version control over the prompt itself. No way to A/B test a change without a deploy. No way to attribute a quality regression to a prompt revision because there is no record of what the prompt used to be. This is the single most common finding.
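The cheapest fix is correspondingly unglamorous: prompts as files in the repo, addressed by version, with the content hash logged on every call. A sketch follows; the directory layout and names are invented.

```python
# Sketch: versioned prompt files plus a content hash for attribution.
import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts/summarise")  # contains v1.txt, v2.txt, ... (invented)

def load_prompt(version: str) -> tuple[str, str]:
    text = (PROMPT_DIR / f"{version}.txt").read_text()
    digest = hashlib.sha256(text.encode()).hexdigest()[:12]
    return text, digest

# usage: log the hash with every model call, so a quality regression can be
# pinned to a specific revision.
# prompt, prompt_hash = load_prompt("v2")
```

With that in place, "what did the prompt used to say" becomes a git question rather than an archaeology project.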
No representative test set. The team has a dozen ad-hoc examples in someone's notes app. There is no held-out set, no regression suite, no automated way to know that a model swap or a prompt edit made things measurably better. Every model upgrade is therefore a leap of faith.
Retrieval that is not measured. The RAG system retrieves "relevant" documents. Nobody can tell you what fraction of queries retrieve the genuinely correct document in the top-3. Nobody is calculating recall. The chunk size was set once and never tested. (This is why so many RAG systems plateau at the same place — see also RAG is not magic.)
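Measuring this takes an afternoon, not a quarter. Below is a sketch of recall@k over a small labelled set; retrieve_fn stands in for whatever the client's retrieval stack actually exposes, and the example pairs are invented.

```python
# Sketch: top-k retrieval recall over labelled (query, correct doc id) pairs.
def recall_at_k(retrieve_fn, labelled, k: int = 3) -> float:
    # retrieve_fn(query, k) -> list of doc ids, best first
    hits = sum(gold in retrieve_fn(query, k) for query, gold in labelled)
    return hits / len(labelled)

labelled = [
    ("how do I cancel my plan", "doc-billing-14"),
    ("reset two-factor authentication", "doc-auth-03"),
    # 50-100 pairs is enough to make a chunk-size change measurable
]

# usage, with the client's own retriever plugged in:
# print(f"recall@3: {recall_at_k(my_retriever, labelled):.0%}")
```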
No PII handling on the request path. User-submitted text is sent to the LLM provider without being scanned, redacted, or even logged. The eventual GDPR conversation is going to be expensive.
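Even a crude redaction pass on the request path changes that conversation. A sketch follows; the patterns are illustrative and a floor rather than a ceiling (a production system would add an NER pass and proper logging controls).

```python
# Sketch: redact obvious PII before the text leaves the building.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Call me on +44 7700 900123 or email jo@example.com"))
# -> Call me on [PHONE] or email [EMAIL]
```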
Cost telemetry as an afterthought. The team has the bill at the end of the month. They cannot tell you what the bill was for yesterday. They certainly cannot tell you which prompt template is responsible for 80% of the spend. This goes from annoying to existential the moment usage scales.
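The fix is mechanical: record tokens, template id, and computed cost on every call, so the 80% question is a query over logs rather than a guess. A sketch, with invented field names and placeholder prices:

```python
# Sketch: per-request cost attribution. Field names and prices are placeholders.
import json
import time

PRICE_IN = 2.50 / 1_000_000    # £ per input token (placeholder)
PRICE_OUT = 10.00 / 1_000_000  # £ per output token (placeholder)

def log_call(template_id: str, prompt_tokens: int, completion_tokens: int) -> None:
    record = {
        "ts": time.time(),
        "template": template_id,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": prompt_tokens * PRICE_IN + completion_tokens * PRICE_OUT,
    }
    print(json.dumps(record))  # in practice: structured logs or a metrics pipeline

# usage: most provider responses expose token counts in a usage object, e.g.
# log_call("summarise-v2", resp.usage.prompt_tokens, resp.usage.completion_tokens)
```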
An eval system that the engineers do not actually use. Sometimes there is one — set up by an enthusiastic engineer six months ago, last green build was four months ago, nobody on the team remembers how to run it. We score this as worse than no eval system, because it gives the team false confidence.
The bad reasons clients call
We say no to about a quarter of these enquiries. The pattern is consistent enough that it is worth flagging:
They want us to confirm a decision they have already made. Sometimes the audit brief is structured to produce a particular answer. We have walked away from engagements where it became clear that the only acceptable conclusion was yes, ship it on time. That is not an audit. That is a paid alibi.
The "audit" is being used to delay an inconvenient internal conversation. If two engineering leads disagree about whether the AI feature is shippable, an external opinion can break the tie. But sometimes the senior management's actual problem is that two leads disagree, and bringing in a third party is just avoiding the original disagreement at higher cost. Nice work if you can get it; we usually decline.
It is genuinely too early. If the AI feature is still a sketch on a whiteboard, an audit is premature. Build the prototype, get a feel for the data, then decide whether to engage someone like us. We have nothing to assess if there is nothing to look at.
What we have changed our minds about
When we started doing these we thought the value was in the recommendations — the punch list of things to fix. After a year of doing them, the more valuable artefact is usually the single line at the top answering whether the original deadline is realistic.
Engineering teams are usually capable of generating their own punch list of issues if they have the time and the political cover. What they often lack is an external voice willing to say, in writing, the original timeline is wrong by a factor of three. That is the part that breaks the deadlock. The recommendations follow from that, but they are not what the cheque is paying for.
The other thing we have changed our minds about: the model choice almost never matters. Clients ask which model they should use. The honest answer is that the eval layer, the data hygiene, and the prompt versioning matter ten times more than which provider is on the API call. We get this question every engagement. We give the same answer every engagement. Nobody likes the answer because the answer does not feel like the right amount of work.
If your company has promised an AI feature and the engineering team is quietly worried, the kind of audit described above is one of the standard engagements on our tech consultancy line. A scoped, two-to-three-week piece of work. Talk to us about whether your situation fits.