A Year of Code Agents in Anger: What Actually Stuck
We have used Claude Code, Cursor, Aider, Cline, and most of what sits between them on real client work for over twelve months. The tools that survived our rotation are not the ones the launch hype tipped to win.
Twelve months ago we started defaulting to AI for almost everything: Forge plugin scaffolds, internal tool rewrites, the marketing site you are reading this on, even the boring CRUD work that used to be junior-engineer fodder. The mood at the time was that "agentic coding" was about to retire the IDE assistant — you would describe a feature, the agent would plan it, the agent would code it, and you would review the diff like a pull request from a remote contributor.
We tried that workflow, repeatedly, on real client projects. It mostly does not work that way. Or rather: it works in narrow, well-defined slices and falls apart everywhere else. The tools that survived our rotation are the ones that quietly admitted this and built around it instead of pretending otherwise.
A short tour of what stuck, what we dropped, and the patterns we keep coming back to.
Claude Code is the unflashy default
The honest answer is that Anthropic's CLI is what we open first now. Not because the model underneath is the cleverest at any given moment — leaderboards reorder themselves every few weeks and chasing them is a waste of attention — but because the harness around the model is the part that matters in production work, and Claude Code's harness is unusually thoughtful.
The thing nobody quite predicted is that the file-system access, the long-running command execution, and the read-before-write discipline matter more than the marginal IQ points between top-tier models. Give a mediocre model a great harness and it writes useful code. Give a genius model a chat window and it confidently invents APIs that do not exist.
We use it in the terminal, with slash commands and a small set of repository-scoped skills. Most of the value we get comes from the unsexy decision to let it read the codebase before it touches anything — which sounds obvious but is exactly the affordance Cursor and Copilot still struggle to give you reliably.
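To make "repository-scoped" concrete: Claude Code picks up project slash commands from markdown files committed under .claude/commands/. The file below is an invented example rather than one of ours, but it shows the shape: a checked-in prompt that bakes the read-before-write step into a command the whole team can run.

```
<!-- .claude/commands/review-module.md (illustrative; invoked as /review-module) -->
Before proposing any change:
1. Read the module the user names, plus its tests and any callers you can find.
2. Summarise the current behaviour and list the invariants you believe must hold.
3. Only then propose a diff, smallest change first, and flag anything you could
   not verify by reading the code.
```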
Cursor was the on-ramp, not the destination
Cursor was where most of our team first saw an AI write a real diff that they would actually merge. Twelve months later we use it for fast IDE-bound work — quick refactors, tab completion, the kind of thing where the editor context is already loaded into your eyes. For anything that spans more than a couple of files we mostly tab over to a CLI agent.
The cost trajectory was a factor. Cursor's pricing made sense when token budgets were small and the value of the wrapper was high. As model providers shipped their own first-party agents and the harness gap narrowed, the wrapper tax got harder to justify. We still pay for it because some engineers prefer the IDE experience, but it is no longer where the team's attention lives.
Aider is the one we recommend to people who want to understand what is happening
If you are uncomfortable with how much "magic" the popular agents perform, Aider is the antidote. Open-source, dead simple, gives you a chat alongside the diff, and forces you to sequence the work yourself. We do not use it as a daily driver, but every engineer who joined the team in the last six months has spent a week with Aider before graduating to the more autonomous tools, and every one of them is a better prompt-writer for it.
There is a generation of engineers who learned to drive AI by watching Cursor auto-complete things into existence. They tend to skip the intermediate skill of talking to the model about what they want before letting it touch the code. Aider trains that habit because it gives you no choice.
The "set it and forget it" agent is still not a real product
We tried, in earnest, to use Devin, OpenHands, the autonomous-agent fork of the week, and the various "give it a Jira ticket and watch it work" demos. None of them survived more than a sprint of contact with a real codebase.
The problem is not intelligence. The problem is that real codebases have inherited weirdness — a build pipeline that breaks if you change two things at once, a deploy step that requires a manual key rotation, a test suite that has a known flake nobody fixes — and an agent without a human in the loop will spend hours fighting one of those instead of routing around it. The hour-rate maths is unforgiving once you realise the engineer-time saved on the easy parts is destroyed by the engineer-time burned untangling the agent's confident-but-wrong attempts on the hard parts.
We will keep an eye on this category. We do not currently have any of these tools in our stack.
MCP changed the negotiation, quietly
The most interesting development of the last six months was not a model release. It was the Model Context Protocol. Suddenly the question stopped being "which model has the best agent" and became "which tools can we plug into whatever model we already use".
Practically: an MCP server for our deployment tooling, an MCP server for the Atlassian Marketplace API, a tiny one for our internal docs. The agent we use does not care which it talks to. We can swap the model underneath without rewriting the integration. This is the kind of standards-emergence that quietly moves a category from interesting to load-bearing.
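For a sense of how small these integrations are, here is roughly what "a tiny one for our internal docs" amounts to. This is a minimal sketch rather than our production server: it assumes the official TypeScript SDK (@modelcontextprotocol/sdk) and zod, and the server name, tool name, and searchDocs helper are all placeholders.

```typescript
// Minimal MCP server sketch over stdio. Names and the searchDocs helper are illustrative.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Placeholder for whatever actually queries the internal docs index.
async function searchDocs(query: string): Promise<string> {
  return `No results for "${query}" (stub)`;
}

const server = new McpServer({ name: "internal-docs", version: "0.1.0" });

// Expose a single read-only tool; the agent decides when to call it.
server.tool(
  "search_internal_docs",
  { query: z.string().describe("Free-text search over the internal docs") },
  async ({ query }) => ({
    content: [{ type: "text", text: await searchDocs(query) }],
  })
);

// stdio transport: the client launches this process and speaks MCP over stdin/stdout.
await server.connect(new StdioServerTransport());
```

Nothing in that file names a model. Any MCP-capable client can launch the same process, which is exactly why we can swap the model underneath without touching the integration.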
The flipside is that an MCP server is also a great place to leak something you did not mean to. We treat them like any other piece of infra now — change-controlled, scoped to least-privilege, audited.
What we changed our minds about
When we wrote about LLMs as a development tool in April, the headline was "they help, but read the diff". A year of harder use has changed two specific opinions:
We were too generous about test generation. AI-written tests are fine when the agent has just written the implementation, because the model has internalised the contract. AI-written tests on a codebase the agent did not author are usually shallow — assertion-rich, behaviourally thin, and exactly the kind we keep finding when we do postmortems on vibe-coded projects. We now ask the agent to propose tests and have a human write them.
We were too dismissive about long-form refactors. With a million-token context and a careful sequencing prompt, a current-generation agent can do a coherent rewrite of a 50-file module in a way that was unthinkable a year ago. The catch is the sequencing — describing the order of operations, the invariants you do not want broken, and the files it must treat as read-only. That is not a prompt; it is a planning artefact. When we treat it like one we get clean refactors. When we ad-lib it we get the silent behaviour changes we have warned about before.
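A sketch of the shape we mean, with module names and steps invented for the example (the real ones are longer and more boring):

```
# Refactor plan: split billing module (illustrative example)

Order of operations
1. Extract the pricing rules into a new module; no behaviour changes in this step.
2. Move the invoice formatting code; update imports only.
3. Delete the old entry points once both new modules have passing tests.

Invariants that must not break
- Public API signatures stay identical until step 3.
- Rounding behaviour in price calculations is unchanged.

Read-only
- Database migration files and anything under vendor/ must not be edited.
```

The useful property is that it is reviewable before any code moves: a colleague can argue with step 2 before the agent has touched a file.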
The honest summary
Most of the wins are smaller than the demos suggested. The agent does not replace an engineer; it gives one engineer more leverage on the bits of the work that were always slow for boring reasons — typing, reading unfamiliar code, scaffolding, regenerating tedious test fixtures. It does not give you more good judgement. The hard parts of the job remain hard.
Where it has genuinely changed our practice is at the boundary between idea and prototype. The "let me just try it" cost has collapsed. We say yes to scoping calls now that we would have politely declined a year ago, because building the spike to know whether something is feasible takes an afternoon instead of a fortnight. That has been worth more to us than any single tool — and it is the thing the discourse never quite lands on.
Building or rescuing an AI-heavy codebase? Talk to us — auditing, untangling, and re-shipping AI-assisted projects is a recurring engagement on our tech consultancy line.