Building AI Agents That Actually Do Useful Work
Everyone is building agents. Most of them are not production-ready. Here is what separates the demos from the ones that genuinely work.
AI agents are everywhere right now. Multi-step workflows, tool use, autonomous decision-making — the demo videos are compelling and the GitHub stars are impressive. Then you try to put one in production and discover the gap between "impressive in a controlled environment" and "reliable enough to trust with real work".
We have built agents that run in production, and the hard lessons are not where most people expect them.
The Tool Design Is Everything
An agent is only as good as the tools you give it. Vague tool descriptions produce vague tool usage. Tools that do too many things at once produce errors that are nearly impossible to debug. Tools that fail silently produce agents that confidently complete tasks they have actually failed at.
Every tool your agent can call should do one thing, describe that thing precisely, and fail loudly with a useful error message when something goes wrong. Think of it like designing an API for someone who cannot ask clarifying questions.
The most common mistake we see is giving an agent too much surface area too early. Start with the minimum set of tools that can complete the task. Add more as you understand where the agent actually gets stuck.
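To make this concrete, here is a minimal sketch of a single-purpose tool in the spirit described above. All names here (`lookup_customer`, the spec dictionary shape, the result type) are illustrative assumptions, not any particular framework's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolResult:
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None

# Hypothetical data store standing in for a real backend.
CUSTOMERS = {"c-100": "Ada Lovelace", "c-200": "Grace Hopper"}

# The description says precisely what the tool does, what format it
# expects, and what it does NOT do -- written for a caller who cannot
# ask clarifying questions.
LOOKUP_CUSTOMER_SPEC = {
    "name": "lookup_customer",
    "description": (
        "Look up a single customer's name by ID (format 'c-<digits>'). "
        "Returns an error if the ID is malformed or not found. "
        "Does NOT search by name; that would be a separate tool."
    ),
    "parameters": {"customer_id": "string, e.g. 'c-100'"},
}

def lookup_customer(customer_id: str) -> ToolResult:
    # Fail loudly, with a message the agent can act on.
    if not customer_id.startswith("c-"):
        return ToolResult(ok=False, error=f"Malformed ID {customer_id!r}: expected 'c-<digits>'")
    if customer_id not in CUSTOMERS:
        return ToolResult(ok=False, error=f"No customer with ID {customer_id!r}")
    return ToolResult(ok=True, value=CUSTOMERS[customer_id])
```

The point is the shape, not the domain: one verb, one noun, explicit errors.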
Observability Is Not Optional
A model processing a multi-step task is a black box by default. You have no idea which tool call introduced the error, which piece of context caused the model to go off track, or why the third attempt succeeded when the first two did not.
Logging every tool call, every model response, and every state transition is not optional — it is the minimum requirement for understanding what your agent is actually doing. Without it, debugging is guesswork, and deploying improvements is faith-based engineering.
Build your observability infrastructure before you start optimising your prompts. You cannot improve what you cannot measure.
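A sketch of what that minimum looks like in practice, assuming a simple structured-logging design (the class and field names are ours, not any library's): every tool call, result, and state transition becomes one JSON line keyed by a run ID, so a full agent run can be reconstructed afterwards.

```python
import json
import time
import uuid

class RunTrace:
    """Collects one structured event per tool call, model response, or
    state transition, all tied to a single run ID."""

    def __init__(self):
        self.run_id = str(uuid.uuid4())
        self.events = []  # in production, ship these to a log sink

    def log(self, kind, **fields):
        event = {"run_id": self.run_id, "ts": time.time(), "kind": kind, **fields}
        self.events.append(json.dumps(event))
        return event

trace = RunTrace()
trace.log("tool_call", tool="lookup_customer", args={"customer_id": "c-100"})
trace.log("tool_result", tool="lookup_customer", ok=True)
trace.log("state_transition", step=1, status="completed")
```

With this in place, "which tool call introduced the error" becomes a query over the event log rather than guesswork.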
Humans in the Loop (For Now)
There is a version of this technology in which agents operate entirely autonomously and handle every exception gracefully through clever engineering. That version is coming. For most production use cases today, the safer architecture puts a human checkpoint at the points of highest consequence.
This is not a failure of the technology. It is an honest assessment of where the trust threshold currently sits. An agent that pauses and asks for confirmation before deleting records, sending external communications, or making irreversible changes is not a worse agent — it is a safer one that your organisation will actually let you deploy.
As confidence in the system builds, those checkpoints can be removed one at a time.
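One way to sketch such a checkpoint, under our own assumptions (the action names and the callback-based approval are illustrative): high-consequence actions are gated behind an explicit approval function, and everything else runs straight through.

```python
# Actions the organisation has deemed irreversible or externally visible.
HIGH_CONSEQUENCE = {"delete_records", "send_external_email"}

def execute(action, run, approve):
    """Run `action` via `run()`, but only after `approve(action)` returns
    True when the action is on the high-consequence list."""
    if action in HIGH_CONSEQUENCE and not approve(action):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "result": run()}

# The approval hook is a plain callback, so it can be a CLI prompt, a
# Slack message, or -- once trust has built -- an auto-approve stub.
blocked = execute("delete_records", run=lambda: "deleted", approve=lambda a: False)
allowed = execute("summarise_notes", run=lambda: "summary", approve=lambda a: False)
```

Removing a checkpoint later is then a one-line change to the `HIGH_CONSEQUENCE` set, not an architectural rewrite.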
Evaluation Is a Product Feature
How do you know your agent is getting better? A subjective sense that "it feels more reliable" is not enough. You need a test suite of representative tasks with known correct outcomes, and you need to run it every time you change the prompt, the model, or the tool definitions.
Treating evaluation as a development afterthought is how you end up shipping a regression and not knowing for a week.
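A minimal harness in that spirit might look like this. Everything here is a stand-in: the task names, the regex "agent", and the case format are illustrative, not a real evaluation framework. The shape is what matters: fixed representative cases, known correct outcomes, one pass rate per run.

```python
import re

# Representative tasks with known-correct outcomes. In a real suite these
# would cover every behaviour you care about not regressing.
EVAL_CASES = [
    {"task": "extract_invoice_total", "input": "Total due: $42.00", "expected": "42.00"},
    {"task": "extract_invoice_total", "input": "Amount: $7.50", "expected": "7.50"},
]

def run_evals(agent, cases):
    """Score the agent on every case; return the pass rate and per-case results."""
    results = []
    for case in cases:
        output = agent(case["input"])
        results.append({"task": case["task"], "passed": output == case["expected"]})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

def stub_agent(text):
    # Stub standing in for the real model-backed system under test.
    match = re.search(r"\$([\d.]+)", text)
    return match.group(1) if match else ""

rate, results = run_evals(stub_agent, EVAL_CASES)
```

Run it on every change to the prompt, the model, or the tool definitions, and a regression shows up as a dropped pass rate the same day you introduce it, not a week later.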
Building an agent and hitting the wall between demo and production? Talk to us — this is one of the areas we spend a lot of time in.