RAG Is Not Magic: Why Your Retrieval Is Quietly Failing You
Retrieval-augmented generation is pitched as the answer to 'make the LLM use my data'. Most implementations are worse than they appear. Here is where the rot usually is.
Retrieval-augmented generation is the default pattern for anyone who wants an LLM to answer questions about their own content. The story is simple: chunk your documents, embed them, store the embeddings, search for the most relevant ones at query time, and pass them to the model along with the user's question. The tutorials make it look like a weekend project.
The demos are genuinely impressive. The production results are usually not, and the reason is almost always the retrieval step, not the model. The LLM is doing exactly what it was asked to do with the context it was given. The context is the problem.
Here is what we keep finding in RAG systems that are underperforming.
The Retrieval Is the Whole System
The first thing to internalise is that in a RAG pipeline, the LLM is downstream of the retrieval. If the retrieval returns the right chunks, the answer is usually fine. If the retrieval returns the wrong chunks, or the right chunks in pieces that have lost their meaning, the answer will be confidently wrong no matter how capable the model is.
This means that when your RAG system produces a bad answer, the first question is not "is the model hallucinating?" It is "what did we actually put in the context window?"
Logging the retrieved chunks for every query, and being able to inspect them after the fact, is the single most important thing you can build. If your system does not have this, nothing else we say in this post will help you.
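A minimal version of that logging, sketched in Python, is just an append-only JSON-lines file. The chunk dicts with `id` and `score` keys are an assumption about your pipeline's shape, not a standard:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RetrievalRecord:
    """One retrieval event: enough to reconstruct what the LLM saw."""
    query: str
    chunk_ids: list   # IDs of the chunks passed to the model
    scores: list      # retrieval scores, same order as chunk_ids
    timestamp: float

def log_retrieval(query, chunks, log_file):
    """Append one JSON line per query so retrievals can be inspected later."""
    record = RetrievalRecord(
        query=query,
        chunk_ids=[c["id"] for c in chunks],
        scores=[c["score"] for c in chunks],
        timestamp=time.time(),
    )
    log_file.write(json.dumps(asdict(record)) + "\n")
```

JSON lines are deliberately boring: they can be grepped, loaded into a dataframe, or replayed against a new retrieval pipeline without any tooling.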
Chunking Is Where Most Systems Die
The default approach in every tutorial is to split documents into fixed-size chunks, usually 512 or 1024 tokens, with some overlap. This works well enough on a demo dataset with uniform, well-structured text. It falls apart on real data.
The typical failure modes:
Chunks that break mid-thought. A paragraph explaining a concept gets cut in half. The first half says "this is how the system works:" and the second half contains the actual explanation. Depending on what the user asks, retrieval returns one or the other, and neither is useful in isolation.
Chunks that lose their context. A chunk about "the second approach" makes no sense without the chunk about "the first approach". A chunk that says "this is deprecated — see below" without "below" is worse than no chunk at all.
Chunks that aggregate unrelated content. A single 1024-token window that spans three different sections on three different topics, none of which are well-represented in the embedding.
The fixes depend on your data. For structured documents (documentation, wikis, knowledge bases), chunk on natural boundaries — headings, sections, bullet lists — rather than on fixed token counts. For unstructured text, consider smaller chunks with more generous retrieval, so that multiple overlapping chunks can be combined.
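For the structured-document case, a heading-based splitter is only a few lines. This sketch assumes markdown-style `#` headings and is a starting point, not a complete implementation (real documents also need handling for code blocks, very long sections, and front matter):

```python
import re

def chunk_by_headings(markdown_text):
    """Split a markdown document on heading boundaries instead of fixed
    token counts, so each chunk is a semantically coherent section."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # start a new chunk whenever a heading begins and we have content
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each chunk now carries its own heading, which also helps the embedding: "## Rate limits" in the chunk text is a strong retrieval signal that a mid-paragraph token window would have lost.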
Anthropic's guide on contextual retrieval makes an argument worth considering: prepend a short, generated summary of the document context to each chunk before embedding it. This is more work at ingestion time, but it helps with the "chunk lost its context" problem significantly.
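A sketch of that idea: prepend the document title and a generated summary to the chunk text before embedding. Here `doc_summary` is assumed to come from a cheap LLM call at ingestion time; this function only does the assembly, and the header format is our own convention rather than anything prescribed:

```python
def contextualise_chunk(chunk_text, doc_title, doc_summary):
    """Prepend document-level context to a chunk before embedding it,
    so the embedding carries information the chunk alone has lost."""
    header = f"Document: {doc_title}\nContext: {doc_summary}\n---\n"
    return header + chunk_text
```

The chunk stored for display can stay unchanged; only the text you embed needs the header, so the extra tokens cost you nothing at answer time.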
Embeddings Alone Are Not Enough
Pure vector search — "find chunks whose embeddings are closest to the query embedding" — has a known weakness. Embeddings are good at semantic similarity and bad at exact matching. "What is the maximum file size?" will retrieve chunks about file sizes generally, but may miss a chunk that specifically says "the maximum file size is 5MB" if the embedding space does not happen to place those close together.
Hybrid retrieval — combining vector search with keyword search (BM25) — consistently outperforms either alone. The vector search catches semantically related content, the keyword search catches exact matches, and the combination is better than either in isolation.
Most production vector databases now support hybrid search natively — Weaviate, Qdrant, Pinecone, and others. Turning it on is usually one line of configuration, and the quality improvement is large enough to be worth it on almost every system we have worked on.
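If your database does not support hybrid search, reciprocal rank fusion (RRF) is one common way to merge the two ranked lists yourself. This is a minimal sketch operating on lists of chunk IDs, with `k=60` as the conventional constant from the original RRF paper:

```python
def reciprocal_rank_fusion(vector_ranking, keyword_ranking, k=60):
    """Merge two ranked lists of chunk IDs into one, scoring each chunk
    by the sum of 1/(k + rank) over the lists it appears in. Chunks that
    rank well in both lists float to the top."""
    scores = {}
    for ranking in (vector_ranking, keyword_ranking):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not scores, which is exactly why it works across retrievers whose scores are not comparable (cosine similarity and BM25 live on different scales).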
Reranking Is Usually Worth It
The typical vector search returns the top N most similar chunks. These are your best guess. They are often not in the right order, and they often include chunks that are similar but not actually useful.
A reranker takes the top N from retrieval and re-scores them using a more expensive, more accurate model. The output is the same chunks in a different, better order — and you then pass only the top few to the LLM.
This adds latency and cost. On most RAG systems, it also meaningfully improves quality, because the reranker is asking a more precise question than the retrieval step can. Cohere's Rerank and open-source alternatives like bge-reranker are both easy to plug in.
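The reranking step itself is a small amount of glue. In this sketch, `score_fn` is a placeholder for a real cross-encoder call (Cohere Rerank, bge-reranker, or similar); the shape of the pipeline is the point:

```python
def rerank(query, chunks, score_fn, top_k=3):
    """Re-score retrieved chunks with a more accurate (and more expensive)
    model, then keep only the best few for the LLM's context window.
    `score_fn(query, chunk_text) -> float` stands in for a cross-encoder."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Note the two-stage shape: retrieval casts a wide net (top 20 to 50), the reranker narrows it (top 3 to 5). Passing all 50 retrieved chunks to the LLM is both more expensive and usually worse.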
If your RAG system is underperforming and you are not reranking, start there.
The Metadata That Gets Left on the Floor
Every chunk in your index came from somewhere — a specific document, a specific section, a specific version, a specific date. This metadata is useful, and almost nobody uses it properly.
Common mistakes:
Not filtering by recency. A knowledge base that contains documents from five years ago and last month will happily retrieve the old ones. If your domain changes, those old chunks are actively harmful. Filtering to recent content (or boosting recent content) is often a large win.
Not filtering by document type. A RAG system built over "all the company's content" retrieves internal forum posts alongside official documentation. The forum posts are speculative, opinionated, and sometimes wrong. Filtering by source type gives the user what they actually want.
Not filtering by permissions. A RAG system over internal company data can cheerfully return chunks from documents the current user is not authorised to see. This is an access control bug with the severity of any other access control bug, and it is surprisingly common in RAG systems built quickly.
Metadata filtering happens at retrieval time and is almost always available in your vector database. Using it is a matter of plumbing the filter through the query. The improvement in answer quality is usually out of proportion to the engineering effort.
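As a sketch of the three filters together, here they are applied in Python over chunk dicts with assumed `source_type`, `updated_at`, and `acl_group` metadata fields. In a real system this logic becomes a filter clause on the vector database query itself, so unauthorised or stale chunks are never even candidates:

```python
from datetime import datetime, timedelta

def filter_chunks(chunks, user_groups, allowed_types, max_age_days=365):
    """Apply metadata filters alongside vector search: recency, source
    type, and the caller's permissions. The field names here are
    illustrative; use whatever your index actually stores."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [
        c for c in chunks
        if c["source_type"] in allowed_types
        and c["updated_at"] >= cutoff
        and c["acl_group"] in user_groups
    ]
```

The permissions check in particular must happen at retrieval time, not in a post-processing step the LLM has already seen past: once a forbidden chunk is in the context window, the damage is done.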
Evaluation Without Vibes
The biggest operational problem with most RAG systems is that nobody has a reliable way to measure whether changes make things better or worse. You tweak the chunking, you try a different embedding model, you add a reranker — and then you ask it a few questions manually and decide whether it seems better.
This is the wrong way to run a system that is going in front of users.
The minimum useful evaluation: a fixed set of representative queries, each with expected ground-truth information that should appear in the retrieved chunks. Run the evaluation after every change. Track how many queries retrieve the correct information.
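That minimum evaluation fits in a dozen lines. This sketch assumes an eval set of dicts with `query` and `expected_chunk_ids` keys (names are ours), and a `retrieve` function representing whatever your pipeline currently is:

```python
def retrieval_recall(eval_set, retrieve, top_k=5):
    """Fraction of eval queries for which at least one expected chunk ID
    appears in the top-k retrieved results. Run after every change to
    chunking, embeddings, or ranking to see whether it actually helped.
    `retrieve(query) -> list of chunk IDs` is your retrieval pipeline."""
    hits = 0
    for case in eval_set:
        retrieved = set(retrieve(case["query"])[:top_k])
        if retrieved & set(case["expected_chunk_ids"]):
            hits += 1
    return hits / len(eval_set)
```

A few dozen representative queries are enough to start; the point is that the number moves in a comparable way before and after each change, not that it is a perfect measure of quality.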
For answer quality (beyond retrieval), the common technique is LLM-as-judge: use a model to score whether the generated answer correctly addresses the question, given the retrieved chunks. It is imperfect — these scores are not perfectly reliable — but it is much better than nothing, and much better than asking a few questions and going with your gut.
Ragas and similar open-source frameworks have decent defaults for this kind of evaluation, and they are worth using if you do not want to build your own.
Where Retrieval Genuinely Cannot Help
RAG is good at "find the relevant information in my documents and answer based on it". It is bad at a set of things that people try to make it do anyway:
Numerical reasoning across many documents. "What was the total revenue across all these reports?" requires the model to aggregate numbers from multiple chunks. LLMs are unreliable at this. If the answer requires arithmetic on retrieved data, compute the answer outside the model and inject it.
Questions that require the full document. "Summarise this 200-page report". Retrieval returns chunks, not documents. If the user needs the whole document processed, you need a different approach (map-reduce summarisation, long-context models, or document-specific tooling).
Multi-hop reasoning that requires following a chain across documents. "Find the customer that signed the contract with X, and tell me what their usage was last month." This needs two retrieval steps, with the result of the first informing the second. Possible, but not with naive RAG.
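For the first of these, the aggregation can be done in code and the computed result injected into the prompt. This toy sketch assumes figures appear in a predictable `revenue: <number>` form; real extraction is messier and the regex here is purely illustrative:

```python
import re

def sum_revenue_figures(chunks):
    """Aggregate numeric figures from retrieved chunks in code, then hand
    the computed total to the model, instead of asking the LLM to add
    numbers scattered across its context window."""
    total = 0.0
    for chunk in chunks:
        # matches e.g. "revenue: $1,200,000" or "revenue 950000"
        for match in re.findall(r"revenue[:\s]+\$?([\d,]+(?:\.\d+)?)", chunk, re.IGNORECASE):
            total += float(match.replace(",", ""))
    return total
```

The model then gets a prompt like "The total revenue across the retrieved reports is 2,150,000. Answer the user's question using this figure", which it handles reliably, unlike the arithmetic itself.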
Knowing which questions your system should refuse to answer is as important as answering the rest well.
The Uncomfortable Truth
Most RAG systems we see were built in a week and have been in production for a year without meaningful improvement, because nobody ever set up the evaluation infrastructure to tell whether changes were helping.
The first 80% of RAG quality is easy. The next 10% takes real work. The last 10% is a research project. Most teams do not need the last 10%, but most teams also stop short of the 80% they could hit with a week of focused retrieval work. The same pattern shows up in AI integrations more broadly and in agents specifically.
The sequence that tends to produce the biggest wins, in our experience:
- Fix chunking so that chunks are semantically coherent.
- Turn on hybrid (vector + keyword) retrieval.
- Add a reranker.
- Use metadata filtering.
- Build evaluation so you can measure what you are doing.
This is not exciting work. It is where the quality actually lives.
Have a RAG system that works well in demos and not in production? Get in touch — this is exactly the shape of problem we enjoy, and it sits firmly in our AI & Automation services.