475 Cumulus
Guide

When not to use RAG

RAG is the default answer for every AI feature — but often the wrong one. A decision guide for engineering leaders scoping retrieval, tools, and middleware.

Tags: rag, architecture, integration

Retrieval-augmented generation (RAG) is the default prescription for "add AI to our product." Index the docs, embed the chunks, wire a chat UI — demo in a week.

That pattern works when the problem is genuinely open-ended Q&A over a large, changing corpus. It fails — expensively — when the real need is context assembly, live data access, structured output, or a deterministic lookup dressed up as search.

This guide is the decision layer before you stand up a vector database. If RAG is the right fit, see "RAG without the platform rewrite" for how to integrate it without a platform migration.

Eight situations where RAG is the wrong tool

1. The data is already in the request

If your UI already has what the model needs — the ticket thread, selected record, form fields, dashboard state — pass it in the prompt. Context assembly from product state is not RAG.

Example: "Summarize this support thread." The thread is in the DOM or loaded via your existing ticket API. Chunking and embedding it adds latency, ops burden, and sync problems — for no retrieval benefit.

Use instead: Server-side context builder that formats known entities into the prompt, scoped by RBAC.
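
A minimal sketch of such a context builder, under stated assumptions: the Message type, the build_thread_context function, and the rule that only an "agent" role sees internal notes are all hypothetical, chosen for illustration. The point is that the thread is already loaded, so the only work is filtering and formatting.

```python
from dataclasses import dataclass


@dataclass
class Message:
    author: str
    body: str
    internal: bool = False  # hypothetical flag: internal notes are staff-only


def build_thread_context(messages: list[Message], viewer_role: str) -> str:
    """Format an already-loaded ticket thread into prompt context.

    Assumed RBAC rule (illustrative only): the "agent" role may see
    internal notes, every other role may not.
    """
    visible = [m for m in messages if viewer_role == "agent" or not m.internal]
    lines = [f"{m.author}: {m.body}" for m in visible]
    return "Summarize this support thread.\n\n" + "\n".join(lines)


thread = [
    Message("customer", "My invoice is wrong."),
    Message("agent", "Escalating to billing.", internal=True),
]
prompt = build_thread_context(thread, viewer_role="customer")
```

No index, no embeddings, no sync job: the permission check runs on the server before anything reaches the model.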

2. The answer is a deterministic lookup

When the user asks for order #48291, a specific account, or a policy tagged enterprise-refund, you already know which table or API to hit. That is a query, not semantic search.

Use instead: SQL with tenant filters, your existing search index, or a direct API call, followed by an optional LLM step only if you need natural-language formatting.
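
As a rough illustration, the order lookup might look like the following. The orders table, its columns, and the get_order helper are assumptions made for this sketch, and an in-memory SQLite database stands in for whatever store the product already uses.

```python
import sqlite3


def get_order(conn: sqlite3.Connection, tenant_id: str, order_id: int):
    """Fetch one order, always scoped to the caller's tenant."""
    row = conn.execute(
        "SELECT id, status, total_cents FROM orders "
        "WHERE tenant_id = ? AND id = ?",
        (tenant_id, order_id),  # parameterized: no string interpolation
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "status": row[1], "total_cents": row[2]}


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, tenant_id TEXT, status TEXT, total_cents INTEGER)"
)
conn.execute("INSERT INTO orders VALUES (48291, 'acme', 'shipped', 12900)")

order = get_order(conn, "acme", 48291)
```

The tenant filter lives in the query itself, so a request from another tenant for the same order number returns nothing rather than relying on the model to withhold it.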

3. You need live system state, not documents

RAG over static docs goes stale the moment inventory, balances, permissions, or deployment status change. Documents describe the world; they do not authoritatively represent it.

Example: "Can this user access feature X right now?" The answer lives in your auth service — not in a PDF indexed last Tuesday.

Use instead: Tool-calling or middleware that fetches live state from your APIs before (or instead of) generation. See "Build an agent with LangChain" for the tool pattern.
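
The tool pattern can be sketched without any framework. Everything below is an assumption for illustration: the can_access function, the in-memory entitlement table standing in for a live auth service, and the {"name", "arguments"} shape of the tool call. It is not LangChain's API.

```python
from typing import Any, Callable

# Stand-in for a live auth service; in production this would be an API call.
_ENTITLEMENTS = {("user-1", "feature-x"): True}


def can_access(user_id: str, feature: str) -> bool:
    """Return the current entitlement, read at request time."""
    return _ENTITLEMENTS.get((user_id, feature), False)


# Registry of tools the model is allowed to request (hypothetical).
TOOLS: dict[str, Callable[..., Any]] = {"can_access": can_access}


def run_tool_call(call: dict) -> Any:
    """Execute a model-requested tool call of the form {"name", "arguments"}."""
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


result = run_tool_call(
    {"name": "can_access", "arguments": {"user_id": "user-1", "feature": "feature-x"}}
)
```

The answer comes from the system of record at the moment of the question, so there is nothing to re-index when permissions change.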

4. The corpus is small and well-structured

Fifty help articles with clear titles, tags, and metadata rarely need embeddings on day one. Keyword search plus filters often matches user intent well enough — with far less infrastructure.

Use instead: Full-text search (Postgres tsvector, Elasticsearch, your existing search product) with metadata filters. Add vectors only when evals show a semantic gap.

Retrieval strategy spectrum (Structured → Hybrid → Vector)

  • Structured. Effort: low (SQL filters, API lookups). Best when: known queries, tabular data.
  • Hybrid. Effort: medium (full-text + filters). Best when: docs + metadata search.
  • Vector. Effort: higher (embeddings + rerank). Best when: semantic match at scale.

Start left. Move right when structured retrieval stops working, not before.

5. The task is classification, extraction, or routing

"Route this ticket to Billing," "extract entities from this contract," or "label this email as urgent" are structured output problems. The model maps input → schema. No document retrieval required.

Use instead: One LLM call with a typed schema, validation, and golden-set evals — the same bar as any production API.
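
A sketch of the schema-plus-validation step, assuming a ticket-routing task. The Routing type, the allowed queue names, and fake_llm (a placeholder for a real model call instructed to return JSON) are hypothetical.

```python
import json
from dataclasses import dataclass

ALLOWED_QUEUES = {"billing", "technical", "account"}  # assumed queue names


@dataclass(frozen=True)
class Routing:
    queue: str
    urgent: bool


def parse_routing(raw: str) -> Routing:
    """Validate model output against the schema; reject anything else."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    queue = data.get("queue")
    urgent = data.get("urgent")
    if queue not in ALLOWED_QUEUES:
        raise ValueError(f"invalid queue: {queue!r}")
    if not isinstance(urgent, bool):
        raise ValueError("urgent must be a boolean")
    return Routing(queue=queue, urgent=urgent)


def fake_llm(ticket_text: str) -> str:
    """Placeholder for a real model call that returns JSON."""
    return '{"queue": "billing", "urgent": false}'


routing = parse_routing(fake_llm("I was charged twice this month."))
```

Because the output is a typed value, a golden set can compare it with exact equality, the same way any other API response would be tested.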

6. Latency and cost budgets are tight

RAG adds steps before every generation: embed the query (if semantic), search, rank, trim, attach citations — then call the LLM. For high-volume, low-margin actions (inline suggestions on every field change, autocomplete on every keystroke), that stack rarely fits.

Use instead: Smaller models, caching repeated queries, precomputed summaries, or rules where quality allows.
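
Caching repeated queries can be as small as the sketch below. The expensive_suggestion function is a hypothetical stand-in for a model call, and the CALLS counter exists only to show how often it actually runs.

```python
from functools import lru_cache

CALLS = {"count": 0}  # instrumentation for the demo only


def expensive_suggestion(normalized: str) -> str:
    """Stand-in for a slow, billable model call."""
    CALLS["count"] += 1
    return f"suggestion for: {normalized}"


@lru_cache(maxsize=1024)
def _cached(normalized: str) -> str:
    return expensive_suggestion(normalized)


def suggest(query: str) -> str:
    # Normalize case and whitespace so trivially different inputs
    # share one cache entry.
    return _cached(" ".join(query.lower().split()))


first = suggest("Refund  Policy")
second = suggest("refund policy")
```

Both calls resolve to the same normalized key, so the expensive function runs once. An in-process cache like this is the simplest case; a shared cache with expiry would follow the same shape.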

7. You cannot enforce auth at retrieval time

RAG over a shared document index without per-chunk ACLs is a data leak waiting to happen. Prompt instructions like "only use data for this tenant" are not a security boundary.

If tenant isolation requires messy workarounds — shared Pinecone namespace, unclear doc ownership, no row-level security on the source — fix data access first or do not ship RAG.

Use instead: Per-tenant fetch through APIs and databases your app already permission-checks. Retrieval middleware runs after auth, not instead of it.
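
One way to picture "retrieval runs after auth": the tenant ID comes from the authenticated session, never from the prompt, and it narrows the candidate set before any matching happens. The Doc type, the DOCS list, and the naive term matching are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str


# Hypothetical shared store holding documents for two tenants.
DOCS = [
    Doc("acme", "Acme refund window is 30 days."),
    Doc("globex", "Globex refund window is 14 days."),
]


def retrieve(session_tenant_id: str, query: str) -> list[str]:
    """Return matching text, restricted to the authenticated tenant."""
    # The filter uses the session's tenant, applied before matching,
    # so other tenants' documents are never candidates.
    allowed = [d for d in DOCS if d.tenant_id == session_tenant_id]
    terms = query.lower().split()
    return [d.text for d in allowed if any(t in d.text.lower() for t in terms)]


results = retrieve("acme", "refund window")
```

Even a query that names another tenant cannot surface that tenant's content, because the filter is enforced in code rather than requested in the prompt.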

8. "Chat over all our data" with no workflow

A floating widget that answers anything about the company sounds impressive in a sales call. In production it means vague eval criteria, out-of-scope questions, citation UX nobody built, and leadership expecting an internal Google.

Use instead: One embedded workflow — ticket detail, admin console, onboarding step — with a defined question set, required citations, and a refusal path when context is missing.

A simple decision framework

Walk through these in order before adding retrieval infrastructure:

  1. Is the answer already in the request or one API call away? → Context assembly or direct fetch. No RAG.
  2. Is it fuzzy search over a large, changing corpus? → Consider retrieval — start structured, not vectors.
  3. Does it require live, authoritative state? → Tools or API fetch, not a doc index.
  4. Is it structured output (classify, extract, route)? → Schema + one LLM call.
  5. Only if step 2 applies and simpler search fails evals → Add RAG (hybrid first, then embeddings if there is a measured need).

User question
    ├─ Context in UI/API already? ───────────────► Assemble prompt (no RAG)
    ├─ Deterministic lookup? ────────────────────► SQL / search / API
    ├─ Live state required? ─────────────────────► Tool-calling / API fetch
    ├─ Structured output? ───────────────────────► Schema + generate
    └─ Fuzzy match over large corpus?
           ├─ Structured/hybrid search enough? ──► Full-text + filters
           └─ Evals show semantic gap? ──────────► RAG (vectors + citations)
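
The tree above can be written as a small function, which is handy for design reviews. This is a sketch: the Feature fields and choose_pattern are hypothetical names, and the final fallback string is an added assumption for inputs the tree does not cover.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    context_in_request: bool = False
    deterministic_lookup: bool = False
    needs_live_state: bool = False
    structured_output: bool = False
    fuzzy_large_corpus: bool = False
    simple_search_passes_evals: bool = True


def choose_pattern(f: Feature) -> str:
    """Walk the branches in the same order as the tree."""
    if f.context_in_request:
        return "assemble prompt (no RAG)"
    if f.deterministic_lookup:
        return "SQL / search / API"
    if f.needs_live_state:
        return "tool-calling / API fetch"
    if f.structured_output:
        return "schema + generate"
    if f.fuzzy_large_corpus:
        if f.simple_search_passes_evals:
            return "full-text + filters"
        return "RAG (vectors + citations)"
    # Not part of the original tree: nothing matched, so rescope.
    return "no pattern matched; revisit the scope"


summary_feature = choose_pattern(Feature(context_in_request=True))
```

Order matters: a feature that has its context in the request never reaches the retrieval branches, even if a large corpus also exists.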

Anti-patterns we see in audits

| Anti-pattern | What actually happened | Better path |
| --- | --- | --- |
| "RAG all the things" | Indexed every Confluence page; 40% retrieval noise; slow p95 | One workflow, one source, eval before expand |
| POC chat widget | Demo worked; no tenant filters; blocked by security review | Server retrieval after auth; citations in UI |
| Vector DB first | Pinecone before proving keyword search insufficient | Structured → hybrid → vectors when evals fail |
| Docs instead of APIs | Embedded CRM export; stale within hours | Live get_account() tool with RBAC |
| RAG for routing | Retrieval over ticket history to pick queue | Classifier with schema + fallback rules |

These are not hypothetical — they are why "we already tried AI" teams come to integration work skeptical of another platform bet.

Prove you need retrieval before you buy it

Before committing to embeddings, chunking strategy, and re-index pipelines:

  1. Define 20–30 golden questions from real user workflows — not brainstormed edge cases.
  2. Run structured retrieval (SQL, API, full-text) and score hit rate manually.
  3. Only add semantic search when structured approaches miss more than 20% of the golden set and those misses block the feature.
  4. Ship a thin slice — one workflow, one source, feature flag, logging — before expanding.
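
Scoring the golden set and applying the 20% threshold from step 3 takes only a few lines. In this sketch the function names and the sample scores are hypothetical; the hit or miss values would come from the manual review in step 2.

```python
def miss_rate(results: dict[str, bool]) -> float:
    """Fraction of golden questions that structured retrieval missed."""
    if not results:
        raise ValueError("golden set is empty")
    misses = sum(1 for hit in results.values() if not hit)
    return misses / len(results)


def needs_semantic_search(
    results: dict[str, bool], blocking_misses: bool, threshold: float = 0.20
) -> bool:
    """Both conditions must hold: misses exceed the threshold AND block the feature."""
    return miss_rate(results) > threshold and blocking_misses


# Made-up example: 25 golden questions, 19 hits and 6 misses.
scores = {f"q{i}": True for i in range(1, 20)}
scores.update({f"q{i}": False for i in range(20, 26)})

rate = miss_rate(scores)
decision = needs_semantic_search(scores, blocking_misses=True)
```

With 6 misses out of 25, the miss rate is 24%, which clears the threshold. If those misses did not block the feature, the answer would still be to stay with structured retrieval.
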

Thin vertical slice rollout

  • Week 1–2, one workflow: define the user question, pick one data source, server retrieval.
  • Week 3–4, harden: logging and evals, citation UI, feature flag.
  • Week 5+, expand: more sources, hybrid search, vectors if needed.

Ship one end-to-end path before adding data sources or infrastructure.

Eval questions that expose "RAG wasn't needed"

  • Could we answer 80%+ of golden questions with a single API call if we knew the entity ID?
  • Are "misses" actually out-of-scope questions we should refuse?
  • Is latency dominated by retrieval or generation? (Sometimes the LLM is the wrong cost center entirely.)
  • Would a cached summary of stable docs beat per-request retrieval?

What to use instead (quick reference)

| Situation | Pattern |
| --- | --- |
| Context already in UI | Server-side context assembly |
| Known entity lookup | SQL / existing search |
| Live system state | Tool-calling or API fetch |
| Small structured KB | Full-text + metadata filters |
| Classify / extract / route | Structured output + validation |
| High volume, simple task | Smaller model, cache, rules |
| Multi-step actions + optional docs | Agent with tools; RAG as one tool, not the architecture |

When RAG is the right call

Use RAG when all of these hold:

  • Users ask natural-language questions where the relevant source is not obvious upfront
  • The corpus is large enough that loading everything into the prompt is impossible
  • Content changes and must stay grounded with citations
  • You can enforce permissions at fetch time — same bar as the rest of your product
  • Simpler retrieval has failed evals on a representative golden set

That is a narrower box than vendor marketing suggests — and that is the point. Integration judgment is choosing the smallest correct pattern, not the most impressive demo.

The integration mindset

Saying "don't use RAG yet" is not anti-AI. It is how eng teams ship reliably: middleware first, workflow-bound features, evals before infrastructure, and complexity only when metrics justify it.

The teams that win treat retrieval like any other system component — scoped, observable, and reversible — not a platform migration sold as a chat widget.


Scoping a feature and unsure whether RAG belongs in the architecture? Describe the workflow — stack, auth model, and data sources — and we will map the smallest pattern that actually fits, with an honest read on when retrieval is worth the ops cost.