475 Cumulus
Article · Updated May 31, 2026

RAG without the platform rewrite

How to add retrieval over your existing data without standing up a separate vector platform or pausing the product roadmap.

Tags: rag, architecture

Retrieval-augmented generation (RAG) is often sold as a new platform decision: pick a vector database, build an ingestion pipeline, deploy a separate search service, then wire a chat UI on top.

For most product teams, that is the wrong framing. You already have databases, APIs, search indexes, and authorization. RAG should plug into those boundaries — not replace them.

Integrated RAG vs. separate platform

Separate platform (common pitch)

  1. New vector DB
  2. Ingestion pipeline
  3. Separate search API
  4. Detached chat UI

Integrated path (recommended)

  • Your app + auth: existing session
  • Retrieval middleware: SQL, APIs, search you own
  • LLM + citations in UI: embedded in product views

The integrated path reuses auth, data access, and deployment you already operate.

Why the separate-platform pitch is tempting

Vendors bundle vector storage, chunking, and a chat widget because it is easy to demo. For a greenfield project, that can be fine. For an existing product with paying customers, it creates problems:

  • Duplicate auth: a sidecar search service does not know your tenant model
  • Stale data: another pipeline to keep in sync with your source of truth
  • Detached UX: users live in your app; a floating chat widget fights your workflow design
  • Ops overhead: another system to monitor, secure, and staff an on-call rotation for

Integrated RAG treats retrieval as middleware in your application — same deployment, same identity, same observability.

Start from the user workflow

Before choosing Pinecone, pgvector, or Elasticsearch, define the feature in product terms:

  • What question is the user trying to answer?
  • What data do they already have permission to see?
  • Where in the UI does the answer need to appear — inline, sidebar, modal, or action suggestion?

The retrieval layer should assemble context from sources your app already trusts: Postgres rows, document metadata, CRM records, ticket history, internal APIs — scoped per user and tenant.

A concrete example

A support copilot embedded in a ticket view should retrieve: the current ticket thread, the customer's plan tier, relevant help articles, and recent similar resolved tickets. It should not search the entire knowledge base without tenant filters or return answers without citations your agents can verify.
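As a rough illustration of the ticket-view example, the sketch below assembles the four kinds of context for one tenant. The `ContextItem` type, the `RECORDS` list, and `build_ticket_context` are hypothetical names, and the in-memory list stands in for whatever stores your app already queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    kind: str        # "thread", "plan", "article", or "resolved_ticket"
    source_id: str   # ID the UI can link back to as a citation
    tenant_id: str
    text: str


# Hypothetical in-memory stand-in for data the app already owns.
RECORDS = [
    ContextItem("thread", "T-101", "acme", "Customer cannot export CSV."),
    ContextItem("plan", "PLAN-acme", "acme", "Plan tier: Pro"),
    ContextItem("article", "KB-7", "acme", "CSV export is enabled per workspace."),
    ContextItem("resolved_ticket", "T-088", "acme", "Export fixed by re-enabling the flag."),
    ContextItem("article", "KB-9", "globex", "Globex-only runbook."),
]

WANTED_KINDS = ("thread", "plan", "article", "resolved_ticket")


def build_ticket_context(tenant_id: str) -> list[ContextItem]:
    """Return context for a ticket view, filtered by tenant before anything else."""
    scoped = [r for r in RECORDS if r.tenant_id == tenant_id]
    # Keep a stable order so the prompt and the citation list line up.
    return [r for kind in WANTED_KINDS for r in scoped if r.kind == kind]


context = build_ticket_context("acme")
print([c.source_id for c in context])  # ['T-101', 'PLAN-acme', 'KB-7', 'T-088']
```

The tenant filter runs first, so the other tenant's article never becomes a candidate, and every item carries a `source_id` that can be rendered as a citation.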

Middleware owns retrieval

Retrieval belongs on the server — after authentication, before the model call. A typical flow:

RAG retrieval flow (server-side)

  1. Authenticate: session / JWT
  2. Fetch: DB, APIs, docs
  3. Rank & trim: fit context window
  4. Prompt + call: with citations
  5. Render: answer + sources in UI

Retrieval runs after auth; never trust the client to assemble context.

The middleware layer should:

  1. Authenticate the request using your existing session or token
  2. Fetch candidate context from stores the user can access
  3. Rank and trim to fit the model's context window — quality over quantity
  4. Attach citations the UI can render — source IDs, links, or snippets
  5. Log what was retrieved for debugging bad answers

Keeping retrieval server-side prevents clients from bypassing permission checks, makes caching straightforward, and gives support a trail when something goes wrong.
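The five middleware steps can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `SESSIONS`, `CHUNKS`, `RETRIEVAL_LOG`, and `retrieve` are hypothetical names, the relevance scores are hard-coded, and a character budget stands in for a real token count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str
    tenant_id: str
    text: str
    score: float  # relevance score from whatever ranker you use


# Hypothetical stand-ins: a session table and a document store.
SESSIONS = {"token-abc": {"user_id": "u1", "tenant_id": "acme"}}
CHUNKS = [
    Chunk("KB-1", "acme", "Refunds are processed within 5 days.", 0.9),
    Chunk("KB-2", "acme", "Invoices are emailed monthly.", 0.4),
    Chunk("KB-3", "globex", "Globex refund policy.", 0.95),
]
RETRIEVAL_LOG: list[dict] = []


def retrieve(token: str, question: str, max_chars: int = 60) -> dict:
    # 1. Authenticate with the existing session; reject unknown tokens.
    session = SESSIONS.get(token)
    if session is None:
        raise PermissionError("unknown session")

    # 2. Fetch only what this tenant may see.
    candidates = [c for c in CHUNKS if c.tenant_id == session["tenant_id"]]

    # 3. Rank by score, then trim to a character budget (a crude proxy for tokens).
    kept, used = [], 0
    for chunk in sorted(candidates, key=lambda c: c.score, reverse=True):
        if used + len(chunk.text) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk.text)

    # 4. Attach citations the UI can render.
    citations = [{"source_id": c.source_id, "snippet": c.text} for c in kept]

    # 5. Log what was retrieved so bad answers can be debugged later.
    RETRIEVAL_LOG.append({"user_id": session["user_id"], "question": question,
                          "source_ids": [c.source_id for c in kept]})
    return {"context": [c.text for c in kept], "citations": citations}


result = retrieve("token-abc", "How long do refunds take?")
print([c["source_id"] for c in result["citations"]])
```

The model call itself is left out on purpose: whatever client you use would receive `result["context"]`, and the UI would receive `result["citations"]`.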

You do not need a greenfield vector stack on day one

Vector search helps at scale, especially for semantic matching over large unstructured corpora. Many integrations start simpler:

  • Structured retrieval — SQL with filters (tenant_id, status, date range)
  • API composition — aggregate context from services you already call
  • Full-text search — Elasticsearch, Postgres tsvector, or your existing search product
  • Hybrid — metadata filters plus keyword search before adding embeddings

Retrieval strategy spectrum

  • Structured (effort: low): SQL filters, API lookups. Best when: known queries, tabular data.
  • Hybrid (effort: medium): full-text + filters. Best when: docs + metadata search.
  • Vector (effort: higher): embeddings + rerank. Best when: semantic match at scale.

Start with structured retrieval. Move toward vectors when structured retrieval stops working, not before.
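To make the structured end of the spectrum concrete, here is a small sketch using Python's built-in `sqlite3` module. The `tickets` table, its columns, and the `recent_resolved` helper are assumptions for illustration; the same shape applies to Postgres or any other SQL store you already run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id TEXT, tenant_id TEXT, status TEXT, created_at TEXT, subject TEXT)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?, ?)",
    [
        ("T-1", "acme", "resolved", "2026-05-01", "CSV export fails"),
        ("T-2", "acme", "open", "2026-05-10", "Login loop"),
        ("T-3", "acme", "resolved", "2025-01-15", "Old billing issue"),
        ("T-4", "globex", "resolved", "2026-05-02", "CSV export fails"),
    ],
)


def recent_resolved(conn, tenant_id, since, limit=5):
    """Structured retrieval: tenant, status, and date filters as bound parameters."""
    return conn.execute(
        """
        SELECT id, subject FROM tickets
        WHERE tenant_id = ? AND status = 'resolved' AND created_at >= ?
        ORDER BY created_at DESC
        LIMIT ?
        """,
        (tenant_id, since, limit),
    ).fetchall()


print(recent_resolved(conn, "acme", "2026-01-01"))  # [('T-1', 'CSV export fails')]
```

The tenant filter lives in the query rather than in the prompt, and ISO-formatted date strings compare correctly as text, which keeps the example free of date parsing.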

When to add embeddings

Consider vectors when:

  • Users ask questions that do not match document titles or keywords
  • Your corpus is large enough that brute-force fetch is too slow or expensive
  • You have eval data showing structured retrieval misses too often

Defer vectors when:

  • Most queries map to known entities (accounts, orders, projects)
  • Your content is already well-structured with metadata
  • Team bandwidth is limited — embeddings add indexing, re-embedding on change, and reranking complexity

Citations are not optional

For B2B products, "the AI said so" is not acceptable. Citations build trust, help users verify answers, and give support a starting point for escalations.

Good citation UX:

  • Links or IDs back to source records in your product
  • Snippets that match what was actually sent to the model
  • Clear distinction when no relevant context was found — refuse or ask clarifying questions instead of guessing
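One possible way to enforce these three properties on the server is sketched below. The `Citation` fields, the `/kb/...` link format, and `build_response` are hypothetical; the point is that snippets are checked against the context that was actually sent, and an empty citation list produces a refusal instead of an answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_id: str  # record ID in the product
    url: str        # deep link back to the source record
    snippet: str    # exact text that was sent to the model


def build_response(answer: str, citations: list[Citation], sent_context: list[str]) -> dict:
    """Return an answer only when it is backed by context that was actually sent."""
    if not citations:
        # No relevant context: refuse or ask for detail rather than let the model guess.
        return {"status": "no_context",
                "message": "I could not find a relevant source. Can you add more detail?",
                "citations": []}
    # Every snippet shown to the user must appear in the context sent to the model.
    for c in citations:
        if not any(c.snippet in chunk for chunk in sent_context):
            raise ValueError(f"snippet for {c.source_id} was not in the model context")
    return {"status": "ok", "message": answer,
            "citations": [{"id": c.source_id, "url": c.url, "snippet": c.snippet}
                          for c in citations]}


response = build_response(
    "Refunds take up to 5 days.",
    [Citation("KB-1", "/kb/KB-1", "within 5 days")],
    ["Refunds are processed within 5 days."],
)
print(response["status"], response["citations"][0]["id"])
```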

Common failure modes

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Wrong tenant context | Cross-customer data leakage | Enforce tenant filter at fetch time, never in prompt alone |
| Stale documents | Outdated policy answers | Tie retrieval to source version; surface "last updated" in UI |
| Over-retrieval | Slow responses, high cost | Rank aggressively; cap chunks per source |
| Under-retrieval | Hallucinated fill-in | Eval retrieval hit rate; expand sources incrementally |
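The over-retrieval mitigation, ranking aggressively and capping chunks per source, might look like the sketch below. The `cap_per_source` helper, its default limits, and the `(source_id, score, text)` tuple shape are assumptions chosen for illustration.

```python
from collections import defaultdict


def cap_per_source(chunks, max_per_source=2, max_total=4):
    """chunks: iterable of (source_id, score, text). Keep top-scored, capped per source."""
    kept, per_source = [], defaultdict(int)
    for source_id, score, text in sorted(chunks, key=lambda c: c[1], reverse=True):
        if len(kept) >= max_total:
            break  # overall budget reached
        if per_source[source_id] >= max_per_source:
            continue  # this source already contributed its share
        kept.append((source_id, score, text))
        per_source[source_id] += 1
    return kept


candidates = [
    ("KB-1", 0.9, "a"), ("KB-1", 0.8, "b"), ("KB-1", 0.7, "c"),
    ("KB-2", 0.6, "d"), ("KB-3", 0.5, "e"), ("KB-3", 0.4, "f"),
]
print([source_id for source_id, _, _ in cap_per_source(candidates)])
```

A per-source cap keeps one long document from crowding out every other source, which tends to help both cost and answer quality.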

Ship a thin vertical slice

The biggest mistake is boiling the ocean: indexing every document, supporting every question type, and launching a standalone chat. Instead:

Thin vertical slice rollout

Week 1–2: One workflow

  • Define user question
  • Pick one data source
  • Server retrieval

Week 3–4: Harden

  • Logging & evals
  • Citation UI
  • Feature flag

Week 5+: Expand

  • More sources
  • Hybrid search
  • Vectors if needed

Ship one end-to-end path before adding data sources or infrastructure.

Pick one workflow, one primary data source, one UI surface. Get it behind a feature flag with logging and evals. Measure answer quality and latency with real users. Then expand retrieval sources and add semantic search only when the data proves you need it.

Eval questions for your first slice

  • Does the answer cite the right source 80%+ of the time on a golden set?
  • What happens when no relevant context exists?
  • What is p95 latency end-to-end — retrieval plus generation?
  • What does it cost per successful resolution at current traffic?
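Three of these questions reduce to simple arithmetic over logged data. The helpers below are a sketch under assumed input shapes (a golden set mapping each question to its expected source ID, a list of end-to-end latencies in milliseconds, and a total spend figure); none of the names come from a particular eval framework.

```python
def citation_hit_rate(golden, results):
    """golden: {question: expected_source_id}; results: {question: [cited_source_ids]}."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for q, expected in golden.items() if expected in results.get(q, []))
    return hits / len(golden)


def p95(latencies_ms):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def cost_per_resolution(total_cost_usd, resolved_count):
    """Spend divided by successful resolutions; None when nothing was resolved."""
    return total_cost_usd / resolved_count if resolved_count else None


golden = {"q1": "KB-1", "q2": "KB-2", "q3": "KB-3", "q4": "KB-4", "q5": "KB-5"}
results = {"q1": ["KB-1"], "q2": ["KB-2", "KB-9"], "q3": ["KB-3"], "q4": ["KB-4"], "q5": ["KB-8"]}
latencies = [120, 200, 150, 900, 180, 210, 170, 160, 190, 140]

print(citation_hit_rate(golden, results))  # 0.8
print(p95(latencies))                      # 900
print(cost_per_resolution(12.0, 4))        # 3.0
```

With only ten samples the p95 is simply the slowest request, which is a reminder to collect enough traffic before reading much into the number.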

Operating RAG in production

RAG systems decay as content changes. Plan for:

  • Re-indexing or refresh when source documents update
  • Retrieval regression tests when you add new data sources
  • Dashboards for retrieval latency, chunk count, and empty-result rate
  • Feedback loops — thumbs down should tag the retrieval set for review
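Two of these signals, empty-result rate and thumbs-down review tagging, can be derived from the retrieval log alone. The log and feedback shapes below are hypothetical; adapt the field names to whatever your logging already records.

```python
def empty_result_rate(log):
    """log: list of dicts, each with a 'source_ids' list for one request."""
    if not log:
        return 0.0
    empty = sum(1 for entry in log if not entry["source_ids"])
    return empty / len(log)


def retrieval_sets_for_review(log, feedback):
    """Return the retrieval set of every request that received a thumbs down.

    feedback: {request_id: "up" | "down"} (hypothetical shape).
    """
    return {entry["request_id"]: entry["source_ids"]
            for entry in log if feedback.get(entry["request_id"]) == "down"}


log = [
    {"request_id": "r1", "source_ids": ["KB-1", "KB-2"]},
    {"request_id": "r2", "source_ids": ["KB-7"]},
    {"request_id": "r3", "source_ids": []},
    {"request_id": "r4", "source_ids": ["KB-3"]},
]
feedback = {"r1": "up", "r2": "down"}

print(empty_result_rate(log))                    # 0.25
print(retrieval_sets_for_review(log, feedback))  # {'r2': ['KB-7']}
```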

This is ongoing product operations, not a one-time integration project.

The integration mindset

RAG without the platform rewrite means: use your auth, your data access patterns, your deployment pipeline, and your UI. Add retrieval middleware and citations. Grow complexity only when measured need appears.


Want help scoping RAG for your stack? Get in touch with your auth model, data sources, and target workflow — we will map a thin-slice plan you can ship without pausing the roadmap.