475 Cumulus
GuideUpdated May 31, 2026

What production-ready LLM integration actually means

A practical checklist for engineering leaders — beyond the demo and before you call an AI feature shipped.

integrationmiddlewareobservability

Most teams can get a copilot demo running in a week. Far fewer can answer what happens when the model hallucinates in front of a paying customer, when OpenAI rate-limits you during peak traffic, or when legal asks who accessed what context.

Production-ready LLM integration means your AI layer behaves like any other critical system in your stack — observable, permissioned, and designed to fail gracefully. The model is one component. The integration is the product.

Demo vs. production — same feature, different bar
AreaWeek 1 demoProduction-ready
Auth & permissionsOpen to internal testersRoles enforced server-side
ObservabilityConsole logsTracing, cost, eval dashboards
Failure handlingRetry until it worksFallbacks, timeouts, user messaging
RolloutShip to everyoneFlags → canary → full release
Cost controlUnmetered dev keysPer-tenant budgets & routing

The gap is rarely model quality. It is everything wrapped around the model call.

The gap is not the model

Engineering leaders often evaluate AI features on output quality in a sandbox. That is necessary but insufficient. Production readiness is defined by everything that wraps the model call: identity, data boundaries, cost, failure behavior, and how you roll out changes without surprising customers or your on-call rotation.

A useful framing: week one optimizes for the best possible answer on a curated example. Production optimizes for predictable behavior across messy real inputs — including when the model is wrong, slow, or unavailable.

Auth and permissions

The feature should respect the same roles and permissions as the rest of your product. If a user cannot export billing data in the UI, the copilot should not be able to either — even if the model "figures it out" from context.

That usually means:

  • Server-side middleware that enforces identity before any model call — never trust the client to assemble privileged context
  • Tenant-scoped context assembly — fetch only what the current user is allowed to see, not a broad dump of searchable content
  • Audit logging on tool calls and destructive actions, with the same retention and access controls as your other security logs

Tool-calling needs the same bar

When the model can invoke product APIs — update records, send messages, trigger workflows — those calls must go through your existing authorization layer. A common anti-pattern is exposing raw API keys or broad internal endpoints to the LLM layer. Instead, define a narrow tool surface with explicit permission checks per action.

Request flow through LLM middleware

Client UI

Copilot, search, actions

Your API

Existing auth session

middleware

LLM middleware

Auth, rate limits, logging

Model provider

OpenAI, Anthropic, etc.

Inject tenant-scoped context
Enforce tool permissions
Record tokens & latency

Every model call passes through your stack — not around it.

Observability from day one

You need to see latency, token cost, error rates, and output quality — per feature, per tenant, per model provider. Without this, you cannot answer finance when the bill spikes or product when quality regresses after a prompt change.

At minimum:

  • Structured logs with request IDs tied to your existing tracing (Datadog, Honeycomb, OpenTelemetry — whatever you already run)
  • Dashboards for p95 latency and cost per successful user action — not just per API call
  • Eval pipelines or golden-set checks before prompt and retrieval changes ship to production

What to measure beyond uptime

SignalWhy it matters
Tokens per successful actionUnit economics — copilot cost per resolved ticket, per search, per draft
Retrieval hit rateWhether RAG is finding the right context or hallucinating around gaps
Tool call failure rateAPI timeouts and permission denials surfaced to users
User override / dismiss rateProxy for trust — are people accepting AI output?

Failure modes are designed, not discovered

Provider outages, context window overflows, rate limits, and malformed tool calls will happen. Production-ready integration defines what the user sees in each case before launch — not in the first incident.

Design for:

  • Fallback responses when the primary model is unavailable — secondary provider, cached answer, or honest "try again" messaging
  • Timeouts with partial results where appropriate — a streaming draft that stops cleanly beats an infinite spinner
  • Human confirmation before irreversible actions — deletes, sends, purchases, permission changes

Prompt injection is an integration concern

You cannot prompt-engineer your way out of untrusted input. Production middleware should treat user content, retrieved documents, and third-party data as potentially adversarial. Patterns include input/output filtering, tool sandboxing, and separating system instructions from user-supplied context in the request structure.

Cost control and provider strategy

Unmetered dev API keys hide the real cost curve. Before GA, define:

  • Per-tenant or per-feature token budgets
  • Model routing — smaller models for classification, larger for generation
  • Caching for repeated queries and stable retrieval results
  • Alerts when daily spend exceeds threshold by tenant

Provider-agnostic design is not about avoiding OpenAI or Anthropic. It is about not rewriting product features when you switch or split traffic for cost, compliance, or failover.

Production readiness checklist
Server-side auth
Tenant-scoped context
Structured logging
Cost per action
Eval pipeline
Provider fallback
Feature flags
Audit on tool calls

Use this as a gate before calling an AI feature GA — not as a post-launch backlog.

Testing and evals

Traditional unit tests do not cover probabilistic outputs. Production teams still need regression gates:

  • Golden-set evals — fixed inputs with expected properties (contains citation, refuses out-of-scope request, calls correct tool)
  • CI checks on prompt changes — block deploy if eval score drops below threshold
  • Shadow mode — run new retrieval or prompt path alongside production, compare before cutover

This is not research-grade benchmarking. It is the same discipline you apply to search relevance or recommendation quality — a baseline that prevents silent regressions.

Incremental rollout

Ship behind feature flags. Canary to internal users first, then a percentage of tenants. Measure quality, cost, and support ticket volume before expanding — the same way you would for any high-risk product change.

Incremental rollout phases
Phase 1: InternalEng team + CS
Phase 2: Canary5–10% of tenants
Phase 3: Gradual25% → 50% → 100%
Phase 4: GADefault on

Measure quality, cost, and support load at each stage before expanding.

Questions to ask before GA

  1. Who gets paged when the copilot errors — and do they have a runbook?
  2. Can support see what context was retrieved for a bad answer?
  3. What is the kill switch — per tenant, per feature, global?
  4. How do you roll back a prompt change without redeploying the whole app?

Putting it together

Production-ready LLM integration is not a bigger model or a longer prompt. It is middleware, permissions, observability, failure design, and rollout discipline — shipped incrementally in your repo, on your terms.


If you are scoping an integration for your stack, describe the feature and we will map the architecture — API design, effort estimate, rollout strategy, and what production-ready means for your system.