Guide · June 8, 2026

Langfuse for LLM observability — where it fits in your middleware stack

How to trace model calls, debug prompts, and run evals with Langfuse — integrated into server-side LLM middleware, not bolted onto a frontend demo.

Topics: observability, middleware, integration, langfuse

Your copilot shipped. A customer says the answer was wrong. Product asks whether quality regressed after last week's prompt change. Finance asks why tenant ACME's token spend doubled.

If your only signal is console.log("LLM ok") and a provider invoice, you are flying blind. Generic APM tells you the route was slow. It does not tell you which prompt version ran, what context was retrieved, or why the model chose a particular tool.

That gap is what Langfuse addresses — an open-source LLM engineering platform for tracing, prompt versioning, scores, and eval datasets. It is not a model provider and not a replacement for your middleware. It is the observability layer your middleware writes to.

Figure: LLM observability stack. Requests pass from the product UI (copilot, search, actions) through your API (session + tenant context) and the LLM middleware to the model provider (OpenAI, Anthropic, etc.). The middleware emits four signals: traces (request spans, tool calls, retrieval), metrics (latency p95, error rate, queue depth), cost (tokens and spend by tenant / feature), and logs (structured events for audit and debug). These feed dashboards (Langfuse, Grafana, …) and alerts and budgets. All signals emit from middleware, the same boundary that owns auth, routing, and model calls.

What Langfuse is

Langfuse is an open-source platform for building and operating LLM features. The parts production teams use most:

Capability	What it gives you
Traces & observations	End-to-end view of a user request — model calls, tool invocations, retrieval steps, latency, token usage
Prompt management	Versioned prompts fetched at runtime, linked to the traces that used them
Scores & annotations	Human or automated quality labels on traces — thumbs down, hallucination flag, eval pass/fail
Datasets & eval runs	Golden inputs, regression runs before prompt or retrieval changes ship

Langfuse can run self-hosted (Docker, Kubernetes) or on Langfuse Cloud. Self-hosting keeps trace data in infrastructure you control, which matters when traces contain retrieved customer context or tool outputs.

Where it sits in the stack

Your architecture should still look like this:

1. Client UI → your authenticated API
2. LLM middleware: auth, rate limits, context assembly, model call
3. Model provider

Langfuse attaches at step 2. Every model call, retrieval query, and tool handler inside middleware emits structured observations. The client never sees Langfuse API keys.

Figure: Request flow through LLM middleware. Client UI (copilot, search, actions) → your API (existing auth session) → LLM middleware (auth, rate limits, logging) → model provider (OpenAI, Anthropic, etc.). The middleware injects tenant-scoped context, enforces tool permissions, and records tokens and latency. Every model call passes through your stack, not around it.

Think of the split this way:

Middleware enforces policy — who can call the model, what context they get, when to stop
Langfuse records what happened — inputs, outputs, cost, latency, prompt version — so engineers can debug and improve

This is the same separation you already run for databases: Postgres executes queries; Datadog or Honeycomb records them. Langfuse is the LLM-native equivalent — traces are structured around generations, spans, and sessions, not just HTTP status codes.

See LLM middleware explained for the full middleware pattern and What production-ready LLM integration actually means for the observability checklist.

What to trace (and what to tag)

The value of Langfuse is not "we enabled tracing." It is consistent metadata on every request so you can filter production issues in seconds.

At minimum, tag every trace with:

feature — copilot, search-assist, classifier, support-agent
tenantId — for cost and quality per customer
userId — hashed if policy requires; still useful for support escalations
sessionId or threadId — group multi-turn conversations
promptVersion — which managed prompt or template was active
model — provider + model ID actually routed to
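One way to keep these tags consistent is to build them from a single typed object instead of assembling dictionaries in each feature. The sketch below is a minimal illustration using only the standard library. The names `TraceMeta` and `hash_user_id`, the tag format, and the salt handling are assumptions made for this example, not part of the Langfuse SDK.

```python
import hashlib
from dataclasses import dataclass


def hash_user_id(user_id: str, salt: str) -> str:
    """Return a stable, non-reversible identifier (truncated SHA-256)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return digest[:16]


@dataclass(frozen=True)
class TraceMeta:
    """Hypothetical metadata schema shared by every LLM feature."""
    feature: str
    tenant_id: str
    user_id: str          # pass a hashed value if policy requires it
    session_id: str
    prompt_version: str
    model: str

    def __post_init__(self) -> None:
        # Refuse to build metadata with a missing dimension.
        for name, value in vars(self).items():
            if not value:
                raise ValueError(f"trace metadata field {name!r} is empty")

    def tags(self) -> list[str]:
        # Tags are used for filtering, so keep them short and predictable.
        return [f"feature:{self.feature}", f"tenant:{self.tenant_id}"]

    def metadata(self) -> dict[str, str]:
        return {"promptVersion": self.prompt_version, "model": self.model}
```

The middleware would construct one `TraceMeta` per request and pass `tags()` and `metadata()` to whatever tracing call it uses, so a missing tenant or prompt version fails loudly in tests rather than silently in production.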

For RAG features, add child spans for:

Retrieval query and filters applied
Document IDs or chunk references returned (not necessarily full text — redact per policy)
Whether the model cited retrieved content or went off-script

For agents and tool-calling, trace each tool invocation as a nested observation with permission outcome and latency. When something fails, you need to see which tool, which argument, which API error — not a generic "agent error."
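To make the per-tool record concrete, here is a small sketch of a wrapper that captures the permission outcome, latency, and error text for one tool call. It is independent of any tracing SDK. `ToolObservation` and `run_tool` are hypothetical names, and the allow-list check stands in for whatever permission model your middleware already enforces.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolObservation:
    tool: str
    arguments: dict
    permitted: bool
    latency_ms: float
    error: Optional[str] = None
    output: Any = None


def run_tool(
    name: str,
    handler: Callable[..., Any],
    arguments: dict,
    allowed_tools: set,
) -> ToolObservation:
    """Run one tool call and return a record of what happened."""
    if name not in allowed_tools:
        # Denied calls are recorded too; the handler is never invoked.
        return ToolObservation(name, arguments, permitted=False,
                               latency_ms=0.0, error="permission denied")
    start = time.perf_counter()
    try:
        output, error = handler(**arguments), None
    except Exception as exc:
        # Keep the exception type and message so the trace shows the real cause.
        output, error = None, f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000
    return ToolObservation(name, arguments, True, latency_ms, error, output)
```

Each returned record maps naturally onto one nested observation: the tool name, its arguments, whether it was allowed, how long it took, and the exact error if it failed.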

How to wire it into middleware

Langfuse ships official SDKs for Python and JavaScript/TypeScript; other runtimes can send traces over OpenTelemetry, which also fits if you already standardize on OTel. The integration point is always the same: your server-side middleware module, not the client.

1. Configure credentials (server-side only)

Store keys in your existing secrets manager or environment — never in frontend bundles or mobile apps:

LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com  # or your self-hosted URL

Initialize once at process startup (FastAPI, Django, Express, a Sidekiq worker, a Kubernetes pod — same idea everywhere).
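A startup check can catch missing or swapped keys before the first request rather than after traces go missing. The following is a minimal sketch. The function name `check_langfuse_env` is hypothetical, and the prefix checks are a heuristic based only on the `sk-lf-` and `pk-lf-` examples shown above.

```python
import os

REQUIRED_VARS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_BASE_URL")


def check_langfuse_env(env=os.environ) -> dict:
    """Fail fast at startup if tracing settings are missing or look swapped."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing Langfuse settings: {', '.join(missing)}")
    # Heuristic only: prefixes taken from the example values in this guide.
    if not env["LANGFUSE_SECRET_KEY"].startswith("sk-lf-"):
        raise RuntimeError("LANGFUSE_SECRET_KEY does not look like a secret key")
    if not env["LANGFUSE_PUBLIC_KEY"].startswith("pk-lf-"):
        raise RuntimeError("LANGFUSE_PUBLIC_KEY does not look like a public key")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling this once in the process entry point, before the tracing client is created, turns a misconfigured deploy into an immediate, readable failure.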

2. Trace the middleware boundary

Wrap the code path that already owns auth, context assembly, and the model call:

from langfuse import get_client, observe

langfuse = get_client()

@observe(name="copilot-chat")
def handle_copilot(user, message: str, feature: str = "copilot"):
    # Auth and rate limits already ran in the route handler.
    context = fetch_tenant_context(user.tenant_id, user.id)
    messages = build_messages(context, message)

    langfuse.update_current_trace(
        user_id=user.id,
        session_id=f"{user.tenant_id}:{user.thread_id}",
        tags=[feature, user.tenant_id],
        metadata={"promptVersion": "copilot-system-v3"},
    )

    return call_model(
        messages=messages,
        model=select_model(feature, user.tenant_id),
    )

The @observe decorator (or an explicit trace/span in SDKs without decorators) creates a trace around the middleware function. Tag tenant, feature, and session inside middleware — not in the UI. See Langfuse's SDK docs for other runtimes.

For RAG, add nested spans around retrieval before the generation:

@observe(name="retrieve-docs")
def retrieve_docs(query: str, tenant_id: str):
    chunks = vector_search(query, filters={"tenant_id": tenant_id})
    return [{"id": c.id, "score": c.score} for c in chunks]

For tool-calling agents, one observation per tool invocation — permission result, latency, and error message. See Build an agent with LangChain for the orchestration side; Langfuse attaches to the same server-side invoke path.

3. OpenTelemetry path (optional)

If your stack already emits OpenTelemetry spans — from an LLM client library, a custom instrumentation layer, or a provider SDK — add Langfuse's span processor to your OTel pipeline instead of hand-rolling every span:

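from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
# Import path as written in this guide; confirm it against your installed SDK version.
from langfuse.opentelemetry import LangfuseSpanProcessor

provider = TracerProvider()
provider.add_span_processor(LangfuseSpanProcessor())
# Register the provider globally so instrumented libraries actually use it.
trace.set_tracer_provider(provider)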

Spans from instrumented model calls flow into Langfuse automatically. Many teams use OTel when multiple services participate in one agent workflow and they want the same traces in both Langfuse and their existing Datadog or Grafana backend.

4. Link prompt versions to traces

When prompts live in Langfuse rather than hard-coded strings, fetch at runtime and record the version on the generation:

prompt = langfuse.get_prompt("copilot-system-v3")
compiled = prompt.compile(product_name="Acme")

with langfuse.start_as_current_observation(
    as_type="generation",
    name="completion",
    model="claude-sonnet-4",
    input=compiled,
) as generation:
    response = call_provider(compiled, user_message)
    generation.update(output=response, prompt=prompt)

When quality drops, filter traces by prompt version and compare latency, token cost, and scores — instead of guessing which deploy introduced the regression.
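The comparison itself is simple once traces carry a prompt version. As a rough sketch, assume you have exported trace records as dictionaries with `promptVersion`, `latency_ms`, `total_tokens`, and `score` keys. That record shape is an assumption for this example, not a Langfuse export format.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_prompt_version(traces):
    """Group exported trace records by prompt version and average key metrics."""
    groups = defaultdict(list)
    for record in traces:
        groups[record["promptVersion"]].append(record)
    return {
        version: {
            "count": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "mean_total_tokens": mean(r["total_tokens"] for r in rows),
            "mean_score": mean(r["score"] for r in rows),
        }
        for version, rows in groups.items()
    }
```

Placing two versions side by side this way shows whether a regression came with a latency or token change, or only a score change.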

5. Flush in short-lived workers

On serverless functions, Cloud Run, Lambda, or any process that exits immediately after the response, buffered trace data may never reach Langfuse unless you flush explicitly:

# At the end of the request handler, after middleware returns — not inside @observe
langfuse.flush()

Long-running services (Kubernetes pods, VM workers) usually flush on a background interval — but verify during load testing. Silent trace loss in serverless environments is one of the most common integration gaps.
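One way to make the flush hard to forget is to wrap the request handler so the flush runs in a `finally` block. The sketch below assumes only that the tracing client exposes a `flush()` method, as in the snippet above. `with_flush` and the `Flushable` protocol are hypothetical helpers, and the `event` dictionary stands in for whatever your platform passes to a handler.

```python
from typing import Any, Callable, Protocol


class Flushable(Protocol):
    def flush(self) -> None: ...


def with_flush(client: Flushable,
               handler: Callable[[dict], Any]) -> Callable[[dict], Any]:
    """Wrap a request handler so buffered traces are flushed before it returns."""
    def wrapped(event: dict) -> Any:
        try:
            return handler(event)
        finally:
            # Runs on success and on exceptions, so failed requests
            # still have their traces sent.
            client.flush()
    return wrapped
```

Because the wrapper sits outside the traced function, the flush happens after the trace has been closed, which matches the placement described above.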

Langfuse vs. your existing observability

If you already run Datadog, Honeycomb, Grafana, or Sentry, you might ask whether Langfuse is redundant.

Use both — for different questions.

Question	Generic APM	Langfuse
Is `/api/copilot` slow?	Yes — p95, error rate	Yes — but tied to model latency breakdown
Which prompt version caused the regression?	No	Yes
What context was retrieved for this answer?	No	Yes — if you instrument retrieval spans
Did the model call the right tool?	No	Yes — per-tool observations
Token cost per tenant this week?	Only if you built custom metrics	Built around generations

Langfuse also exports via OpenTelemetry, so spans can flow to an existing OTel collector if you want a single backend. Many teams run Langfuse for LLM-specific debugging and forward summary metrics to the platform finance and SRE already use. For cost dashboards, budgets, and unit economics, see Monitoring LLM costs in production. For the full stack beyond Langfuse (logging, OTel, evals, sampling, and tool choice), see LLM observability beyond Langfuse.

The anti-pattern is expecting Datadog alone to replace LLM-native tracing. You can log prompts to stdout, but you will not get prompt versioning, eval datasets, or annotation workflows without building them yourself.

Evals and production rollout

Tracing tells you what happened. Evals tell you whether you should ship the change.

A practical loop — the same one we run on client integrations:

Baseline — capture 20–50 real (redacted) traces from production or staging
Dataset — import into Langfuse as a golden set with expected properties (correct tool, citation present, refuses when data missing)
Change — new prompt, retrieval config, or model route in middleware
Run eval — compare scores before merge
Ship behind a flag — middleware routes a slice of traffic to the new version; Langfuse tags promptVersion so you can diff live metrics
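The "run eval" step in the loop above can be reduced to a small gate that compares pass rates between the baseline and the candidate. This is a sketch under assumptions: each eval result is a dictionary of named boolean checks, and the 2% tolerance is an arbitrary placeholder rather than a recommended threshold.

```python
def pass_rates(results):
    """Compute the pass rate of each named check across eval results."""
    if not results:
        raise ValueError("no eval results to score")
    checks = {name for row in results for name in row}
    return {
        name: sum(1 for row in results if row.get(name)) / len(results)
        for name in sorted(checks)
    }


def eval_gate(baseline, candidate, max_regression=0.02):
    """Return (ok, regressions) comparing candidate pass rates to baseline."""
    regressions = {}
    for check, base_rate in baseline.items():
        # A check missing from the candidate run counts as a 0% pass rate.
        cand_rate = candidate.get(check, 0.0)
        if base_rate - cand_rate > max_regression:
            regressions[check] = (base_rate, cand_rate)
    return (not regressions, regressions)
```

In CI, a failing gate would block the merge and print the regressed checks, for example that `citation_present` dropped while `correct_tool` held steady.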

This connects directly to the rollout order in LLM middleware explained: middleware first, first workflow-bound feature second, eval baseline third, then retrieval or agents.
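For the "ship behind a flag" step, the middleware needs a stable way to send a slice of traffic to the new prompt version. A common approach is hash-based bucketing, sketched below. The function name and the version label `copilot-system-v4` are hypothetical; only `copilot-system-v3` appears earlier in this guide.

```python
import hashlib


def prompt_version_for(tenant_id: str, session_id: str, rollout_percent: int,
                       stable: str = "copilot-system-v3",
                       candidate: str = "copilot-system-v4") -> str:
    """Pick a prompt version deterministically for a tenant and session."""
    key = f"{tenant_id}:{session_id}".encode("utf-8")
    # Map the hash onto buckets 0..99; the same session always lands
    # in the same bucket, so a conversation never switches mid-thread.
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return candidate if bucket < rollout_percent else stable
```

The returned value is what gets recorded as `promptVersion` on the trace, so live metrics for the two versions can be compared directly.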

Production readiness checklist:

Server-side auth
Tenant-scoped context
Structured logging
Cost per action
Eval pipeline
Provider fallback
Feature flags
Audit on tool calls

Use this as a gate before calling an AI feature GA, not as a post-launch backlog.

Scores do not need to be fancy on day one. A support engineer marking traces "wrong citation" in the Langfuse UI is more valuable than an unused automated metric nobody maintains.

When to add Langfuse

You do not need Langfuse to ship an internal demo. You do need structured observability before external users or paying tenants depend on AI output.

Stage	Minimum observability
POC / demo	Structured log line: feature, user, latency, tokens, model ID
First production feature	Langfuse (or equivalent) on every middleware model call; tenant + feature tags
Second AI feature	Shared tracing module — one integration, all features emit the same metadata schema
Prompt iteration at scale	Prompt management + datasets + eval runs gated in CI
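For the POC row in the table above, a structured log line can be as small as one JSON object per model call. The sketch below uses only the standard library. The function name `llm_log_line` and the field names are illustrative assumptions.

```python
import json
import time


def llm_log_line(feature: str, user_id: str, model: str,
                 latency_ms: float, input_tokens: int, output_tokens: int) -> str:
    """Serialize one model call as a single JSON log line."""
    return json.dumps({
        "event": "llm_call",
        "ts": round(time.time(), 3),
        "feature": feature,
        "user": user_id,
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
    }, sort_keys=True)
```

Keeping the same field names you later use as trace metadata makes the move from log lines to a tracing platform a change of sink rather than a change of schema.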

Adding Langfuse after three copilots each log differently means a migration project. Wire it when you extract Layer 2 shared middleware — the same milestone where rate limiting and provider routing centralize.

Common mistakes

Tracing from the browser. Langfuse keys stay server-side. Client-side tracing exposes secrets and captures incomplete context.

Logging the final answer only. The bug is usually in retrieval, tool selection, or prompt assembly — not the last token streamed. Trace the full chain.

No tenant or feature dimensions. A trace you cannot filter by customer is useless in multi-tenant SaaS.

Skipping flush in short-lived workers. Serverless functions and request-scoped workers exit before trace buffers drain. Call flush() (or your SDK's equivalent) before returning — otherwise traces silently disappear.

Treating Langfuse as compliance storage. Traces are debugging artifacts. Define retention, redaction, and access controls like any other log system.

Observability without ownership. Someone on the team — platform, ML eng, or a senior backend dev — should review traces weekly during rollout. Dashboards nobody opens do not count.

How 475 Cumulus uses Langfuse on engagements

We do not sell Langfuse licenses or replace your platform team. On integration projects, we typically:

Instrument the middleware layer you already have (or help you extract one) with Langfuse or OTel-compatible tracing
Define the metadata schema — feature, tenant, prompt version — so your on-call can debug without reading Python notebooks
Stand up eval datasets from real workflow boundaries — support tickets, search sessions, classification batches — not synthetic lorem ipsum
Connect tracing to rollout — feature flags, canary prompt versions, cost alerts per tenant

The outcome is AI features that behave like the rest of your production systems: permissioned, observable, and improvable without guessing what the model saw.

Adding LLM features without observability is how POCs become production incidents. Describe your copilot or agent — we will map middleware, tracing, and eval gates for your stack and auth model.
