Langfuse for LLM observability — where it fits in your middleware stack
How to trace model calls, debug prompts, and run evals with Langfuse — integrated into server-side LLM middleware, not bolted onto a frontend demo.
Your copilot shipped. A customer says the answer was wrong. Product asks whether quality regressed after last week's prompt change. Finance asks why tenant ACME's token spend doubled.
If your only signal is console.log("LLM ok") and a provider invoice, you are flying blind. Generic APM tells you the route was slow. It does not tell you which prompt version ran, what context was retrieved, or why the model chose a particular tool.
That gap is what Langfuse addresses — an open-source LLM engineering platform for tracing, prompt versioning, scores, and eval datasets. It is not a model provider and not a replacement for your middleware. It is the observability layer your middleware writes to.
What Langfuse is
Langfuse is an open-source platform for building and operating LLM features. The parts production teams use most:
| Capability | What it gives you |
|---|---|
| Traces & observations | End-to-end view of a user request — model calls, tool invocations, retrieval steps, latency, token usage |
| Prompt management | Versioned prompts fetched at runtime, linked to the traces that used them |
| Scores & annotations | Human or automated quality labels on traces — thumbs down, hallucination flag, eval pass/fail |
| Datasets & eval runs | Golden inputs, regression runs before prompt or retrieval changes ship |
Langfuse can run self-hosted (Docker, Kubernetes) or on Langfuse Cloud. When self-hosted, data stays in infrastructure you control, which matters when traces contain retrieved customer context or tool outputs.
Where it sits in the stack
Your architecture should still look like this:
1. Client UI → your authenticated API
2. LLM middleware — auth, rate limits, context assembly, model call
3. Model provider
Langfuse attaches at step 2. Every model call, retrieval query, and tool handler inside middleware emits structured observations. The client never sees Langfuse API keys.
[Diagram: Client UI (copilot, search, actions) → Your API (existing auth session) → LLM middleware (auth, rate limits, logging) → Model provider (OpenAI, Anthropic, etc.). Every model call passes through your stack, not around it.]
Think of the split this way:
- Middleware enforces policy — who can call the model, what context they get, when to stop
- Langfuse records what happened — inputs, outputs, cost, latency, prompt version — so engineers can debug and improve
This is the same separation you already run for databases: Postgres executes queries; Datadog or Honeycomb records them. Langfuse is the LLM-native equivalent — traces are structured around generations, spans, and sessions, not just HTTP status codes.
See LLM middleware explained for the full middleware pattern and What production-ready LLM integration actually means for the observability checklist.
What to trace (and what to tag)
The value of Langfuse is not "we enabled tracing." It is consistent metadata on every request so you can filter production issues in seconds.
At minimum, tag every trace with:
- feature — copilot, search-assist, classifier, support-agent
- tenantId — for cost and quality per customer
- userId — hashed if policy requires; still useful for support escalations
- sessionId or threadId — group multi-turn conversations
- promptVersion — which managed prompt or template was active
- model — provider + model ID actually routed to
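One way to keep these tags consistent is a small shared type that every feature fills in before it traces anything. The sketch below is illustrative only: `TraceTags` and `hash_user_id` are hypothetical names, the salt is a placeholder, and nothing here calls the Langfuse SDK.

```python
import hashlib
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TraceTags:
    """Hypothetical shared schema: one field per tag in the list above."""
    feature: str
    tenant_id: str
    user_id: str
    session_id: str
    prompt_version: str
    model: str

    def __post_init__(self):
        # Reject empty values so an untagged trace fails loudly in tests.
        missing = [name for name, value in asdict(self).items() if not value]
        if missing:
            raise ValueError(f"missing trace tags: {', '.join(missing)}")


def hash_user_id(user_id: str, salt: str) -> str:
    """One-way hash for when policy forbids raw user IDs in traces."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]


tags = TraceTags(
    feature="copilot",
    tenant_id="acme",
    user_id=hash_user_id("user-42", salt="example-salt"),  # placeholder salt
    session_id="acme:thread-7",
    prompt_version="copilot-system-v3",
    model="provider/model-id",
)
```

Because the schema raises on empty fields, a feature that forgets a tag fails in tests rather than producing unfilterable traces in production.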
For RAG features, add child spans for:
- Retrieval query and filters applied
- Document IDs or chunk references returned (not necessarily full text — redact per policy)
- Whether the model cited retrieved content or went off-script
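The last check in that list can start as a simple comparison between the chunk IDs retrieval returned and the IDs the answer cites. This is a minimal sketch that assumes your prompt asks the model to cite sources as `[chunk:<id>]`; that format and the `citation_report` helper are assumptions, not something Langfuse provides.

```python
import re

# Assumed citation format: the prompt asks the model to cite chunks as [chunk:<id>].
CITATION_PATTERN = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")


def citation_report(answer: str, retrieved_ids: list[str]) -> dict:
    """Compare chunk IDs cited in the answer with the IDs retrieval returned."""
    cited = set(CITATION_PATTERN.findall(answer))
    retrieved = set(retrieved_ids)
    return {
        "cited": sorted(cited & retrieved),
        "unknown_citations": sorted(cited - retrieved),
        # Grounded only if the answer cites something and every citation is real.
        "grounded": bool(cited) and cited <= retrieved,
    }


report = citation_report(
    "Refunds take 5 days [chunk:kb-12]. See also [chunk:kb-99].",
    retrieved_ids=["kb-12", "kb-40"],
)
```

The resulting dictionary is small enough to attach to the retrieval span as metadata or to turn into a score.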
For agents and tool-calling, trace each tool invocation as a nested observation with permission outcome and latency. When something fails, you need to see which tool, which argument, which API error — not a generic "agent error."
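A wrapper around each tool handler is one way to capture those fields in a uniform shape. The sketch below uses only the standard library; `run_tool` and `lookup_invoice` are hypothetical, and in a real integration the returned record would be attached to a nested observation rather than just returned.

```python
import time
from typing import Any, Callable


def run_tool(name: str, handler: Callable[..., Any], args: dict,
             allowed: bool) -> dict:
    """Run one tool call and return a record with the fields worth tracing."""
    record = {"tool": name, "args": args, "permitted": allowed,
              "output": None, "error": None}
    started = time.perf_counter()
    if allowed:
        try:
            record["output"] = handler(**args)
        except Exception as exc:
            # Keep the real error type and message, not a generic "agent error".
            record["error"] = f"{type(exc).__name__}: {exc}"
    else:
        record["error"] = "permission denied"
    record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
    return record


def lookup_invoice(invoice_id: str) -> dict:
    # Hypothetical tool handler that fails the way a real API might.
    raise LookupError(f"invoice {invoice_id} not found")


failed = run_tool("lookup_invoice", lookup_invoice, {"invoice_id": "inv-9"}, allowed=True)
denied = run_tool("lookup_invoice", lookup_invoice, {"invoice_id": "inv-9"}, allowed=False)
```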
How to wire it into middleware
Langfuse ships SDKs for Python, Node.js, and other runtimes, plus an OpenTelemetry exporter if you already standardize on OTel. The integration point is always the same: your server-side middleware module, not the client.
1. Configure credentials (server-side only)
Store keys in your existing secrets manager or environment — never in frontend bundles or mobile apps:
```bash
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_BASE_URL=https://cloud.langfuse.com # or your self-hosted URL
```
Initialize once at process startup (FastAPI, Django, Express, a Sidekiq worker, a Kubernetes pod — same idea everywhere).
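A startup check that fails fast when a variable is missing can save a confusing "no traces" investigation later. This is a minimal sketch, assuming the three variable names shown above; the helper names are hypothetical and the check does not contact Langfuse.

```python
import os

REQUIRED_VARS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_BASE_URL")


def missing_langfuse_config(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def require_langfuse_config(env=None) -> None:
    """Raise at process startup instead of silently running without tracing."""
    missing = missing_langfuse_config(env)
    if missing:
        raise RuntimeError(f"Langfuse config missing: {', '.join(missing)}")
```

Calling `require_langfuse_config()` once, next to the rest of your startup checks, keeps the failure close to its cause.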
2. Trace the middleware boundary
Wrap the code path that already owns auth, context assembly, and the model call:
```python
from langfuse import get_client, observe

langfuse = get_client()


@observe(name="copilot-chat")
def handle_copilot(user, message: str, feature: str = "copilot"):
    # Auth and rate limits already ran in the route handler.
    context = fetch_tenant_context(user.tenant_id, user.id)
    messages = build_messages(context, message)
    langfuse.update_current_trace(
        user_id=user.id,
        session_id=f"{user.tenant_id}:{user.thread_id}",
        tags=[feature, user.tenant_id],
        metadata={"promptVersion": "copilot-system-v3"},
    )
    return call_model(
        messages=messages,
        model=select_model(feature, user.tenant_id),
    )
```
The @observe decorator (or an explicit trace/span in SDKs without decorators) creates a trace around the middleware function. Tag tenant, feature, and session inside middleware — not in the UI. See Langfuse's SDK docs for other runtimes.
For RAG, add nested spans around retrieval before the generation:
```python
@observe(name="retrieve-docs")
def retrieve_docs(query: str, tenant_id: str):
    chunks = vector_search(query, filters={"tenant_id": tenant_id})
    # Return IDs and scores only; full chunk text stays out of the trace.
    return [{"id": c.id, "score": c.score} for c in chunks]
```
For tool-calling agents, one observation per tool invocation — permission result, latency, and error message. See Build an agent with LangChain for the orchestration side; Langfuse attaches to the same server-side invoke path.
3. OpenTelemetry path (optional)
If your stack already emits OpenTelemetry spans — from an LLM client library, a custom instrumentation layer, or a provider SDK — add Langfuse's span processor to your OTel pipeline instead of hand-rolling every span:
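```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
# Import path can differ between SDK versions; check the Langfuse OTel docs.
from langfuse.opentelemetry import LangfuseSpanProcessor

provider = TracerProvider()
provider.add_span_processor(LangfuseSpanProcessor())
# Register the provider so instrumented libraries actually use it.
trace.set_tracer_provider(provider)
```
Spans from instrumented model calls flow into Langfuse automatically. Many teams use OTel when multiple services participate in one agent workflow and they want traces in Langfuse and their existing Datadog or Grafana backend.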
4. Link prompt versions to traces
When prompts live in Langfuse rather than hard-coded strings, fetch at runtime and record the version on the generation:
```python
prompt = langfuse.get_prompt("copilot-system-v3")
compiled = prompt.compile(product_name="Acme")

with langfuse.start_as_current_observation(
    as_type="generation",
    name="completion",
    model="claude-sonnet-4",
    input=compiled,
) as generation:
    response = call_provider(compiled, user_message)
    # Passing the prompt object links this generation to its prompt version.
    generation.update(output=response, prompt=prompt)
```
When quality drops, filter traces by prompt version and compare latency, token cost, and scores — instead of guessing which deploy introduced the regression.
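If you export traces for offline analysis, the same comparison is a short group-by. The sketch below runs on hypothetical rows; the field names are illustrative and are not a Langfuse export schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported trace rows; field names are illustrative only.
traces = [
    {"promptVersion": "v2", "latency_ms": 800, "total_tokens": 1200, "score": 1.0},
    {"promptVersion": "v2", "latency_ms": 1000, "total_tokens": 1400, "score": 1.0},
    {"promptVersion": "v3", "latency_ms": 900, "total_tokens": 2100, "score": 0.0},
    {"promptVersion": "v3", "latency_ms": 1100, "total_tokens": 1900, "score": 1.0},
]


def summarize_by_prompt_version(rows):
    """Group rows by prompt version and average the metrics worth comparing."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["promptVersion"]].append(row)
    return {
        version: {
            "count": len(items),
            "avg_latency_ms": mean(r["latency_ms"] for r in items),
            "avg_tokens": mean(r["total_tokens"] for r in items),
            "avg_score": mean(r["score"] for r in items),
        }
        for version, items in grouped.items()
    }


summary = summarize_by_prompt_version(traces)
```

In this made-up data, v3 uses more tokens and scores lower than v2, which is the kind of signal that tells you where to look first.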
5. Flush in short-lived workers
On serverless functions, Cloud Run, Lambda, or any process that exits immediately after the response, buffered trace data may never reach Langfuse unless you flush explicitly:
```python
# At the end of the request handler, after middleware returns — not inside @observe
langfuse.flush()
```
Long-running services (Kubernetes pods, VM workers) usually flush on a background interval — but verify during load testing. Silent trace loss in serverless environments is one of the most common integration gaps.
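Putting the flush in a `finally` block means it also runs when the handler raises, which is often when you most want the trace. The sketch below uses a stand-in client so it runs without the SDK; `FakeTraceClient` and `handler` are hypothetical, and only the `flush()` call mirrors the real client.

```python
class FakeTraceClient:
    """Stand-in for a tracing client so the sketch runs without the SDK."""

    def __init__(self):
        self.buffer, self.sent = [], []

    def record(self, event: str) -> None:
        self.buffer.append(event)

    def flush(self) -> None:
        # Move everything buffered so far to the "sent" list.
        self.sent.extend(self.buffer)
        self.buffer.clear()


def handler(event: dict, client) -> dict:
    """Hypothetical serverless entry point."""
    try:
        client.record(f"copilot-chat:{event['message']}")
        if event.get("fail"):
            raise RuntimeError("provider timeout")
        return {"status": 200}
    finally:
        # Runs on success and on error, before the runtime can freeze the process.
        client.flush()
```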
Langfuse vs. your existing observability
If you already run Datadog, Honeycomb, Grafana, or Sentry, you might ask whether Langfuse is redundant.
Use both — for different questions.
| Question | Generic APM | Langfuse |
|---|---|---|
| Is /api/copilot slow? | Yes — p95, error rate | Yes — but tied to model latency breakdown |
| Which prompt version caused the regression? | No | Yes |
| What context was retrieved for this answer? | No | Yes — if you instrument retrieval spans |
| Did the model call the right tool? | No | Yes — per-tool observations |
| Token cost per tenant this week? | Only if you built custom metrics | Built around generations |
Langfuse also exports via OpenTelemetry, so spans can flow to an existing OTel collector if you want a single backend. Many teams run Langfuse for LLM-specific debugging and forward summary metrics to the platform finance and SRE already use.
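A summary metric for finance can be as simple as a per-tenant cost rollup over generation records. The sketch below is an assumption-heavy illustration: the prices are made up, the field names are hypothetical, and real token pricing varies by provider and model.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_MTOK = {"model-a": {"input": 3.00, "output": 15.00}}

# Hypothetical generation records for one week.
generations = [
    {"tenantId": "acme", "model": "model-a", "input_tokens": 1_000_000, "output_tokens": 100_000},
    {"tenantId": "acme", "model": "model-a", "input_tokens": 500_000, "output_tokens": 0},
    {"tenantId": "globex", "model": "model-a", "input_tokens": 0, "output_tokens": 200_000},
]


def cost_per_tenant(rows):
    """Sum token cost in USD per tenant using the price table above."""
    totals = defaultdict(float)
    for row in rows:
        price = PRICE_PER_MTOK[row["model"]]
        totals[row["tenantId"]] += (
            row["input_tokens"] * price["input"]
            + row["output_tokens"] * price["output"]
        ) / 1_000_000
    return {tenant: round(usd, 2) for tenant, usd in totals.items()}


weekly = cost_per_tenant(generations)
```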
The anti-pattern is expecting Datadog alone to replace LLM-native tracing. You can log prompts to stdout, but you will not get prompt versioning, eval datasets, or annotation workflows without building them yourself.
Evals and production rollout
Tracing tells you what happened. Evals tell you whether you should ship the change.
A practical loop — the same one we run on client integrations:
- Baseline — capture 20–50 real (redacted) traces from production or staging
- Dataset — import into Langfuse as a golden set with expected properties (correct tool, citation present, refuses when data missing)
- Change — new prompt, retrieval config, or model route in middleware
- Run eval — compare scores before merge
- Ship behind a flag — middleware routes a slice of traffic to the new version; Langfuse tags promptVersion so you can diff live metrics
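The traffic slice in the last step can be made deterministic by hashing a stable key, so a given tenant always sees the same prompt version during the rollout. This is a minimal sketch; `pick_prompt_version` is a hypothetical helper and bucketing by tenant rather than by user is an assumption about your rollout policy.

```python
import hashlib


def pick_prompt_version(tenant_id: str, stable: str, canary: str,
                        canary_percent: int) -> str:
    """Deterministically route a fixed slice of tenants to the canary prompt."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99, stable per tenant
    return canary if bucket < canary_percent else stable


version = pick_prompt_version(
    "acme", stable="copilot-system-v3", canary="copilot-system-v4", canary_percent=10
)
```

The value returned here is what middleware would record as promptVersion on the trace.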
This connects directly to the rollout order in LLM middleware explained: middleware first, first workflow-bound feature second, eval baseline third, then retrieval or agents.
Use this as a gate before calling an AI feature GA — not as a post-launch backlog.
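In code, such a gate can be a comparison of mean scores per check with a tolerance. The sketch below uses hypothetical pass/fail results and an arbitrary 0.02 threshold; it shows the shape of the decision only and does not use Langfuse's eval features.

```python
def eval_gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> dict:
    """Compare mean scores per check; fail if any check drops more than max_drop."""
    regressions = {}
    for check, base_scores in baseline.items():
        base = sum(base_scores) / len(base_scores)
        cand_scores = candidate[check]
        cand = sum(cand_scores) / len(cand_scores)
        if base - cand > max_drop:
            regressions[check] = {"baseline": round(base, 3), "candidate": round(cand, 3)}
    return {"passed": not regressions, "regressions": regressions}


# Hypothetical pass/fail results (1 = pass) over the same five golden inputs.
baseline = {"correct_tool": [1, 1, 1, 1, 0], "citation_present": [1, 1, 1, 1, 1]}
candidate = {"correct_tool": [1, 1, 1, 1, 1], "citation_present": [1, 1, 0, 1, 1]}

result = eval_gate(baseline, candidate)
```

Here the candidate improves tool selection but drops a citation, so the gate fails and names the check that regressed.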
Scores do not need to be fancy on day one. A support engineer marking traces "wrong citation" in the Langfuse UI is more valuable than an unused automated metric nobody maintains.
When to add Langfuse
You do not need Langfuse to ship an internal demo. You do need structured observability before external users or paying tenants depend on AI output.
| Stage | Minimum observability |
|---|---|
| POC / demo | Structured log line: feature, user, latency, tokens, model ID |
| First production feature | Langfuse (or equivalent) on every middleware model call; tenant + feature tags |
| Second AI feature | Shared tracing module — one integration, all features emit the same metadata schema |
| Prompt iteration at scale | Prompt management + datasets + eval runs gated in CI |
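For the POC row, the structured log line can be one JSON object per model call. A minimal sketch using only the standard library; the field names and the `llm_log_line` helper are assumptions, so adapt them to your logging conventions.

```python
import json
import logging
import time

logger = logging.getLogger("llm")


def llm_log_line(feature, user_id, model, started, input_tokens, output_tokens) -> str:
    """Build one JSON log line with the fields from the POC row above."""
    return json.dumps({
        "event": "llm_call",
        "feature": feature,
        "user": user_id,
        "model": model,
        "latency_ms": round((time.monotonic() - started) * 1000),
        "tokens": {"input": input_tokens, "output": output_tokens},
    }, sort_keys=True)


started = time.monotonic()
# ... the model call would happen here ...
line = llm_log_line("copilot", "user-42", "provider/model-id", started, 812, 96)
logger.info(line)
```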
Adding Langfuse after three copilots each log differently means a migration project. Wire it when you extract Layer 2 shared middleware — the same milestone where rate limiting and provider routing centralize.
Common mistakes
Tracing from the browser. Langfuse keys stay server-side. Client-side tracing exposes secrets and captures incomplete context.
Logging the final answer only. The bug is usually in retrieval, tool selection, or prompt assembly — not the last token streamed. Trace the full chain.
No tenant or feature dimensions. A trace you cannot filter by customer is useless in multi-tenant SaaS.
Skipping flush in short-lived workers. Serverless functions and request-scoped workers exit before trace buffers drain. Call flush() (or your SDK's equivalent) before returning — otherwise traces silently disappear.
Treating Langfuse as compliance storage. Traces are debugging artifacts. Define retention, redaction, and access controls like any other log system.
Observability without ownership. Someone on the team — platform, ML eng, or a senior backend dev — should review traces weekly during rollout. Dashboards nobody opens do not count.
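On the redaction point above, a small filter applied before text is attached to a trace is a reasonable starting place. The sketch below only masks email addresses and caps length; the pattern and the `redact` helper are illustrative assumptions and are not a complete PII policy.

```python
import re

# Simple email pattern for illustration; real PII detection needs more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str, max_chars: int = 500) -> str:
    """Mask email addresses and cap length before text is attached to a trace."""
    masked = EMAIL.sub("[email]", text)
    if len(masked) <= max_chars:
        return masked
    return masked[:max_chars] + "[truncated]"
```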
How 475 Cumulus uses Langfuse on engagements
We do not sell Langfuse licenses or replace your platform team. On integration projects, we typically:
- Instrument the middleware layer you already have (or help you extract one) with Langfuse or OTel-compatible tracing
- Define the metadata schema — feature, tenant, prompt version — so your on-call can debug without reading Python notebooks
- Stand up eval datasets from real workflow boundaries — support tickets, search sessions, classification batches — not synthetic lorem ipsum
- Connect tracing to rollout — feature flags, canary prompt versions, cost alerts per tenant
The outcome is AI features that behave like the rest of your production systems: permissioned, observable, and improvable without guessing what the model saw.
Adding LLM features without observability is how POCs become production incidents. Describe your copilot or agent — we will map middleware, tracing, and eval gates for your stack and auth model.
Related resources
LLM middleware: what it is, why you need it, and how to implement it
A practical guide to the server-side layer between your app and the model — auth, rate limits, routing, logging, and the patterns that keep AI features production-ready.
What production-ready LLM integration actually means
A practical checklist for engineering leaders — beyond the demo and before you call an AI feature shipped.
