
LLM observability beyond Langfuse — the full production stack

Langfuse covers traces and evals. Here is what else production teams need: structured logging, OpenTelemetry metrics, quality signals, sampling, canaries, and when to add Braintrust, Phoenix, or your existing APM.

You wired Langfuse into middleware. Traces show prompt versions, tool calls, and token usage. Support can finally answer "what did the model see on this request?"

Then SRE asks why there is no PagerDuty alert when copilot latency doubles. Finance wants a weekly spend dashboard in Grafana. Engineering wants evals to block a bad prompt merge. Langfuse is one layer of LLM observability, not the whole stack.

The four questions production teams ask

Every LLM feature generates four classes of questions. No single product answers all of them well.

| Question | Example | Best layer |
| --- | --- | --- |
| What happened on this request? | Wrong citation on ticket #8842 | Trace (Langfuse or equivalent) |
| Is the system healthy right now? | p95 latency up 40% since deploy | Metrics + alerts (OTel → Datadog, Grafana, Honeycomb) |
| Should we ship this change? | New prompt scores 12 points lower on golden set | Eval pipeline (eval guide) |
| Who spent what, on what outcome? | Tenant ACME costs $3 per accepted draft | Structured logs + cost metrics (cost guide) |

Langfuse is strongest on the first row and supports the third via datasets and scores. Rows two and four usually need your existing platform stack or deliberate middleware instrumentation.

Where everything sits

The integration point is unchanged: server-side LLM middleware, not the browser.

Client  →  your API  →  LLM middleware  →  model provider

              ┌───────────────┼───────────────┐
              │               │               │
         structured log    Langfuse trace   OTel metrics
         (Loki / DD)      (debug + evals)   (alerts + dashboards)
Figure: Request flow through LLM middleware. Client UI (copilot, search, actions) → Your API (existing auth session) → LLM middleware (auth, rate limits, logging) → Model provider (OpenAI, Anthropic, etc.). The middleware injects tenant-scoped context, enforces tool permissions, and records tokens and latency.

Every model call passes through your stack, not around it.

Middleware should emit all three from one code path. If each copilot logs differently, you will migrate observability twice.
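One way to keep the three outputs in step is to build a single event per model call and hand the same payload to every sink. The sketch below is a minimal illustration, not a prescribed design: `LlmCallEvent`, `emit_llm_call`, and the sink callables are hypothetical names, and real sinks would wrap your logger, your Langfuse client, and your OTel instruments.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class LlmCallEvent:
    feature: str
    tenant_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    outcome: str

# A sink is anything that accepts the event payload (log, trace, metric writer).
Sink = Callable[[dict], None]

def emit_llm_call(event: LlmCallEvent, sinks: Iterable[Sink]) -> int:
    """Send the same payload to every sink; return how many sinks accepted it."""
    payload = asdict(event)
    delivered = 0
    for sink in sinks:
        try:
            sink(dict(payload))  # copy so one sink cannot mutate another's view
            delivered += 1
        except Exception:
            # A failing observability sink must not break the user-facing request.
            continue
    return delivered
```

Because every sink receives the same fields, adding a second copilot means reusing `emit_llm_call` rather than inventing a new log shape.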

Layer 1: Structured logging (minimum viable)

Before traces or dashboards, emit one JSON log line per model call with fixed fields. This alone answers most "who spent what" and "which feature is noisy" questions.

Required fields:

  • feature, tenantId, userId (hashed if policy requires)
  • model, inputTokens, outputTokens, latencyMs
  • outcome: success, timeout, rate_limited, error, user_rejected
  • requestId or trace ID for correlation with support tickets

import logging
from dataclasses import dataclass

logger = logging.getLogger("llm.middleware")

@dataclass(frozen=True)
class LlmUsage:
    input_tokens: int
    output_tokens: int
    model: str

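def log_llm_request(
    *,
    feature: str,
    tenant_id: str,
    user_id: str,
    request_id: str,
    usage: LlmUsage,
    latency_ms: int,
    outcome: str,
) -> None:
    # One log event per model call; fields travel in `extra` so a JSON
    # formatter can emit them as top-level keys.
    logger.info(
        "llm.request",
        extra={
            "feature": feature,
            "tenant_id": tenant_id,
            "user_id": user_id,  # hash upstream if policy requires
            "request_id": request_id,  # correlates with traces and support tickets
            "model": usage.model,
            "input_tokens": usage.input_tokens,
            "output_tokens": usage.output_tokens,
            "total_tokens": usage.input_tokens + usage.output_tokens,
            "latency_ms": latency_ms,
            "outcome": outcome,  # success | timeout | rate_limited | error | user_rejected
        },
    )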

Ship logs to whatever you already run: Datadog, CloudWatch, Grafana Loki, Elasticsearch. You do not need an LLM-specific vendor for this layer.
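Note that Python's standard `logging` module does not print `extra` fields unless the formatter includes them. A minimal JSON formatter, assuming you are not already using a JSON logging library, might look like the following. `JsonFormatter`, `LLM_FIELDS`, and `configure_llm_logging` are illustrative names, and the field list is an assumption to adjust to your schema.

```python
import json
import logging

# Fields the middleware attaches through `extra=`; adjust to your schema.
LLM_FIELDS = (
    "feature", "tenant_id", "user_id", "request_id", "model",
    "input_tokens", "output_tokens", "total_tokens", "latency_ms", "outcome",
)

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in LLM_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

def configure_llm_logging() -> logging.Logger:
    """Attach the JSON formatter to the middleware logger (writes to stderr)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    llm_logger = logging.getLogger("llm.middleware")
    llm_logger.addHandler(handler)
    llm_logger.setLevel(logging.INFO)
    return llm_logger
```

Call `configure_llm_logging()` once at startup; fields missing from a given record are simply omitted from the output line.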

Query examples:

# Datadog: tokens by tenant
sum:llm.tokens{feature:copilot} by {tenant_id}.as_count()

# Loki: tokens per tenant over the last hour (sums total_tokens from the JSON log lines)
sum by (tenant_id) (
  sum_over_time({app="api"} |= "llm.request" | json | unwrap total_tokens [1h])
)

Structured logging is the first thing we add on engagements when there is no observability at all. Langfuse comes next, not instead.

Layer 2: OpenTelemetry metrics (SRE-friendly)

Traces debug one bad answer. Metrics tell you whether the feature is degrading for everyone.

Export counters and histograms from middleware:

| Metric | Type | Use |
| --- | --- | --- |
| llm.requests | Counter | Volume by feature, tenant, model |
| llm.tokens | Counter | Input vs output dimensions |
| llm.cost_usd | Counter | Estimated spend (reconcile to invoice monthly) |
| llm.latency_ms | Histogram | p50 / p95 / p99 for paging |
| llm.errors | Counter | Provider timeouts, schema failures, budget exceeded |

from opentelemetry import metrics

meter = metrics.get_meter("llm.middleware")
token_counter = meter.create_counter(
    "llm.tokens",
    description="LLM tokens by tenant and feature",
)
cost_counter = meter.create_counter(
    "llm.cost_usd",
    description="Estimated LLM spend in USD",
)

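def record_otel_cost(attrs: dict, usage, cost_usd: float) -> None:
    labels = {
        "feature": attrs["feature"],
        "tenant_id": attrs["tenant_id"],
        "model": usage.model,
        "outcome": attrs["outcome"],
    }
    # Record input and output separately so dashboards can split by token_type.
    token_counter.add(usage.input_tokens, {**labels, "token_type": "input"})
    token_counter.add(usage.output_tokens, {**labels, "token_type": "output"})
    cost_counter.add(cost_usd, labels)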

Langfuse can dual-write via its OpenTelemetry exporter: LLM-native traces in Langfuse for debugging, aggregated metrics in Datadog or Grafana for alerts. Many platform teams already have on-call runbooks there. Do not build a parallel paging system inside Langfuse unless your SRE team lives there.

Alert on symptoms, not vibes:

  • p95 llm.latency_ms > 8s for 10 minutes
  • llm.errors rate > 2% for a single feature
  • llm.cost_usd per tenant > daily budget (see cost monitoring)
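These rules normally live in your alerting platform, but the logic is simple enough to sketch. The example below assumes a window of latency samples and counts has already been pulled from your metrics store. `AlertThresholds` and `evaluate_alerts` are hypothetical names, and nearest-rank is only one of several percentile definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertThresholds:
    p95_latency_ms: float = 8000.0  # p95 llm.latency_ms > 8s
    error_rate: float = 0.02        # llm.errors rate > 2%

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]

def evaluate_alerts(
    latencies_ms: list[float],
    requests: int,
    errors: int,
    daily_cost_usd: float,
    daily_budget_usd: float,
    thresholds: AlertThresholds = AlertThresholds(),
) -> list[str]:
    """Return the names of the rules that fire for one feature and tenant window."""
    fired = []
    if latencies_ms and p95(latencies_ms) > thresholds.p95_latency_ms:
        fired.append("latency_p95")
    if requests > 0 and errors / requests > thresholds.error_rate:
        fired.append("error_rate")
    if daily_cost_usd > daily_budget_usd:
        fired.append("tenant_budget")
    return fired
```

The "for 10 minutes" condition is left to the alerting platform, which would evaluate a rule like this over consecutive windows before paging.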

Layer 3: Evals and quality signals

Tracing tells you what happened on one request. Evals tell you whether a prompt or retrieval change is safe to ship across dozens of cases.

The loop:

  1. Baseline production or staging traces (redacted)
  2. Golden dataset with property-based expectations (refusal, citation, correct tool)
  3. Run on every meaningful change in CI
  4. Ship behind a feature flag; compare live metrics by promptVersion
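Steps 2 and 3 of the loop can be sketched as a small gate. This is an illustration under stated assumptions rather than a reference runner: `GoldenCase`, `run_gate`, the two property checks, the `[doc:...]` citation marker, and the 0.9 pass-rate threshold are all hypothetical, and `generate` stands in for a call through your middleware.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]

@dataclass(frozen=True)
class GoldenCase:
    name: str
    prompt: str
    checks: tuple[Check, ...]  # property-based expectations, not exact strings

def cites_source(output: str) -> bool:
    # Assumed citation marker; replace with your own format.
    return "[doc:" in output

def refuses(output: str) -> bool:
    # Assumed refusal phrasing; replace with your own policy check.
    return "cannot help" in output.lower()

def run_gate(
    cases: list[GoldenCase],
    generate: Callable[[str], str],
    min_pass_rate: float = 0.9,
) -> tuple[bool, list[str]]:
    """Run every golden case; return (gate passed, names of failing cases)."""
    if not cases:
        raise ValueError("golden dataset is empty")
    failures = []
    for case in cases:
        output = generate(case.prompt)
        if not all(check(output) for check in case.checks):
            failures.append(case.name)
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate, failures
```

In CI, a wrapper script would exit non-zero when the first element is `False` and print the failing case names so the merge is blocked with a reason attached.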

See Eval pipelines for LLM features for runners, scorers, and gates.

Online quality complements offline evals:

| Signal | Source |
| --- | --- |
| Thumbs down / "report incorrect" | Product UI → Langfuse score or your DB |
| User edited the draft before sending | Implicit negative |
| Session abandoned mid-flow | Possible quality or latency issue |
| Tool confirmation rejected | Agent overreach |

A support engineer marking ten traces "wrong citation" in Langfuse weekly beats an unused automated metric nobody maintains.

Layer 4: Cost and unit economics

Invoice totals are too coarse. Production teams track cost per successful outcome: accepted draft, resolved ticket, classified label applied.

That requires the same tenantId and feature tags on logs, traces, and metrics, plus an outcome dimension tied to product events. Full walkthrough: Monitoring LLM costs in production.
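As a rough illustration of the arithmetic, the sketch below sums per-call cost by tenant and feature and divides by the number of successful product outcomes. The record fields, the `accepted_draft` outcome value, and the function name `cost_per_outcome` are assumptions; real numbers come from your logs joined to product events.

```python
from collections import defaultdict
from typing import Optional

def cost_per_outcome(
    records: list[dict],
    success_outcome: str = "accepted_draft",
) -> dict[tuple[str, str], Optional[float]]:
    """Map (tenant_id, feature) to total cost divided by successful outcomes.

    Returns None for a key that spent money but produced no successes.
    """
    cost = defaultdict(float)
    wins = defaultdict(int)
    for record in records:
        key = (record["tenant_id"], record["feature"])
        cost[key] += record["cost_usd"]  # rejected and failed calls still cost money
        if record["outcome"] == success_outcome:
            wins[key] += 1
    return {key: (cost[key] / wins[key] if wins[key] else None) for key in cost}
```

Three calls at $1 each with one accepted draft give $3 per accepted draft, which is the figure an invoice total alone would never show.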

Langfuse alternatives and complements

Langfuse is our default recommendation for trace + prompt version + eval dataset in one place. Other tools fill adjacent niches. Pick one LLM-native platform in production; do not run three.

| Tool | Strength | When to consider |
| --- | --- | --- |
| Langfuse | Traces, prompts, scores, datasets, self-host | Default for middleware-integrated observability (setup guide) |
| Braintrust | Evals, regression runs, CI integration | Team thinks in test cases first; eval-heavy workflow |
| Arize Phoenix | Open-source tracing, embedding visualization | RAG debugging, retrieval quality, drift exploration |
| LangSmith | LangChain / LangGraph integration | Orchestration is already LangChain-native |
| Helicone / Portkey | Gateway proxy + request logging | Gateway is your middleware boundary |
| Datadog LLM Observability | Managed GenAI tracing | Already standardized on Datadog for everything |

Generic APM (Honeycomb, Sentry, Grafana Tempo) stays valuable for exceptions, HTTP latency, and infrastructure. Use it alongside Langfuse, not as a replacement for prompt-version debugging.

Techniques that matter as much as tools

Consistent metadata schema

Every layer should share dimensions: feature, tenantId, sessionId, promptVersion, model, outcome. Add RAG fields (retrievalChunkCount, topDocIds) and agent fields (toolsInvoked, permissionDenied) as features mature. A trace you cannot filter by customer is useless in multi-tenant SaaS.

Trace the full chain

The bug is rarely the final streamed token. Instrument retrieval, tool selection, permission checks, and prompt assembly as child spans. Logging only the assistant reply hides 80% of copilot failures.

Sampling at scale

| Environment | Guidance |
| --- | --- |
| Staging | Trace 100% of requests |
| Production, low-volume copilot | Trace 100% initially |
| Production, high-volume classifier | Sample 1–10%; always trace errors and budget breaches |

Sampling keeps storage cost sane without flying blind.
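A deterministic sampler keeps the decision stable across retries and services that see the same request. The sketch below hashes the request ID into a bucket and always keeps non-success outcomes and budget breaches. The function name, the `budget_exceeded` flag, and the 5% default are assumptions.

```python
import hashlib

def should_trace(
    request_id: str,
    outcome: str,
    budget_exceeded: bool = False,
    sample_rate: float = 0.05,
) -> bool:
    """Decide whether to keep the full trace for one request."""
    # Errors, timeouts, and budget breaches are always worth a full trace.
    if outcome != "success" or budget_exceeded:
        return True
    # Hash the request ID so the same request always gets the same decision.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Structured logs and metrics should still be emitted for every request; only the heavier trace payload is sampled.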

Synthetic canaries

A cron job runs five golden prompts against production middleware every hour. Alert if latency, token count, or eval score drifts. Catches provider outages and silent prompt regressions before users report them.
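A canary runner can be very small. The following sketch assumes a `call_middleware` function that sends a prompt through production middleware and a `check` function that scores the output; both are hypothetical stand-ins, as are `CanaryResult`, `run_canaries`, and the 8-second latency ceiling.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class CanaryResult:
    prompt_id: str
    latency_ms: float
    passed: bool

def run_canaries(
    prompts: dict[str, str],
    call_middleware: Callable[[str], str],
    check: Callable[[str, str], bool],
    max_latency_ms: float = 8000.0,
) -> list[CanaryResult]:
    """Run each golden prompt once and record latency and pass/fail."""
    results = []
    for prompt_id, prompt in prompts.items():
        start = time.perf_counter()
        try:
            output = call_middleware(prompt)
            ok = check(prompt_id, output)
        except Exception:
            ok = False  # a provider outage or middleware failure is a failed canary
        latency_ms = (time.perf_counter() - start) * 1000
        results.append(CanaryResult(prompt_id, latency_ms, ok and latency_ms <= max_latency_ms))
    return results

def failing_canaries(results: list[CanaryResult]) -> list[str]:
    """Prompt IDs to include in the alert payload."""
    return [r.prompt_id for r in results if not r.passed]
```

The cron entry would call `run_canaries`, then page or post to chat when `failing_canaries` returns anything.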

Session-level grouping

For multi-turn copilots, tag sessionId on every turn. Metrics like cost per resolved thread or turns until abandonment beat per-message token counts for product decisions.

Redaction and retention

Traces are debugging artifacts, not product data. Mask emails, strip secrets, truncate retrieved chunks, set TTL by data classification. Observability that violates privacy policy gets shut down.
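A minimal redaction pass, applied before anything is sent to a trace backend, could look like this. The regular expressions are deliberately simple assumptions (the secret pattern only covers keys shaped like `sk-...`, `pk_...`, or `api-...`) and are no substitute for a reviewed data-classification policy; `redact` and its 500-character limit are illustrative.

```python
import re

# Simplified patterns for illustration; extend to match your own data classes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\b(?:sk|pk|api)[-_][A-Za-z0-9_-]{16,}\b")

def redact(text: str, max_chars: int = 500) -> str:
    """Mask emails and key-like strings, then truncate long content."""
    text = EMAIL_RE.sub("[email]", text)
    text = SECRET_RE.sub("[secret]", text)
    if len(text) > max_chars:
        text = text[:max_chars] + "...[truncated]"
    return text
```

Masking runs before truncation so that a secret cut in half at the length limit cannot slip past the pattern.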

RAG-specific observability

Generic generation traces miss retrieval failures. Add spans and metrics for:

  • Query rewrite and filters applied
  • Chunk count returned vs injected into prompt
  • Embed latency and cost (separate from generation)
  • Citation present in output vs retrieved doc IDs
  • "Low confidence" fallback path taken

Phoenix or custom dashboards on embedding distributions help when answers drift after a doc corpus update. Pair with property-based evals: "answer mentions retrieved doc title", "refuses when no chunks above threshold".
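The "citation present vs retrieved doc IDs" check reduces to set arithmetic once citations are machine-readable. The sketch below assumes the model is prompted to cite as `[doc:ID]`; that marker format, `CITATION_RE`, and `citation_report` are hypothetical.

```python
import re

# Assumed citation marker format: [doc:some-id]
CITATION_RE = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")

def citation_report(output: str, retrieved_doc_ids: list[str]) -> dict:
    """Compare doc IDs cited in the answer with the IDs retrieval returned."""
    cited = set(CITATION_RE.findall(output))
    retrieved = set(retrieved_doc_ids)
    return {
        "has_citation": bool(cited),
        "grounded": sorted(cited & retrieved),      # cited and retrieved
        "unsupported": sorted(cited - retrieved),   # cited but never retrieved
        "unused": sorted(retrieved - cited),        # retrieved but not cited
    }
```

A non-empty `unsupported` list is a useful trace score and metric dimension, since it points at a generation problem rather than a retrieval problem.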

Agent and tool-calling observability

Agents add steps, not just tokens. Per tool invocation, record:

  • Tool name and redacted arguments
  • Permission outcome (allowed, denied, needs confirmation)
  • Downstream API latency and error
  • Human approval granted or rejected

When a customer says "the copilot tried to close the wrong ticket," you need which tool, which ID, which policy gate — not a generic "agent error" in logs.
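One possible shape for that per-invocation record is sketched below. `ToolInvocation`, the `tool.*` attribute names, and the three permission values are assumptions, not an established convention; the point is that each field from the list above becomes a filterable attribute on the tool's span.

```python
from dataclasses import dataclass
from typing import Optional

PERMISSION_OUTCOMES = {"allowed", "denied", "needs_confirmation"}

@dataclass(frozen=True)
class ToolInvocation:
    tool_name: str
    redacted_args: dict
    permission: str
    downstream_latency_ms: Optional[float] = None
    downstream_error: Optional[str] = None
    human_approved: Optional[bool] = None  # None when no approval was requested

    def __post_init__(self) -> None:
        if self.permission not in PERMISSION_OUTCOMES:
            raise ValueError(f"unknown permission outcome: {self.permission}")

    def span_attributes(self) -> dict:
        """Flatten to scalar attributes; omit fields that were never set."""
        attrs = {"tool.name": self.tool_name, "tool.permission": self.permission}
        for key, value in self.redacted_args.items():
            attrs[f"tool.arg.{key}"] = str(value)
        optional = {
            "tool.downstream_latency_ms": self.downstream_latency_ms,
            "tool.downstream_error": self.downstream_error,
            "tool.human_approved": self.human_approved,
        }
        attrs.update({k: v for k, v in optional.items() if v is not None})
        return attrs
```

With this in place, the "wrong ticket" report becomes a query on `tool.name`, `tool.arg.ticket_id`, and `tool.permission` instead of a search through free-text logs.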

Connect to prompt injection and tool security: security-relevant denials should appear in traces and metrics, not only in stderr.

Maturity model: what to add when

| Stage | Minimum stack |
| --- | --- |
| POC / demo | Structured log: feature, model, tokens, latency |
| First production feature | Langfuse (or equivalent) on every middleware call; tenant + feature tags |
| On-call ownership | OTel metrics + alerts in existing APM |
| Prompt iteration | Golden dataset + CI eval gate |
| Multi-tenant scale | Per-tenant budgets, sampling, cost-per-outcome dashboards |
| Second AI feature | Shared tracing module: one schema, all features |

Adding observability after three features each instrument differently means a migration project. Wire the shared module when you extract LLM middleware.

Common mistakes

Tool sprawl. Langfuse + Braintrust + Helicone + a custom spreadsheet. Pick one LLM-native layer and integrate it deeply.

Tracing from the browser. Keys stay server-side. Client traces are incomplete and insecure.

APM-only. Datadog shows /api/copilot is slow. It does not show which promptVersion regressed.

Langfuse-only. No alerts, no budgets, no CI eval gate. You debug well but ship regressions and cost surprises.

No owner. Nobody is assigned to review traces and eval failures, so regressions sit unnoticed. Someone should review them weekly during rollout. Dashboards nobody opens do not count.

100% trace volume forever. Storage cost explodes on classifiers running on every row.

How the pieces connect at rollout

Prompt / retrieval change
   │
   ▼
CI eval on golden set ──fail──► block merge
   │
  pass
   │
   ▼
deploy behind feature flag (promptVersion tagged)
   │
   ▼
monitor: OTel alerts + cost per outcome + Langfuse trace sampling
   │
   ▼
production feedback → new golden cases → repeat

Production readiness checklist

  • Server-side auth
  • Tenant-scoped context
  • Structured logging
  • Cost per action
  • Eval pipeline
  • Provider fallback
  • Feature flags
  • Audit on tool calls

Use this as a gate before calling an AI feature GA — not as a post-launch backlog.

How 475 Cumulus approaches the full stack

We do not sell observability licenses. On integration projects we typically:

  • Define one metadata schema across logs, traces, and metrics
  • Instrument middleware once with Langfuse or OTel-compatible tracing
  • Stand up eval datasets from real workflow boundaries, not lorem ipsum
  • Connect tracing to rollout — feature flags, canary prompt versions, per-tenant cost alerts
  • Dual-write to existing APM when platform teams already run Datadog, Honeycomb, or Grafana

The outcome is LLM features that behave like the rest of your production systems: permissioned, measurable, and improvable without guessing what the model saw.


Langfuse is the right center of gravity for LLM-native debugging. Round it out with structured logs, OTel alerts, eval gates, and unit economics — then you have observability, not just traces. Describe your copilot or agent and we will map the full stack for your middleware and auth model.
