Production AI Agent Observability: Monitoring, Debugging, and Cost Control at Scale
The hardest part of running AI agents in production is not the agent itself. It is everything around the agent: the operational discipline of knowing why a particular run produced a particular result, the cost management discipline of knowing where your token budget is going, and the regression management discipline of knowing whether yesterday’s deployment quietly broke today’s outputs. By the start of 2026, this combined discipline has a name in mainstream practice: AI agent observability.
In the first quarter of 2026 our blog covered the journey from AI pilot to production. This article is the operational follow-up. Once your agent is in production, observability is the ongoing investment that distinguishes a working AI capability from a stranded one. The teams that get this right run their agents like they run any other production system: with telemetry, with traceability, with budget control, and with the feedback loops that turn production behaviour into evaluation suite improvements.
The article is technical and practitioner-focused. It is written for engineering leads, architects, and AI platform owners who are responsible for production agents and need to think clearly about the observability stack they are building.
What Observability Means for AI Agents
Traditional system observability has three pillars: logs, metrics, and traces. AI agent observability extends this with two more: evaluation telemetry and cost telemetry.
Logs. Structured records of what the agent did. For an LLM-based agent, this includes the prompt sent, the model and parameters used, the response received, the tools invoked and their inputs and outputs, and any intermediate reasoning steps the agent surfaces.
Metrics. Aggregated counters and timers across the agent’s operations. Latency percentiles, error rates, retry counts, token usage rates, tool invocation counts.
Traces. Causal chains across an agent’s run. A single user request to an agent typically produces a tree of LLM calls, tool invocations, sub-agent calls, and external service requests. Distributed tracing makes this tree visible.
Evaluation telemetry. Quality signals on the agent’s outputs. Did the response contain the expected information? Did it hallucinate? Did it complete the task? Evaluation telemetry can be model-graded (an LLM-as-judge approach), heuristic, or user-provided (thumbs up / thumbs down, downstream task success).
Cost telemetry. Token consumption, cache hit rates, model selection patterns. AI cost is non-trivial in production, and it can climb quickly when an agent’s behaviour drifts.
A complete observability stack covers all five. Most teams start with two or three and reach the others gradually. The goal is not to build everything at once; the goal is to build enough to answer the operational questions that matter for your specific application.
The Questions Observability Must Answer
A useful test for your observability stack is whether it answers the questions a senior engineer asks during an incident. The standard set:
“Why did this user’s request produce that response?” You need a trace that captures the input prompt, the system prompt at the time, the tools the agent had access to, the tool calls it made, the intermediate reasoning if exposed, and the final response. The trace must be retrievable by user request ID, by session, or by approximate time window.
“Has this regression happened before?” You need historical traces that allow comparison of similar requests across time. If a user asks a comparable question today versus yesterday and gets a different result, the observability stack must let you compare the two.
“Why did our token bill jump 40% this week?” You need cost telemetry broken down by user, tenant, feature, and model. A cost regression that you cannot localise is one you cannot fix.
“Is the model degrading in quality?” You need evaluation telemetry over a representative production sample, with trend lines that surface gradual degradation.
“Did this deployment break anything?” You need pre- and post-deployment comparisons across the five observability pillars, ideally with automated alerting on significant shifts.
“Is this user trying to break the system?” You need traces that surface adversarial patterns — prompt injection attempts, jailbreaks, attempts to exfiltrate system prompts. Security-relevant signals must be detectable in observability data, not retrieved only after an incident.
If your stack cannot answer one of these, the gap is your highest-priority observability investment.
The Components of an AI Agent Observability Stack
A working stack has six components. Some can be combined into a single tool; some require dedicated systems.
1. Structured Run Logging
Every agent run produces a structured log record. The schema typically includes:
- A unique run ID and a higher-level session or trace ID.
- The user and tenant identifiers (with appropriate privacy controls).
- The timestamp and duration of the run.
- The model and version used.
- The input prompt or user request, with PII redaction applied.
- The system prompt or agent instructions in effect at the time.
- The tools available to the agent and the tools invoked.
- The intermediate reasoning steps where the model exposes them.
- The final response, with redaction applied.
- Token counts (input, output, cached).
- Cost (calculated from token counts and current pricing).
- Any errors, retries, or fallbacks that occurred.
A common mistake is logging this data as free text or as an untyped JSON blob rather than against a defined schema. A defined schema allows querying, aggregation, and the construction of dashboards that free text cannot support. Invest in the schema early.
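As a minimal sketch of the schema listed above, the record can be expressed as a typed structure that serialises to one JSON line per run. The field names are illustrative assumptions, and storing a system prompt version rather than the full prompt text is one possible design choice, not a requirement.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunRecord:
    """One structured log record per agent run (field names are illustrative)."""

    run_id: str
    trace_id: str
    tenant_id: str
    user_id: str
    started_at: str              # ISO 8601 timestamp
    duration_ms: int
    model: str
    system_prompt_version: str   # assumption: prompts are versioned elsewhere
    input_redacted: str          # user request after PII redaction
    output_redacted: str         # final response after PII redaction
    tools_available: list[str] = field(default_factory=list)
    tools_invoked: list[str] = field(default_factory=list)
    reasoning_steps: list[str] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    cached_input_tokens: int = 0
    cost: float = 0.0
    errors: list[str] = field(default_factory=list)
    retries: int = 0

    def to_json(self) -> str:
        """Serialise to a single JSON line with a stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Because every record carries the same named fields, a log store can index on tenant_id or model directly instead of parsing free text.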
2. Distributed Tracing
When an agent invokes tools or sub-agents, the run is not a single LLM call. It is a tree of calls. Distributed tracing captures the tree.
The OpenTelemetry GenAI semantic conventions, which matured through 2025 and into 2026, define a vocabulary of AI-specific span attributes: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, and so on. Attribute names have changed between versions of the conventions, so check the version your instrumentation targets. Instrumentation that follows these conventions is available for most major OpenTelemetry SDKs and integrates with established tracing backends (Jaeger, Tempo, Honeycomb, Datadog, Application Insights).
For agentic systems, the tracing is particularly valuable because a single user request can generate a tree of dozens of model calls. Without tracing, the operational visibility into what the agent did is severely limited.
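To make the tree concrete, here is a small standard-library sketch of what a trace for one request might contain. The Span class is a simplified stand-in for a real tracing span, not the OpenTelemetry API, and the span names and model names are hypothetical; only the gen_ai.* attribute keys are taken from the conventions mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """A node in the call tree of one agent run (simplified stand-in)."""

    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name: str, attributes: Optional[dict] = None) -> Span:
        """Create a child span, attach it to this span, and return it."""
        span = Span(name, dict(attributes or {}))
        self.children.append(span)
        return span


def total(span: Span, key: str) -> int:
    """Sum a numeric attribute over the span and all of its descendants."""
    own = int(span.attributes.get(key, 0))
    return own + sum(total(c, key) for c in span.children)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# One user request: a planning call, a tool call, then a cheaper answer call.
root = Span("invoke_agent support_assistant")
root.child("chat example-large", {
    "gen_ai.request.model": "example-large",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 150,
})
root.child("execute_tool lookup_order")
root.child("chat example-small", {
    "gen_ai.request.model": "example-small",
    "gen_ai.usage.input_tokens": 800,
    "gen_ai.usage.output_tokens": 90,
})
```

Rolling token usage up the tree, as total does here, is what lets a per-request cost figure sit alongside the trace rather than in a separate system.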
3. Evaluation Telemetry
Evaluation in production is different from evaluation in development. Development evaluations run on curated datasets with known correct answers. Production evaluations run on real traffic, often without ground truth, and must produce signal cheaply enough to apply to a meaningful sample of requests.
The patterns that work:
LLM-as-judge on a selected sample. Take a sample of production requests (1-10%, more for high-stakes applications), run an evaluation LLM against the input and output, and produce structured quality scores. Track the scores over time.
Implicit signals from downstream behaviour. If the agent is part of a larger workflow, downstream success is a quality signal. Did the user accept the agent’s suggestion? Did the next step in the workflow complete without intervention? These signals are noisier than explicit evaluation but vastly cheaper at scale.
Explicit user feedback. Thumbs up / thumbs down, satisfaction surveys, reported issues. Lower volume than implicit signals but higher fidelity for the requests that produce feedback.
Hallucination detection for retrieval-augmented agents. Did the response stay grounded in the retrieved context, or did the model fabricate? Specialist hallucination-detection models exist for this purpose; alternatively, a structured LLM-judge prompt can produce useful signal.
The evaluation telemetry must be integrated with the run logs and traces, so a quality regression can be drilled into a specific population of failing runs, then a specific run, then the trace of that run.
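A minimal sketch of the sampling step, assuming each run is a dict with run_id, input, and output keys. The judge argument is a placeholder for whatever evaluation LLM call you use; here it is any callable returning a score. Hashing the run ID makes the sample deterministic, so the same run is always in or out of the sample, and keeping the run ID on every score is what allows the drill-down from a score trend to a specific trace.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def in_sample(run_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Compare the first 64 bits of the hash against rate * 2^64.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def score_sample(
    runs: list[dict],
    judge: Callable[[str, str], float],
    rate: float,
) -> list[dict]:
    """Score the sampled runs and keep the run ID for joining with traces."""
    scored = []
    for run in runs:
        if in_sample(run["run_id"], rate):
            scored.append({
                "run_id": run["run_id"],
                "score": judge(run["input"], run["output"]),
            })
    return scored
```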
4. Cost Tracking and Attribution
AI cost in production is a real budget line. By 2026, mid-sized AI applications routinely consume between £10,000 and £500,000 per month in model API costs, with the high end concentrated in agentic and multi-modal workloads. Cost tracking that aggregates only at the bill level is insufficient.
Useful cost telemetry breaks down by:
- Per request. Direct attribution from token counts to cost.
- Per user and per tenant. For multi-tenant applications, this is essential — both for understanding which tenants are driving cost and for product-level decisions about pricing.
- Per feature. Which product features generate which fraction of total spend.
- Per model. When agents have model selection logic, which model is being chosen for which scenario, and at what cost.
- Per cache state. Cached vs uncached input tokens. Anthropic’s prompt caching (and equivalents from other providers) materially reduces cost — cache hit rate is itself a metric worth tracking.
The cost telemetry must support both real-time alerting (a sudden spike must be visible quickly) and historical analysis (a cohort of users whose cost has crept up over weeks must be identifiable). A combination of streaming aggregation into a real-time dashboard and batch aggregation into a data warehouse is the standard pattern.
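The breakdowns above reduce to one per-request cost calculation plus a group-by. The sketch below assumes run records are dicts with tenant, feature, model, and token count keys; the model names and prices are hypothetical placeholders, not any provider's actual rates.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical prices per million tokens; substitute your provider's rates.
PRICES = {
    "example-small": {"input": 1.00, "cached_input": 0.10, "output": 5.00},
    "example-large": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}


def request_cost(record: dict) -> float:
    """Cost of one request; input_tokens means uncached input tokens here."""
    p = PRICES[record["model"]]
    return (
        record["input_tokens"] * p["input"]
        + record["cached_input_tokens"] * p["cached_input"]
        + record["output_tokens"] * p["output"]
    ) / 1_000_000


def attribute(records: list[dict], dimension: str) -> dict[str, float]:
    """Total cost grouped by one dimension: tenant, feature, or model."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[dimension]] += request_cost(r)
    return dict(totals)


def cache_hit_rate(records: list[dict]) -> float:
    """Share of all input tokens that were served from the prompt cache."""
    cached = sum(r["cached_input_tokens"] for r in records)
    total = cached + sum(r["input_tokens"] for r in records)
    return cached / total if total else 0.0
```

The same attribute function serves both the streaming and the batch path; only the window of records passed in differs.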
5. Feedback Loop into Evaluation Suites
Production telemetry has its highest value when it feeds back into the evaluation suite that protects against regression. The pattern:
- Identify a production failure — a user-reported bad response, a quality score drop, a cost regression.
- Capture the exact production input that produced the failure.
- Add it to the evaluation suite with the expected behaviour.
- Run the evaluation suite on every model upgrade, prompt change, or significant deployment.
Over time, the evaluation suite becomes a curated dataset of the failure modes your application has actually encountered. Each subsequent change is tested against this dataset, which usually catches real regressions more reliably than a purely synthetic test set, because every case in it has already failed once in production.
The feedback loop is the single highest-leverage observability investment. It is also the one teams most consistently underinvest in. Build the workflow tooling early — make it easy to promote a production trace into the evaluation suite — and you compound the value of every subsequent investment.
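The promotion workflow can be sketched in a few lines. This assumes a run record with run_id and input_redacted keys, and it uses the simplest possible notion of expected behaviour (phrases the output must contain). Real suites typically use richer checks, so treat the EvalCase shape and the substring test as illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """A regression case derived from a production failure."""

    case_id: str
    source_run_id: str           # link back to the original trace
    input_text: str
    must_contain: tuple[str, ...]


def promote(run: dict, must_contain: list[str]) -> EvalCase:
    """Turn a failing production run into an evaluation case."""
    return EvalCase(
        case_id=f"case-{run['run_id']}",
        source_run_id=run["run_id"],
        input_text=run["input_redacted"],
        must_contain=tuple(must_contain),
    )


def run_suite(cases: list[EvalCase], agent: Callable[[str], str]) -> list[str]:
    """Return the IDs of cases whose output misses any expected phrase."""
    failures = []
    for case in cases:
        output = agent(case.input_text).lower()
        if not all(phrase.lower() in output for phrase in case.must_contain):
            failures.append(case.case_id)
    return failures
```

Running run_suite against each deployment candidate and blocking on a non-empty result is one way to wire this into a release pipeline.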
6. Alerting and Anomaly Detection
The final component is alerting. Useful alerts:
- Latency exceeding service-level objectives.
- Error rate exceeding baseline.
- Cost-per-request exceeding budget.
- Evaluation quality score dropping below threshold.
- Specific failure patterns (e.g., a class of tool errors that has historically correlated with model regression).
- Adversarial pattern detection (e.g., a sudden spike in prompt-injection-like inputs).
Alert hygiene applies the same as in any other system: alerts must be actionable, ownership must be clear, and noisy alerts must be tuned or removed. AI agent stacks are particularly prone to alert fatigue because the failure modes are subtler than in traditional systems, so investing in alert quality early pays compounding dividends.
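The first four alerts in the list can be sketched as threshold checks over a window of recent runs. The record keys (latency_ms, error, cost, quality_score) and the alert names are assumptions for illustration; production systems would normally express these rules in the alerting platform rather than in application code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    p95_latency_ms: float
    max_error_rate: float
    max_cost_per_request: float
    min_quality_score: float


def p95(values: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def evaluate_window(runs: list[dict], t: Thresholds) -> list[str]:
    """Return the names of the alerts that fire for this window of runs."""
    if not runs:
        return []
    alerts = []
    if p95([r["latency_ms"] for r in runs]) > t.p95_latency_ms:
        alerts.append("latency")
    if sum(1 for r in runs if r["error"]) / len(runs) > t.max_error_rate:
        alerts.append("error_rate")
    if sum(r["cost"] for r in runs) / len(runs) > t.max_cost_per_request:
        alerts.append("cost_per_request")
    # Only sampled runs carry a quality score, so ignore the rest.
    scores = [r["quality_score"] for r in runs if r.get("quality_score") is not None]
    if scores and sum(scores) / len(scores) < t.min_quality_score:
        alerts.append("quality")
    return alerts
```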
Tooling Choices: Self-Host vs Managed
The observability tool market for AI agents matured rapidly through 2024 and 2025. By 2026 the practical landscape looks like this:
Managed AI-specific platforms. LangSmith, Langfuse Cloud, Arize AI, and equivalents. Purpose-built for AI observability, with strong evaluation features, trace visualisation, and integration with popular agent frameworks. Lowest time-to-value; cost scales with usage.
Self-hosted AI-specific platforms. Langfuse and Phoenix (from Arize) both offer self-hosted deployments. Higher operational burden but full control over data residency and retention.
General-purpose observability with AI extensions. Datadog, Honeycomb, Application Insights, New Relic — all now have GenAI-specific tracing support via OpenTelemetry. Strong if you already have these platforms in operation; weaker than purpose-built platforms for evaluation and cost-specific features.
Custom builds on commodity infrastructure. OpenTelemetry collectors emitting to ClickHouse, with a custom dashboard layer. Most flexibility, highest engineering investment.
The right choice depends on:
- Data sensitivity. Highly sensitive prompts and outputs may not be permitted to leave your environment, even when a third-party service offers zero-retention guarantees; this favours self-hosting.
- Existing observability investment. If you have Datadog or Honeycomb across the rest of your stack, AI extensions there reduce operational complexity.
- Evaluation sophistication. Purpose-built AI platforms have stronger evaluation features than general platforms; for evaluation-heavy use cases, this matters.
- Engineering capacity. Custom builds require dedicated platform engineering investment over time.
For most enterprise customers, a purpose-built managed platform combined with an OpenTelemetry feed into the existing general-purpose observability stack produces the best balance of capability and effort.
Cost Control Specifics
Cost control deserves a dedicated section because it is the single most consistent place teams find unexpected spend.
Implement aggressive prompt caching. Anthropic, OpenAI, and other providers offer prompt caching that materially reduces cost on repeated prompt prefixes. Most agent applications can route 60-90% of input tokens through cache hits with appropriate prompt structure. Build this into the agent design from the start, not as a retrospective optimisation.
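Provider-side prompt caches generally match on an exact prompt prefix, so "appropriate prompt structure" mostly means putting stable content first and per-request content last. The sketch below illustrates the effect; the segment names are assumptions, and character counts are used as a rough proxy for tokens.

```python
from __future__ import annotations


def build_prompt(
    static_instructions: str,
    tool_definitions: str,
    request_context: str,
    user_message: str,
) -> list[str]:
    """Order segments from most stable to most variable."""
    return [static_instructions, tool_definitions, request_context, user_message]


def shared_prefix_fraction(previous: list[str], current: list[str]) -> float:
    """Fraction of the current prompt covered by leading segments that are
    identical to the previous prompt (an estimate of what a cache can reuse)."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += len(new)
    total = sum(len(segment) for segment in current)
    return shared / total if total else 0.0
```

If a timestamp or request ID is placed at the top of the prompt instead, the first segment differs on every request and the reusable fraction drops to zero, which is why prompt layout belongs in the initial agent design.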
Use the smallest model that meets the quality bar. Frontier models are expensive. For many agent steps, smaller models (Claude Haiku, GPT-4o mini, Gemini Flash) produce adequate quality at a fraction of the cost. Architect agents with explicit model selection rather than defaulting to the largest available.
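Explicit model selection can be as simple as choosing, per agent step, the cheapest model whose offline evaluation score clears the bar for that step. The model names, relative costs, and scores below are hypothetical; in practice the scores would come from your own evaluation suite.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    relative_cost: float   # cost relative to the cheapest option
    eval_score: float      # offline evaluation score for this step, 0 to 1


def cheapest_adequate(options: list[ModelOption], quality_bar: float) -> ModelOption:
    """Pick the lowest-cost model whose score meets the quality bar."""
    adequate = [o for o in options if o.eval_score >= quality_bar]
    if not adequate:
        raise ValueError(f"no model meets quality bar {quality_bar}")
    return min(adequate, key=lambda o: o.relative_cost)
```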
Be careful with retries. A failed agent run that retries three times can cost up to four times the budget of a single run. Retry logic must be bounded, and retry triggers must be selected carefully (transient API errors, yes; quality issues, generally no, because quality issues should be raised to humans, not retried mechanically).
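A bounded retry wrapper along these lines keeps the worst case explicit. TransientError is a hypothetical marker exception for failures such as timeouts or rate limits; mapping your client library's real exceptions onto it is left as an assumption. Anything else, including a quality check failure, propagates on the first attempt.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying only TransientError, at most max_retries times."""
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                raise   # budget exhausted: surface the failure
            sleep(base_delay_s * 2 ** attempt)   # exponential backoff
            attempt += 1
```

With max_retries=2 the worst case is three attempts, so the cost ceiling for a run is known in advance.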
Watch for cost regressions on model upgrades. When a provider releases a new model, the input and output token counts can shift even when the prompt is unchanged, for example because the tokeniser differs or the new model writes longer responses. Validate cost-per-request before rolling a new model into production.
Set budget alerts and hard limits. Streaming budget alerts at the request level, budget caps at the user level (where appropriate), and tenant-level budget caps for multi-tenant applications. The alert structure should catch a runaway loop before it accumulates significant spend.
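A tenant-level cap with an early warning can be sketched as a small in-memory guard. The class and method names are illustrative, and the 80% alert fraction is an arbitrary assumption; a real deployment would keep the counters in shared storage and reset them per billing period.

```python
from __future__ import annotations

from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised before a request runs when the tenant has used up its cap."""


class BudgetGuard:
    def __init__(self, tenant_caps: dict[str, float], alert_fraction: float = 0.8):
        self.tenant_caps = tenant_caps
        self.alert_fraction = alert_fraction
        self.spent: dict[str, float] = defaultdict(float)

    def check(self, tenant: str) -> None:
        """Hard limit: call before each request."""
        if self.spent[tenant] >= self.tenant_caps[tenant]:
            raise BudgetExceeded(f"tenant {tenant} has reached its budget cap")

    def record(self, tenant: str, cost: float) -> bool:
        """Add spend; return True once, when the alert threshold is crossed."""
        threshold = self.alert_fraction * self.tenant_caps[tenant]
        before = self.spent[tenant]
        self.spent[tenant] = before + cost
        return before < threshold <= self.spent[tenant]
```

Calling check before every model call inside the agent loop, not just once per user request, is what stops a runaway loop partway through.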
Common Anti-Patterns
Patterns we have seen repeatedly in production AI deployments:
Logging the prompt as plain text in a non-structured log. Makes querying impossible at scale. Always use structured logging.
No PII redaction before logging. Logs become a privacy liability. Redact at the source, not at the dashboard.
Trace data without cost data. You can see what happened but not what it cost. Always integrate cost telemetry with traces.
Evaluation data disconnected from traces. A failing evaluation must lead you back to the production trace that produced it. Disconnected evaluation surfaces produce signal you cannot act on.
Unbounded retention of prompt content. Prompts and responses contain customer data. Retention policies must be defined and enforced, not implicit.
No tenant-level cost attribution in multi-tenant systems. When a tenant runs up unexpected cost, the operational team must be able to identify and contact them. Without per-tenant attribution, cost issues cannot be addressed cleanly.
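For the redaction anti-pattern above, "redact at the source" means the text is cleaned before it reaches any log call. The sketch below uses two deliberately simple regular expressions for email addresses and phone-like digit runs. Pattern matching of this kind will miss names, addresses, and many other identifiers, so treat it as an illustration of where redaction sits, not as adequate coverage.

```python
from __future__ import annotations

import re

# Simplified patterns; real deployments usually combine several detectors.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def log_fields(user_input: str, response: str) -> dict[str, str]:
    """Build the loggable fields; raw text never leaves this function."""
    return {
        "input_redacted": redact(user_input),
        "output_redacted": redact(response),
    }
```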
A Reference Implementation Sketch
A common production stack for a UK enterprise AI deployment in 2026:
- Agent framework with built-in instrumentation (Anthropic SDK with prompt caching, the LangChain or LlamaIndex equivalents, or a custom orchestration layer that emits OpenTelemetry).
- OpenTelemetry collector receiving traces from the agent framework, enriching them with deployment metadata, and routing to multiple sinks.
- Langfuse or LangSmith as the AI-specific observability platform, receiving full traces and prompt/response content (with appropriate redaction).
- Datadog or Honeycomb as the general-purpose observability platform, receiving traces and metrics for cross-system correlation.
- A dedicated cost dashboard built on the cost telemetry, with real-time alerts and historical drill-down.
- An evaluation pipeline (often built around the Langfuse / LangSmith dataset features or a custom workflow on top) that captures production traces, allows curation into evaluation suites, and runs the suites against deployment candidates.
This stack is not the only viable shape, but it is representative of what works in 2026 for enterprise AI deployments at meaningful scale.
Where McKenna Helps
McKenna Consultants delivers AI agent observability engagements as part of our broader AI-First practice. Typical engagements:
Observability stack assessment. Review the existing observability posture against the components above, identify gaps, prioritise the investments that produce the largest operational improvements.
Stack implementation. Build the structured logging, tracing, evaluation, and cost tracking infrastructure for an AI agent application. Output: a working observability stack with documented operational procedures.
Cost optimisation. A focused engagement on AI cost reduction — prompt caching, model selection, retry tuning, cost attribution. Output: documented cost reduction and ongoing budget controls.
Production readiness review. A short-form engagement assessing whether an AI application is ready to scale to production volumes, with observability being one of the key dimensions reviewed.
If your AI application is in production or approaching production, and your observability stack is ad-hoc rather than designed, contact us — this is one of the most common engagements we run, and the production discipline gap is one of the most consistent friction points across enterprise AI programmes in 2026.