AI Agent Observability with OpenTelemetry: Self-Hosted Monitoring Stack

The Blind Commander

That $47 Tuesday I told you about – one agent, one night, no supervision? I found out from an email. An AWS billing email, at 9 AM, with my coffee going cold on the desk.

Not from my system. My system had nothing to say. No alert. No dashboard. No blinking red light. Just an agent that had been running unsupervised for eight hours and a billing page that told the story in retrospect – like reading about a car crash in the morning paper when you were the one driving.

I had built the agent. I had given it tools. I had given it access. What I hadn't given it was a single way to tell me what it was doing, how much it was spending, or whether any of it was working.

I was a commander giving orders into the dark. The only signal I received was the invoice.

A dashboard you check after the disaster is not a dashboard. It's an autopsy report.

The Blindfold

If that story made you uncomfortable – good. Your setup is identical.

Every AI agent deployment in 2026 shares the same architecture: tokens go in, something comes out, and everything in between is a black box. You know the prompt. You know the response. You know nothing about the journey.

Somewhere in a venture-funded office, a team is celebrating their agent's "successful autonomous deployment." They know it was successful because the agent said so. They know the agent is reliable because it has never reported a failure. It has also never reported anything else.

How much did that session cost? Which tools did the agent call? Which rules did it consult – or did it improvise because it couldn't find any? How long did it spend planning versus executing? Did it hallucinate a function that doesn't exist and then write tests for it?

You don't know. Nobody knows. The agent certainly won't tell you – it'll say "task completed successfully" with the confidence of a surgeon who operated blindfolded and assumes the patient is fine because nobody screamed.

The industry ships agents like submarines without sonar. Full speed ahead, zero visibility, and the crew finds out about the iceberg when the hull cracks.

This isn't a bug. This is the default.

Three Signals

Observability is an old discipline. Infrastructure engineers have been doing this for decades. But when it comes to AI agents, the industry collectively decided to skip the chapter. Too busy scaling. Too busy shipping. Too busy writing blog posts about autonomous agents while running them with less monitoring than a thermostat.

In Mass Effect, you always had a tactical display. Every squad member – position, shields, health, weapon status. Real-time. You didn't command Garrus by hoping he was fine. You knew.

Now imagine Shepard commanding the squad blindfolded. That's the current state of AI agent deployment.

The shepherd's eye sees three signals. Not because three is a magic number – because three is what you need.

Metrics: The Pulse

Numbers. Cold, unfeeling, beautiful numbers.

How many tokens did each agent consume today? What's the cost per model, per task type? How long do sessions take? What's the success rate? How many tool calls per session? How many of those calls failed?

Metrics are the vital signs. Pulse and blood pressure. They don't tell you why the patient is sick, but they tell you – instantly, without ambiguity – that something is wrong. Or that everything is fine and you can sleep.

The $47 Tuesday would have been a $5 Tuesday with one metric: cost_usd_total{agent, model} and an alert at $10. That's it. One number. One threshold. One night of sleep instead of one morning of dread.

Logs: The Story

Metrics tell you that something happened. Logs tell you what.

Every agent session produces a structured record: session ID, agent name, rules consulted, tools used, decisions made, duration. Not console.log("here") – structured JSON that can be queried, filtered, correlated across sessions.

When an agent consults RULE-015 and then calls github.create_pr, that's in the log. When an agent consults no rules and improvises – that's in the log too. And that second entry should make your blood run cold, because an agent without rules is a sheep without a fence. You know where the wolves are.

The log is the interrogation room. Everything the agent did is on the table. The question is whether anyone bothers to look.

Traces: The Path

Metrics tell you that. Logs tell you what. Traces tell you where the time went.

A trace is the X-ray of a session. The full breakdown: planning took 2.1 seconds, implementation took 8.5 seconds (of which 6.2 was an LLM call and 2.3 was creating a PR), review took 3.2 seconds. Every step. Every nested MCP tool call. Every millisecond accounted for.

This is where you discover that your agent spends 40% of its time loading context it never uses. That the "fast" model is actually slower because it retries three times. That the review step calls an LLM that adds cost but catches zero bugs. That your "efficient pipeline" is three agents in a trench coat, each waiting for the other to finish.

Traces don't lie. Your agent's self-reported "task completed efficiently" does.

Four Questions

A shepherd doesn't need a hundred dashboards. A shepherd needs to answer four questions – at any moment, without hesitation. These aren't SRE questions about uptime and latency. These are command questions – the kind a commanding officer asks about a squad in the field.

"How much is this costing me?"

Cost per agent, per model, per task. Today, this week, this month. Predicted versus actual. Budget alerts that fire before the $47 email – not after. If you can't answer this question in under ten seconds, you are not in command. You are a passenger.

"Who is performing and who is wandering?"

Success rate by agent. Average task duration. Failure reasons. Human override rate – how often did someone have to step in and fix what the agent broke? This is where sheep become soldiers or stay sheep. The numbers don't lie, and they don't take it personally.

"What is happening right now?"

Active sessions. Pending approvals. Error rate in the last five minutes. Latency percentiles. The real-time pulse of the system. Not a report you check on Monday morning – a live feed. Because the $47 agent ran for eight hours, and if I had seen the first hour, there would have been no second.

"How well is the system working over time?"

Tasks completed versus escalated. Rule violations. Quality trajectory – is the system getting better or worse? This is the question most people never ask, because answering it requires memory. And memory requires seeing first.

Why This Stack

Three words: open, standard, yours.

Your System (Shepherd Core, MCP Hub, Agents)
                    │
          OpenTelemetry SDK
                    │
              OTel Collector
           ┌────────┼────────┐
           ▼        ▼        ▼
       Prometheus   Loki    Tempo
       (metrics)   (logs)  (traces)
           └────────┼────────┘
                    ▼
                 Grafana
              ┌────┴────┐
         Dashboards   Alerts → Slack

OpenTelemetry is not a choice. It's a default. CNCF graduated project, vendor-agnostic, industry standard. Your telemetry speaks the same language regardless of where it ends up. You are not locked in. Ever.

Why Prometheus, Loki, Tempo – and not the alternatives? Because they're from the same family. Tempo is Grafana Labs' tracing backend – native integration, TraceQL as a query language, zero context switching between metrics, logs, and traces in a single UI. Jaeger is a fine project, but it's a separate ecosystem. When your dashboards, alerts, and exploration live under one roof – that's not convenience. That's operational simplicity.

Loki over Elasticsearch? Elasticsearch's open-source licensing has been a soap opera – Apache 2.0 to SSPL to AGPL, with an AWS fork in the middle. But licensing aside: Elasticsearch is a search engine repurposed for logs. It's a JVM cluster that demands tuning, shard management, and dedicated attention. Loki indexes only labels and stores compressed log lines on cheap storage. It's purpose-built for logs, not repurposed from something else. Simpler to run. Simpler to own.

One collector. One pipe. All three signals flow through a single OTel Collector into their respective stores. One configuration. One place to debug when something breaks.

This is Tenet III: Own Your Stack. This is not a Datadog invoice. This is not someone else's SaaS you're renting month to month, hoping they don't change the pricing. This is infrastructure you control.

And here's the part nobody talks about: this stack is not a single-purpose tool. You deploy it once – and it serves everything. Grafana supports multi-tenant. One tenant is your LLMOps – the shepherd's eye watching the herd. Another tenant is your API backend. Another is your data pipeline. Another is whatever you're building next Tuesday.

You're not buying a flashlight for one room. You're wiring electricity into the building. And if someday you outgrow it – if you decide you need Datadog or New Relic or whatever the enterprise flavor of the month is – you forward the data. OpenTelemetry doesn't care where the signals go. That's the whole point.

Chaos With Better Lighting

So. The eye is open. The dashboards are beautiful. Prometheus is scraping. Loki is ingesting. Tempo is tracing. Grafana panels glow in the dark like a starship bridge.

And what does the eye see?

Chaos.

Agents without rules. Sessions without structure. Tool calls without governance. The same sheep wandering the same fields, except now you can watch them wander in real-time with millisecond precision.

You can see the agent that refactored a module three times at 3 AM. You can see the cost climbing. You can see the trace – planning, implementing, reviewing, implementing again, reviewing again, implementing again. A perfect spiral of wasted tokens, rendered in beautiful telemetry.

You can monitor a dumpster fire in 4K. It's still a dumpster fire.

Observability without governance is surveillance without consequence. A panopticon where the guards watch the screens, nod thoughtfully, and do nothing. The eye sees everything – but seeing is not commanding. A hundred dashboards won't save you if the herd has no rules.

The eye needs a codex.

The stare needs teeth.

What the Eye Remembers

But there's something else. Something quieter.

The eye doesn't just see. It remembers.

Every metric, every log, every trace – they're not just pixels on a dashboard. They accumulate. Patterns emerge. An agent that fails SQL tasks on Tuesdays. A model that costs three times more for the same output quality. A tool that gets called in every session but never contributes to the result.

Over time, the eye builds a picture that no single dashboard can show. Not a snapshot – a trajectory. Not a moment – a history.

The eye sees. And what it sees... becomes memory.

But that is for another dispatch.

What Comes Next

This was the philosophy. The why behind the eye. Three signals, four questions, one stack that belongs to you.

Next comes the practice. A repository you can clone. A docker-compose you can run. Eight dashboards you can see in ten minutes. The Eye, deployed.

Next dispatch: The Eye, Part 2: Wiring.

This blog is optimized for both human readers and LLM consumption. Every article follows a clear heading hierarchy and is available via RSS and llms.txt.

About the author

Building the system. Writing the field manual.

Shepard