Agent observability tools

The problem

Agents are opaque. You see the output, not the conversation that produced it. A surprising amount of “this AI is being dumb” turns out to be “the AI saw something different from what I assumed,” and you only find that out if you can look at the actual traffic.

The approach

I built three of these, and the shape kept changing as I went:

llm-trace came first: a local gateway that traffic routes through, with a trace viewer on top, redaction, and session replay. Good for understanding a whole multi-step run rather than a single call.
llm-inspector was the pivot: a desktop app built around a turn-based view of a session, with side-by-side request comparison, so you can diff a prompt that worked against one that did not.
agent-kit is where it generalised, and the one I actually use now: a local proxy in front of Claude that records every call and surfaces it in a dashboard, with spend tracking, trace browsing, and the tools, MCP servers, instructions, and skills a run touched.
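The side-by-side comparison idea can be approximated with nothing but the standard library. The following is a minimal sketch, not llm-inspector's actual code: the function name `diff_prompts` and the example prompts are made up for illustration, and it assumes both prompts are available as plain text.

```python
import difflib


def diff_prompts(worked: str, failed: str) -> list[str]:
    """Return a unified diff between two prompt texts, compared line by line."""
    return list(
        difflib.unified_diff(
            worked.splitlines(),
            failed.splitlines(),
            fromfile="worked",
            tofile="failed",
            lineterm="",
        )
    )


# Hypothetical prompts: the second run silently received an empty file.
worked = "You are a code reviewer.\nOnly comment on bugs.\nFile: app.py"
failed = "You are a code reviewer.\nOnly comment on bugs.\nFile: (empty)"

for line in diff_prompts(worked, failed):
    print(line)
```

The lines prefixed with `-` and `+` are the ones that differ, which is usually where "the AI saw something different from what I assumed" shows up.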

Figure: agent-kit's dashboard trace view. One session shown turn by turn, with every request, response, tool call, and result listed down the page.
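To make the turn-by-turn idea concrete, here is a minimal sketch of the recording side, assuming calls are appended to a local JSON Lines file. It is not agent-kit's implementation: the field names (`ts`, `session_id`, `request`, `response`) and the helpers `record_call`, `session_turns`, and `last_call` are hypothetical, and a real proxy would call something like `record_call` from its request handler.

```python
import json
import time
from pathlib import Path
from typing import Optional


def record_call(log_path: Path, session_id: str, request: dict, response: dict) -> None:
    """Append one request/response pair to a JSON Lines log."""
    entry = {
        "ts": time.time(),
        "session_id": session_id,
        "request": request,
        "response": response,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def session_turns(log_path: Path, session_id: str) -> list:
    """Return every recorded call for one session, in the order it was written."""
    turns = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if entry["session_id"] == session_id:
                turns.append(entry)
    return turns


def last_call(log_path: Path) -> Optional[dict]:
    """Return the most recently recorded call, or None if the log is empty."""
    last = None
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last = json.loads(line)
    return last
```

An append-only file keeps everything on the local machine and makes "what did the last call look like" a single function call, at the cost of a linear scan that would need an index once the log grows large.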

Technical challenges

Friction is the feature. If checking “what did the last call look like” takes more than a few seconds, I will not do it, and I am back to guessing. Every design decision bent toward keeping that cost near zero.
Local-first by default. Everything stays on my machine, so I never think twice about routing my traffic through it.

Figure: the dashboard's spend view, with cost totals for today, the last seven days, and all time, plus a list of the most expensive sessions. Cost per session turned out to be a great way to catch an agent stuck in a loop.
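The loop-catching use of cost per session can be sketched as a small aggregation. This is an illustration under stated assumptions rather than the dashboard's real logic: the prices in `PRICE_PER_MTOK` are placeholders, the call records are assumed to carry `session_id`, `input_tokens`, and `output_tokens`, and the "more than five times the median" threshold is an arbitrary choice.

```python
from collections import defaultdict
from statistics import median

# Placeholder prices in dollars per million tokens; substitute real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call under the placeholder prices above."""
    total = (
        input_tokens * PRICE_PER_MTOK["input"]
        + output_tokens * PRICE_PER_MTOK["output"]
    )
    return total / 1_000_000


def cost_per_session(calls: list) -> dict:
    """Sum call costs by session_id."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["session_id"]] += call_cost(
            call["input_tokens"], call["output_tokens"]
        )
    return dict(totals)


def suspicious_sessions(totals: dict, factor: float = 5.0) -> list:
    """Sessions costing more than `factor` times the median, most expensive first."""
    if not totals:
        return []
    typical = median(totals.values())
    flagged = [s for s, cost in totals.items() if cost > factor * typical]
    return sorted(flagged, key=lambda s: totals[s], reverse=True)
```

A session that repeats the same tool call dozens of times stands out immediately against the median, even when each individual call looks cheap.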

Outcome

Three working tools, and a clear lesson from building the same thing three times: the real feature is not the dashboard or the desktop UI. It is reducing the friction of seeing the last call to almost nothing.

What I’d do differently

Get to the agent-kit model sooner. The gateway and the desktop app each taught me something, but a low-friction local proxy with a dashboard is what I actually reach for. I spent a while on narrower tools before landing there.