← All posts

I kept building tools to watch my AI agents

Three takes on the same itch: when an agent does something weird, I want to see exactly what it sent and got back.

Every time I work with coding agents, I hit the same wall. Something goes sideways, and I have no real idea what the model actually saw. You get the output, not the conversation that produced it. So I kept building variations of the same tool: a way to watch the traffic. Three of them, and the shape kept shifting as I went.

llm-trace

It started with llm-trace. Rather than instrument a single tool, it is a local gateway that traffic routes through, with a trace viewer on top. It redacts secrets, and it lets me replay a session to see how it unfolded. This was the version for when “show me the last call” grows into “let me understand this whole multi-step run.”

llm-inspector

Then I pivoted to llm-inspector, a proper desktop app built around a turn-based view of a session. The feature I wanted most here was comparison: put two requests side by side and see what changed. When an agent works on one prompt and fails on a nearly identical one, the diff is where the answer usually hides.

agent-kit

Eventually it generalised into agent-kit, and this is the one I actually use day to day. It puts a local proxy in front of Claude, records every API call, and surfaces them in a dashboard. You run agent-kit trace, it spawns the proxy and launches Claude, and from then on every request and response is captured. Git hooks index things automatically so I do not have to think about it.

The dashboard turned into more than a flat list of calls. It groups traffic into sessions and estimates what each one cost from token usage, so I can see spend for today, the last seven days, and all time, and which sessions were the expensive ones. That alone changed how I work. Watching a single session quietly run up a dollar is a fast way to notice an agent stuck in a loop. Alongside the cost view it lets me browse the traces themselves and pulls out the other things a run touches: the tools it called, the MCP servers, the instructions it was given, the skills it used.

The trace detail view: one session shown turn by turn, with each request, response, tool call, and result listed down the page.
One session, turn by turn. This is the view I open when an agent did something I did not expect.
The dashboard's spend view: cost totals for today, the last seven days, and all time, with a list of the most expensive sessions.
The spend view. Cost per session is a surprisingly good way to spot an agent stuck in a loop.
The dashboard's instructions view, showing AGENTS.md, CLAUDE.md and other agent instruction files and a tool to generate them for different agents.
It also keeps the instruction files (CLAUDE.md, AGENTS.md and the rest) in sync across different agents.

The whole design goal was low friction. If checking “what did the last call look like” or “why did that session cost a dollar” takes more than a few seconds, I will not do it, and then I am back to guessing.

It is local first. Everything sits on my machine. There is also a small backend I built, agent-hub, that the kit can optionally push traces to so they live somewhere central rather than only on one laptop, but the local dashboard is the part I actually use day to day.

Building the same thing three times

I did not set out to build three tools. Each one started because the previous one did not quite fit the next situation, and the idea kept getting more general: a gateway, then a desktop app, then a proxy that just sits there and records. Doing it three times taught me what the actual feature is, and it is not the dashboard or the desktop UI or the gateway. It is reducing the friction of seeing the last call to almost zero, and keeping all of it local so I never think twice about routing my traffic through it.

Agents are opaque by default. A surprising amount of “this AI is being dumb” turns out to be “the AI saw something different from what I assumed.” You only find that out if you can look.