14 Comments
Pawel Jozefiak:

this hits hard. 'Production hell' perfectly describes my first month trying to run an autonomous agent. It worked beautifully in demos but fell apart in real conditions. The gap between 'prototype works' and 'runs unattended for weeks' is massive.

My breakthrough came when I stopped adding features and started adding failure handling - retry logic, circuit breakers, alerting thresholds. The boring infrastructure work made the difference, not smarter prompts. Production-ready agents aren't about capability, they're about reliability. My night-shift agent now has better uptime than some of my production services, but it took three months of debugging edge cases to get there. Full story: https://thoughts.jock.pl/p/my-ai-agent-works-night-shifts-builds
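For readers wondering what that "boring infrastructure work" can look like, here is a minimal sketch of retry-with-backoff plus a simple circuit breaker. All class names, thresholds, and parameters below are illustrative assumptions, not taken from the author's actual system:

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to attempt a call (hypothetical name)."""

class CircuitBreaker:
    # Trip after `max_failures` consecutive errors; stay open for `reset_after` seconds.
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result

def retry(fn, attempts=3, base_delay=0.1):
    # Retry with exponential backoff; re-raise the last error when exhausted.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)
```

The point of the sketch is the division of labor: retries absorb transient failures, while the breaker stops a dead dependency from eating every retry budget.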

Priank Ravichandar:

This was such a great breakdown of the reality of implementing agents! The reliability issues can make improving performance a real challenge. It also gets especially difficult when you can't accurately diagnose why the agent is failing at certain tasks.

Hodman Murad:

You can't fix what you can't trace, and stochastic failures make pattern recognition nearly impossible.

Wire You Networks:

Excellent piece Hodman.

Hodman Murad:

Thank you! I have a follow up coming out in the next couple of days!

Wire You Networks:

Looking forward to it. I experienced this on a smaller scale when putting together an infrarag system, so the read hit home. Lots of lessons learned.

Il mecenate dell'IA:

What this really shows is that agents are not a software problem but a systems problem. Reliability, cost, and integration failures are symptoms of trying to automate ambiguity. The teams that ship are the ones who first decide what must remain deterministic.

Hodman Murad:

You make a good point! The breakthrough isn't a new model, but the engineering discipline to define the deterministic backbone.

Il mecenate dell'IA:

Agreed — but I’d push it one step further.

The deterministic backbone isn’t just an engineering safeguard, it’s a governance boundary. Without it, agents don’t fail gracefully — they fail politically, because no one can explain or own their actions.

At that point, the problem stops being technical and becomes organizational.

Dr Sam Illingworth:

Fantastic read as ever, Hodman. Thanks so much. I was just wondering: what would be your best advice for independently evaluating AI agents without burning through tokens? Is there an effective way to do this? The usual processes seem to end with the AI agents basically marking each other's homework.

Hodman Murad:

Thank you! The "AI grading AI" problem is real and gets worse as you scale evaluation.

To answer your question, I would recommend checking whether your agent did what it was supposed to do, rather than trying to evaluate how it thought through the problem.

For example, if your agent is supposed to book a meeting, you can check: (1) Did it actually create a calendar event? (2) Is the time correct? (3) Did the task complete end to end? You can verify all of this without asking another AI to judge the work.

For more complex situations, create a collection of real examples with known correct answers, then regularly test your agent against them. Track when it starts getting things wrong. This gives you a reliable baseline without needing AI judges.
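That outcome-first approach can be sketched in a few lines. Everything here is a toy stand-in: `calendar` is whatever store the agent writes into, and the function names and case format are illustrative assumptions, not the author's actual setup:

```python
def check_booking(calendar, expected_title, expected_start):
    """Verify the side effect, not the agent's reasoning."""
    event = calendar.get(expected_title)      # (1) was a calendar event created?
    if event is None:
        return False
    return event["start"] == expected_start   # (2) is the time correct?

def run_golden_set(agent, calendar, cases):
    """Run the agent on known-answer cases and return the pass rate.

    Each case pairs an instruction with the known-correct outcome,
    so no AI judge is needed to grade the result.
    """
    passed = 0
    for case in cases:
        agent(case["instruction"])            # let the agent act
        if check_booking(calendar, case["title"], case["start"]):
            passed += 1
    return passed / len(cases)
```

Tracking this pass rate over time is the "reliable baseline" idea: a drop signals the agent has started getting things wrong, without spending tokens on a second model's judgment.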

I'm planning on writing about how I tackle this with Asaura AI over the next couple of months.

Chris Tottman:

Incredibly comprehensive as always! Thanks for sharing ✨

Hodman Murad:

My pleasure! Just trying my best!