“2025 will be the year of AI agents”
— the Internet
It doesn’t matter where you look. If you spend any amount of time with tech people, you’ll hear about agentic AI and how it’s the next big thing.
But what is agentic AI? I like Nvidia’s description:
Agentic AI uses sophisticated reasoning and iterative planning to autonomously solve complex, multi-step problems — Nvidia
Simple, right? You may have already come across some of these systems. Notable examples include OpenAI’s DeepResearch and Operator, and GitHub Copilot’s Agent mode.
Due to the popularity and novelty of this topic, there are many vendors selling agentic solutions, frameworks and platforms, promising the world, but that perhaps don’t live up to expectations. Moreover, it can be hard to cut through the hype if you don’t know what you’re looking for.
So I decided to build a simple multi-agent system and take notes on what I found hard and complicated. The solution was written in Java, using Akka and Anthropic’s Model Context Protocol (MCP). As a result of building this system, I’ve learned a few things about AI agent frameworks that I hope will save you time (and money!) as you start looking into this space.
Do you even need Agents?
Before diving into the evaluation framework, let’s address the elephant in the room: do you actually need AI agents or are you just looking for a nail for your agentic AI hammer?
Simple is often best
Yes, the solution I built works. Yes, it uses tool calling. Yes, it uses MCP. But I consider it a contrived example.
In particular, tool calling isn’t needed at all; I simply added it for educational purposes. The same outcome could have been achieved through:
Well-crafted prompts with clear instructions
Simple chaining of LLM calls without complex orchestration
Basic integration patterns rather than full agent architectures
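To make that concrete, here is a minimal sketch of the chaining idea in plain Java. `callLlm` is a stub standing in for a real LLM client call; the method names are illustrative, not from any particular SDK:

```java
public class SimpleChain {
    // Stub standing in for a real LLM API call.
    public static String callLlm(String prompt) {
        return "response(" + prompt + ")";
    }

    // Two chained calls: the first output becomes part of the second prompt.
    // No agents, no orchestration framework -- just sequential composition.
    public static String summariseAndReport(String input) {
        String summary = callLlm("Summarise: " + input);
        return callLlm("Write a short report based on: " + summary);
    }

    public static void main(String[] args) {
        System.out.println(summariseAndReport("quarterly sales figures"));
    }
}
```

For many "agentic" use cases, this kind of plain sequential chaining is all the orchestration you actually need.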
The cost of incidental complexity
Agentic systems can introduce significant overhead. Building robust inter-agent communication in a distributed system is hard. Unless your use case specifically requires autonomous decision-making across multiple specialised components, this overhead rarely justifies the benefits.
Ask yourself these questions before proceeding with an agent architecture:
Does your solution need the ability to do multi-step planning through advanced reasoning?
Do you require specialised capabilities that can’t be handled by a single model?
Is there a need for persistent memory beyond what can be managed in a conversation history?
Would your solution benefit from parallel processing of related tasks?
If you answered “no” to at least two of these questions, you likely don’t need a multi-agent system.
Now, for those use cases that might benefit from it, let’s explore how to evaluate the options...
The AI Agent Evaluation Framework
1. Workflow Orchestration
The backbone of any agent system lies in how it manages the flow of information and tasks between components. This goes beyond point-to-point connections and into richer workflows that increase the overall usefulness of the system.
Key considerations:
Inter-agent communication: How do agents talk to one another? Is there a standardised protocol, or are you dealing with proprietary methods that create vendor lock-in?
Workflow topology: The system should support multiple patterns:
Serial workflows for step-by-step processing
Parallel workflows for concurrent operations
Aggregation patterns for synthesising multiple outputs
Handoff mechanics: The transition points between agents are often where systems break down. How cleanly can one agent pass its context and outputs to another?
In my implementation with Akka, the actor model provided robust asynchronous message-passing semantics, but I still needed to carefully design the message structure to ensure proper context preservation.
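Without pulling in Akka, the message-structure idea can be sketched in plain Java: each handoff copies the accumulated context forward so downstream agents see prior outputs. Names like `Envelope` and `handoff` are illustrative, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class AgentMessage {
    // An immutable envelope: a correlation id for tracing, the current payload,
    // and a context map accumulating each agent's output along the way.
    public record Envelope(String correlationId, String payload, Map<String, String> context) {
        // Produce the next hop's envelope, preserving context and
        // recording this agent's output under its name.
        public Envelope handoff(String agentName, String output) {
            var next = new HashMap<>(context);
            next.put(agentName, output);
            return new Envelope(correlationId, output, Map.copyOf(next));
        }
    }

    public static void main(String[] args) {
        var start = new Envelope("req-1", "find sources on topic X", Map.of());
        var afterSearch = start.handoff("search-agent", "3 sources found");
        var afterAnalysis = afterSearch.handoff("analysis-agent", "synthesised summary");
        // Both agents' outputs survive the handoffs.
        System.out.println(afterAnalysis.context());
    }
}
```

The key design point is immutability: each handoff produces a new envelope rather than mutating shared state, which maps naturally onto actor-style message passing.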
2. Tool calling and interoperability
AI agents become truly powerful when they can interact with external tools. This dimension evaluates how effectively a framework supports these interactions.
Key considerations:
Model compatibility: Not all LLMs support function calling or tool use. Your framework needs to account for these differences or standardise across models.
Tool specification: Different LLMs expect tools to be defined in different formats. In my case, I had to write an adapter from MCP tool specs to OpenAI’s function calling format. (OpenAI has recently announced they will be adopting MCP for their Agents SDK. Hopefully others will follow suit.)
Execution environment: Where do tools actually run? Is execution happening in a secure sandbox, or are you opening security vulnerabilities?
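As a rough illustration of the adapter idea, the sketch below uses simplified maps to stand in for the real MCP and OpenAI schemas; the actual SDK types and field sets differ, but the structural translation is the point:

```java
import java.util.Map;

public class ToolSpecAdapter {
    // MCP tools expose a name, description, and JSON-schema input definition.
    // OpenAI's function-calling format nests roughly the same information
    // under {"type":"function","function":{...,"parameters":...}}.
    public static Map<String, Object> mcpToOpenAi(
            String name, String description, Map<String, Object> inputSchema) {
        return Map.<String, Object>of(
                "type", "function",
                "function", Map.of(
                        "name", name,
                        "description", description,
                        "parameters", inputSchema));
    }

    public static void main(String[] args) {
        var schema = Map.<String, Object>of(
                "type", "object",
                "properties", Map.of("query", Map.of("type", "string")));
        System.out.println(mcpToOpenAi("brave_search", "Web search via Brave", schema));
    }
}
```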
3. Observability and Operational Maturity
An agent system is only as good as your ability to understand what’s happening inside it. This dimension evaluates how well you can monitor, debug, and maintain the system. Multi-agent solutions exponentially grow in complexity, making this dimension vital for your system’s success.
Key considerations:
Workflow visualisation: Can you see the execution path of a request through your agent system?
Tool call instrumentation: Are you capturing what tools were called (and why), with what parameters, and what results they returned?
Smart sampling: Can you intelligently sample communications between agents to identify issues while managing storage costs?
Error handling and recovery: How gracefully does the system handle failures? Can it recover or does it crash entirely?
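One way to approach tool call instrumentation is a thin recording wrapper around every invocation. This is a plain-Java sketch with illustrative names, not a production tracing setup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ToolCallRecorder {
    // One record per tool invocation: what was called, with what, and the outcome.
    public record CallRecord(String tool, String params, String result, long durationMs) {}

    private final List<CallRecord> records = new ArrayList<>();

    // Wrap a tool invocation so parameters, result, and timing are captured.
    public String invoke(String toolName, String params, Function<String, String> tool) {
        long start = System.nanoTime();
        String result = tool.apply(params);
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        records.add(new CallRecord(toolName, params, result, elapsed));
        return result;
    }

    public List<CallRecord> records() { return List.copyOf(records); }

    public static void main(String[] args) {
        var recorder = new ToolCallRecorder();
        recorder.invoke("search", "{\"query\":\"agentic AI\"}", p -> "2 results");
        System.out.println(recorder.records());
    }
}
```

In a real system you would ship these records to a tracing backend rather than hold them in memory, but the interception point looks the same.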
4. Memory Management and Context Preservation
It turns out that memory management becomes really important when you work with large amounts of context. That became clear during the implementation of my proof of concept.
Key considerations:
Context window management: LLMs have finite context windows. How does the system manage information that needs to persist across multiple agent interactions?
Memory structures: Does the system support different types of memory (working memory, episodic memory, semantic memory)?
Information retrieval: Can agents effectively retrieve relevant information when needed, or do they constantly repeat work?
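A crude sketch of context window management: keep the newest messages that fit a token budget. Word counts stand in for a real tokeniser here, and the names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class ContextWindow {
    // Keep as many of the newest messages as fit within tokenBudget,
    // approximating token counts with whitespace-separated word counts.
    public static List<String> trimToFit(List<String> history, int tokenBudget) {
        Deque<String> kept = new ArrayDeque<>();
        int used = 0;
        // Walk newest-first so recent turns survive when the budget runs out.
        for (int i = history.size() - 1; i >= 0; i--) {
            int cost = history.get(i).split("\\s+").length;
            if (used + cost > tokenBudget) break;
            kept.addFirst(history.get(i));
            used += cost;
        }
        return List.copyOf(kept);
    }

    public static void main(String[] args) {
        var history = List.of("first long message here", "a reply", "latest question");
        System.out.println(trimToFit(history, 5));
    }
}
```

Real systems layer summarisation or retrieval on top of simple truncation like this, so that dropped turns aren't lost entirely.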
5. Security and Governance
Finally, there’s security and governance. Your risk appetite is a slider, most likely governed by the industry in which you operate. How do your agent frameworks and platforms assist you with this goal?
Key considerations:
Authentication and authorisation: How are agent permissions managed? What prevents an agent from executing harmful actions?
Prompt injection protection: What safeguards exist against attempts to manipulate agent behaviour through inputs?
Audit trails: Can you trace and account for all actions taken by the system?
Lessons from Building ShallowResearch
ShallowResearch is my own version of OpenAI’s DeepResearch. See what I did there?
This research multi-agent system leverages Akka’s robust concurrency model with the Model Context Protocol’s standardised tool integration approach. The system can gather information from multiple perspectives simultaneously, process it in parallel, and synthesise comprehensive results.
Key components include:
A multi-stage pipeline (Optimiser → Search → […] → Analysis → Publishing)
Specialised agents for searching, synthesising, editing and performing other tasks
Integration with external tools via MCP (such as the Brave Search API)
PDF generation capabilities
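The staged pipeline shape can be sketched as a sequence of transforms, with each agent's logic stubbed out as a string function. Stage names follow the list above; the real agents are obviously far more involved:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ResearchPipeline {
    // Run the stages in insertion order; each agent's output feeds the next.
    public static String run(String query, Map<String, UnaryOperator<String>> stages) {
        String current = query;
        for (var stage : stages.entrySet()) {
            current = stage.getValue().apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the pipeline's stage order.
        var stages = new LinkedHashMap<String, UnaryOperator<String>>();
        stages.put("optimiser", q -> "optimised(" + q + ")");
        stages.put("search", q -> "results(" + q + ")");
        stages.put("analysis", q -> "analysis(" + q + ")");
        stages.put("publishing", q -> "pdf(" + q + ")");
        System.out.println(run("topic X", stages));
    }
}
```

In the actual system the stages run as Akka actors with parallel fan-out in the search step, but the serial composition above captures the overall data flow.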
The biggest technical challenges I faced were:
Managing context effectively: Ensuring that relevant information flowed properly between agents without exceeding context windows
Tool standardisation: Creating adapters between different tool calling formats
Error recovery: Building resilient workflows that could handle failures at any stage
From what I can tell, no vendor fully solves these challenges yet. Each platform excels in certain dimensions while falling short in others. In general, they work well for a demo or POC, but building a robust solution that scales requires much more.
Whatever solution is being evaluated, it needs to make implementing these systems easier by providing specific advice, features and best practices.
The year of agentic AI
As AI agent frameworks continue to evolve, I expect to see more standardisation around tool calling interfaces and workflow orchestration patterns (OpenAI being the first big player to adopt MCP). The winners in this space will likely be those who prioritise developer experience while maintaining the flexibility needed for diverse use cases.
For organisations considering building or adopting agent frameworks, I recommend starting with a clear understanding of your specific needs rather than chasing the latest shiny technology. Many use cases don’t actually require complex agent systems, and the overhead may outweigh the benefits. Simple is usually best.
Are you pursuing AI agents because they genuinely solve your problem, or because they’re the exciting new paradigm everyone’s talking about?