As AI agents step out of the sandbox and into the real world, traditional metrics are no longer enough to evaluate their performance. Unlike simple LLM apps, agents interact with and modify their environment, requiring us to measure not just their outputs, but their actual behaviors and decision-making processes. In this post, we will introduce Agent Contracts — a new framework for understanding and verifying AI agent performance.
Traditional LLM metrics are inadequate for evaluating AI agents, particularly from a developer's perspective. Unlike simple LLM applications, agents operate in dynamic environments and take actions that modify these environments in complex ways. Traditional metrics, such as LLM-as-a-Judge approaches, fall short because they only measure what the agent outputs (like answer correctness) rather than evaluating the full scope of its actions and their consequences. This limitation creates a critical need for a new evaluation paradigm in agentic AI applications.
Consider a customer support agent handling a refund request. The agent might respond with: "Yes, the refund has been processed!". While this response appears correct, it doesn't guarantee that the refund was actually processed correctly in the system. The agent could be hallucinating its response without having taken the necessary actions. Understanding the underlying processes and actions is crucial for developers to build reliable and effective agents.
Traditional metrics also struggle with scenarios where there isn't a single "correct" output. For instance, in web search applications, the appropriate response might change constantly as web content updates. Developers need to understand not just what output their agent produces, but how it arrives at its decisions and what actions it takes along the way.
Drawing inspiration from formal methods in software engineering, we're introducing a new framework called Agent Contracts to measure and verify agentic systems. This framework provides developers with deeper insights into agent behavior and decision-making processes.
Contracts are scenario-specific, becoming relevant only when certain conditions are met, such as when a user requests a refund. This contextual approach allows for more precise evaluation of agent behavior in specific situations.
To illustrate these concepts, let's examine a customer support agent handling a refund request.
The Agent Contracts would define:
The agent must follow a specific sequence:
GetOrder
tool to retrieve accurate order detailsProcessRefund
tool with the correct order informationThis structure can be visualized as a building: Module contracts establish the rules for entering and exiting individual rooms, while Trace contracts map out the complete journey through the building.
By implementing scenario-specific contracts, developers can ensure agent reliability, traceability, and correctness, even in complex, multi-step interactions. This approach enables developers to understand not just the final outputs, but the complete decision-making process and action sequence of their agents. This deeper understanding is essential for improving agent performance, debugging issues, and building more sophisticated AI systems.
This idea builds on what I studied during my PhD on agent reliability, tackling the challenge of evaluating AI systems in real-world settings.
We're developing a library to enable Contract-based evaluation and observability for AI agents. If you're working on dynamic agents and want to explore a new standard for measuring and verifying their performance, we invite you to collaborate with us.