December 17, 2024

Agent Contracts: A New Approach to Agent Evaluation

Technical guide

As AI agents step out of the sandbox and into the real world, traditional metrics are no longer enough to evaluate their performance. Unlike simple LLM apps, agents interact with and modify their environment, requiring us to measure not just their outputs, but their actual behaviors and decision-making processes. In this post, we will introduce Agent Contracts — a new framework for understanding and verifying AI agent performance.

Why traditional metrics don't work for agents

Traditional LLM metrics are inadequate for evaluating AI agents, particularly from a developer's perspective. Unlike simple LLM applications, agents operate in dynamic environments and take actions that modify these environments in complex ways. Traditional metrics, such as LLM-as-a-Judge approaches, fall short because they only measure what the agent outputs (like answer correctness) rather than evaluating the full scope of its actions and their consequences. This limitation creates a critical need for a new evaluation paradigm in agentic AI applications.

Consider a customer support agent handling a refund request. The agent might respond with: "Yes, the refund has been processed!" While this response appears correct, it doesn't guarantee that the refund was actually processed correctly in the system. The agent could be hallucinating its response without having taken the necessary actions. Understanding the underlying processes and actions is crucial for developers to build reliable and effective agents.
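
To see the gap concretely, here is a minimal sketch (with a hypothetical in-memory orders store, purely for illustration) in which an output-only check passes while a check of the environment fails:

```python
# Hypothetical backend state and agent reply, for illustration only.
orders = {"A1001": {"refund_issued": False}}  # the refund was never actually processed
agent_reply = "Yes, the refund has been processed!"

# Output-only check (LLM-as-a-Judge style): looks at the text and passes.
output_looks_correct = (
    "refund" in agent_reply.lower() and "processed" in agent_reply.lower()
)

# Behavioral check: inspects the environment the agent was supposed to modify.
refund_actually_processed = orders["A1001"]["refund_issued"]

print(output_looks_correct)       # True  -> the answer sounds right
print(refund_actually_processed)  # False -> the system was never updated
```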

Traditional metrics also struggle with scenarios where there isn't a single "correct" output. For instance, in web search applications, the appropriate response might change constantly as web content updates. Developers need to understand not just what output their agent produces, but how it arrives at its decisions and what actions it takes along the way.

Agent Contracts: A New Evaluation Paradigm

Drawing inspiration from formal methods in software engineering, we're introducing a new framework called Agent Contracts to measure and verify agentic systems. This framework provides developers with deeper insights into agent behavior and decision-making processes.

Agent Contracts define two key levels of evaluation:

  • Module-Level Contracts: These contracts specify the expected input-output relationships, preconditions, and postconditions of individual agent actions. This granular level of specification helps developers understand and verify each component of their agent's behavior.
  • Trace-Level Contracts: These contracts capture the expected sequence of actions, mapping out the agent's complete journey from start to finish. This higher-level view ensures that agents follow appropriate processes and maintain consistency across multiple interactions.

Contracts are scenario-specific, becoming relevant only when certain conditions are met, such as when a user requests a refund. This contextual approach allows for more precise evaluation of agent behavior in specific situations.
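
To make the two levels concrete, here is a minimal sketch of how they could be represented. The class and field names are illustrative assumptions, not the API of any particular library:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModuleContract:
    """Expectations for a single action: what must hold before and after it runs."""
    name: str
    precondition: Callable[[dict], bool]   # checked on the environment before the action
    postcondition: Callable[[dict], bool]  # checked on the environment after the action

@dataclass
class TraceContract:
    """Expected sequence of tool calls across the whole interaction."""
    name: str
    expected_sequence: list[str]
    # Scenario trigger: the contract only applies when this condition is met,
    # e.g. the user asked for a refund.
    applies_when: Callable[[dict], bool] = lambda scenario: True
```

A verifier would evaluate each module contract's pre- and postconditions around the corresponding action, and compare the recorded trace against the expected sequence whenever the scenario trigger fires. The refund example below instantiates both ideas.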

Example: Customer Support AI Agent

To illustrate these concepts, let's examine a customer support agent handling a refund request.

The Agent Contracts would define:

1. Module-Level Contract

  • Precondition: User submits a refund request
  • Postcondition: Agent successfully triggers the refund process through appropriate database updates (sketched below)
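
A minimal sketch of checking this module-level contract, assuming a hypothetical environment snapshot with an orders table and a refund_issued flag (your own system's state would look different):

```python
# Hypothetical environment snapshots taken before and after the agent acts.
before = {"user_request": "refund", "orders": {"A1001": {"refund_issued": False}}}
after  = {"user_request": "refund", "orders": {"A1001": {"refund_issued": True}}}

# Precondition: the user actually submitted a refund request.
def precondition(env: dict) -> bool:
    return env["user_request"] == "refund"

# Postcondition: the refund was recorded in the backing store, not just claimed.
def postcondition(env: dict) -> bool:
    return env["orders"]["A1001"]["refund_issued"]

assert precondition(before), "contract does not apply to this scenario"
assert postcondition(after), "agent claimed a refund but the database was never updated"
```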

2. Trace-Level Contract

The agent must follow a specific sequence (see the code sketch after this list):

  1. Call the GetOrder tool to retrieve accurate order details
  2. Use the ProcessRefund tool with the correct order information
  3. Collect and document customer feedback after the refund process
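
One simple way to verify a trace-level contract like this is an in-order subsequence check over the recorded tool calls. The sketch below uses the GetOrder and ProcessRefund tools from the example plus a hypothetical CollectFeedback tool name; a production verifier would likely also check the arguments passed at each step:

```python
def satisfies_sequence(trace: list[str], expected: list[str]) -> bool:
    """Return True if the expected tool calls appear in the trace, in order
    (other tool calls may be interleaved between them)."""
    it = iter(trace)
    return all(step in it for step in expected)

expected = ["GetOrder", "ProcessRefund", "CollectFeedback"]

good_trace = ["GetOrder", "LookupPolicy", "ProcessRefund", "CollectFeedback"]
bad_trace  = ["ProcessRefund", "GetOrder", "CollectFeedback"]  # refund issued before checking the order

print(satisfies_sequence(good_trace, expected))  # True
print(satisfies_sequence(bad_trace, expected))   # False
```

The check passes as long as the required calls appear in the right order, even when other tool calls happen in between.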

This structure can be visualized as a building: Module contracts establish the rules for entering and exiting individual rooms, while Trace contracts map out the complete journey through the building.

Why this matters

By implementing scenario-specific contracts, developers can ensure agent reliability, traceability, and correctness, even in complex, multi-step interactions. This approach enables developers to understand not just the final outputs, but the complete decision-making process and action sequence of their agents. This deeper understanding is essential for improving agent performance, debugging issues, and building more sophisticated AI systems.

This idea builds on what I studied during my PhD on agent reliability, tackling the challenge of evaluating AI systems in real-world settings.

Join us in the Beta

We're developing a library to enable Contract-based evaluation and observability for AI agents. If you're working on dynamic agents and want to explore a new standard for measuring and verifying their performance, we invite you to collaborate with us.

Written by
Pasquale Antonante