Agent Evaluator

Summary: The agent evaluator performs testing to assess the agent against diverse requirements and metrics.

Context: Within an agent, the underlying foundation model and a series of components coordinate to conduct reasoning and generate responses to users’ prompts.

Problem: How to assess the performance of agents to ensure they behave as intended?

Forces:

  • Functional suitability guarantee. Agent developers need to ensure that a deployed agent operates as intended, providing complete, correct, and appropriate services to users.
  • Adaptability improvement. Agent developers need to understand and analyse how agents are used in specific scenarios in order to perform suitable adaptations.

Solution: Fig. 1 presents a simplified graphical representation of the agent evaluator. Developers can deploy the evaluator to assess the agent’s responses and reasoning process at both design time and runtime. Specifically, developers first build the evaluation pipeline, for instance by defining scenario-based requirements, metrics, and expected outputs of the agent. Given a particular context, the agent evaluator prepares context-specific test cases (either retrieved from external resources or generated by the evaluator itself) and evaluates the respective agent components. The evaluation results provide valuable feedback such as boundary cases and near-misses; developers can then fine-tune the agent or employ corresponding risk mitigation solutions, and also upgrade the evaluator based on the results. A minimal code sketch of such a pipeline is given after the figure.

Figure 1. Agent evaluator.
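
To make the pipeline concrete, the sketch below outlines a minimal design-time evaluation loop in Python. All of the names (Agent, TestCase, Metric, AgentEvaluator) and the scoring threshold are hypothetical placeholders rather than part of any existing framework; the structure simply mirrors the steps described above: prepare context-specific test cases, run the agent, score the responses against expected outputs, and report near-misses back to developers.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Agent(Protocol):
    """Hypothetical agent interface: takes a prompt, returns a response."""
    def run(self, prompt: str) -> str: ...


@dataclass
class TestCase:
    """A context-specific test case with an expected output."""
    prompt: str
    expected: str


@dataclass
class Metric:
    """A named scoring function over (expected, actual) pairs, in [0, 1]."""
    name: str
    score: Callable[[str, str], float]
    threshold: float = 0.7


@dataclass
class AgentEvaluator:
    """Runs every test case against the agent and scores each metric."""
    metrics: list[Metric]
    test_cases: list[TestCase] = field(default_factory=list)

    def evaluate(self, agent: Agent) -> list[dict]:
        results = []
        for case in self.test_cases:
            actual = agent.run(case.prompt)
            for metric in self.metrics:
                value = metric.score(case.expected, actual)
                results.append({
                    "prompt": case.prompt,
                    "metric": metric.name,
                    "score": value,
                    # Near-misses and boundary cases are fed back to developers.
                    "passed": value >= metric.threshold,
                })
        return results
```

In practice, the scoring functions could wrap the LLM-assisted metrics offered by the frameworks listed under Known uses.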

Benefits:

  • Functional suitability. Agent developers can observe the agent’s behaviour and compare actual responses with expected ones through the evaluation results.
  • Adaptability. Agent developers can analyse the evaluation results regarding scenario-based requirements, and decide whether the agent should adapt to new requirements or test cases.
  • Flexibility. Agent developers can define customised metrics and expected outputs to test a specific aspect of the agent (see the sketch after this list).
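
As one illustration of the flexibility benefit, the snippet below sketches a possible customised metric; the keyword-coverage function, the expected output, and the threshold are purely illustrative assumptions, and in practice such a function would be registered with whichever evaluation pipeline or framework the project already uses.

```python
# Hypothetical custom metric: the fraction of expected keywords that appear
# in the agent's actual response. Developers choose both the keywords and the
# pass threshold for the specific aspect of the agent under test.
def keyword_coverage(expected: str, actual: str) -> float:
    keywords = set(expected.lower().split())
    if not keywords:
        return 1.0
    hits = sum(1 for word in keywords if word in actual.lower())
    return hits / len(keywords)


expected = "request review approval payout"
actual = "First submit a request, then we review and approve it before the payout."
assert keyword_coverage(expected, actual) >= 0.75  # 3 of 4 keywords present
```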

Drawbacks:

  • Metric quantification. It is difficult to design quantified rubrics for the assessment of software quality attributes.
  • Quality of evaluation. The evaluation quality is dependent on the prepared test cases.

Known uses:

  • Inspect. The UK AI Safety Institute developed Inspect, an evaluation framework for large language models that offers a series of built-in components, including prompt engineering, tool usage, etc.
  • DeepEval. DeepEval incorporates 14 evaluation metrics and supports agent development frameworks such as LlamaIndex, Hugging Face, etc. (a brief sketch follows this list).
  • Promptfoo. Promptfoo provides efficient evaluation services with caching, concurrency, and live reloading, and also enables automated scoring based on user-defined metrics.
  • Ragas. Ragas facilitates evaluation of RAG pipelines via test dataset generation and LLM-assisted evaluation metrics.
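
As a brief sketch of the DeepEval entry above, the following mirrors the shape of DeepEval’s documented quickstart; the exact class names and the evaluate signature may differ across DeepEval versions, and the metric relies on an LLM judge (an OpenAI API key by default), so treat it as indicative rather than authoritative.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A single test case: the user's prompt and the agent's actual response.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="You have 30 days to request a full refund at no extra cost.",
)

# LLM-assisted metric with a pass/fail threshold; requires an LLM judge.
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test case and reports pass/fail per metric.
evaluate(test_cases=[test_case], metrics=[metric])
```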

Related patterns: The agent evaluator can be configured and deployed to assess the performance of other pattern-oriented agent components at both design time and runtime.