This page shows you how to evaluate AI agents. You can use the Gen AI evaluation service to measure and improve the performance, safety, and quality of your agents.
Evaluation types
| Evaluation Type | Use Case | Frequency |
|---|---|---|
| Rapid Evaluation | Testing new agent logic or model changes. | Frequent (Development) |
| Test Case Evaluation | Regression testing against a specific dataset. | Scheduled (CI/CD) |
| Online Monitoring | Tracking quality of a production agent deployment. | Continuous (Production) |
Evaluation workflow
You can evaluate your agents using the Google Cloud console or the Agent Platform SDK.
Google Cloud console
To run a basic evaluation for an agent deployment:
- In the Google Cloud console, navigate to the Agent Platform > Agents page.
- In the left navigation menu, select Deployments and select your agent.
- Select the Dashboard tab and select the Evaluation subsection.
- Click New Evaluation.
- Follow the prompts to define your test cases and select metrics.
- Click Run Evaluation.
For more detailed guides, see Run offline evaluations or Continuous evaluation with online monitors.
Agent Platform SDK
The agent improvement workflow is built on the Quality Flywheel, a continuous cycle of evaluation, analysis, and optimization. You evaluate your agent's performance, analyze the results to identify clusters of failures, and then optimize your prompts or configuration to address those issues. This iterative process helps you proactively detect and resolve performance gaps.
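The flywheel can be pictured as a simple loop. The sketch below uses hypothetical stub functions (`evaluate_agent`, `find_failure_clusters`, `optimize_prompt`) that stand in for the SDK calls covered later on this page; it only illustrates the evaluate, analyze, optimize control flow, not any real service behavior.

```python
# Illustrative Quality Flywheel loop. All three helpers are hypothetical
# stand-ins, not Agent Platform SDK APIs.

def evaluate_agent(prompt):
    # Stand-in evaluation: a case "passes" once the prompt covers its topic.
    cases = ["refund", "shipping", "returns"]
    return [(case, case in prompt) for case in cases]

def find_failure_clusters(results):
    # Stand-in analysis: collect the cases that failed.
    return [case for case, passed in results if not passed]

def optimize_prompt(prompt, clusters):
    # Stand-in optimization: patch the prompt to address each failure cluster.
    return prompt + " Handle: " + ", ".join(clusters) + "."

prompt = "You are a support agent."
for _ in range(3):  # evaluate -> analyze -> optimize, repeated
    results = evaluate_agent(prompt)
    clusters = find_failure_clusters(results)
    if not clusters:
        break  # no remaining failures; the flywheel has converged
    prompt = optimize_prompt(prompt, clusters)

print(all(passed for _, passed in evaluate_agent(prompt)))  # True
```

In this toy version one optimization pass closes every gap; in practice you would rerun the cycle each time evaluation surfaces new failure clusters.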
Before you begin
- Install the Agent Platform SDK with the required extensions:

  ```
  %pip install google-cloud-aiplatform[adk,evaluation]
  ```
- Initialize the Agent Platform SDK client:

  ```python
  import vertexai
  from vertexai import Client

  client = Client(
      project="YOUR_PROJECT_ID",
      location="YOUR_LOCATION",
  )
  ```
Where:

- YOUR_PROJECT_ID: your Google Cloud project ID.
- YOUR_LOCATION: your cloud region, for example, us-central1.
1. Define eval cases (User Simulation)
Instead of manually authoring test cases, use User Simulation to generate synthetic multi-turn conversation plans based on your agent's instructions.
```python
# 1. Generate scenarios from agent info
eval_dataset = client.evals.generate_conversation_scenarios(
    agent_info=my_agent_info,
    config={
        "count": 5,
        "generation_instruction": "Generate scenarios where a user asks for a refund.",
    },
)
```
For more information, see the Agent Platform SDK reference.
2. Run inferences
Execute the eval cases against your agent to capture Traces.
```python
# Generate behavior traces using a multi-turn user simulator
traces = client.evals.run_inference(
    agent=my_agent,
    src=eval_dataset,
    config={"user_simulator_config": {"max_turn": 5}},
)
```
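The `max_turn` setting caps how many user/agent exchanges the simulator drives per scenario. The minimal sketch below illustrates that capped loop with two hypothetical stub functions (`simulated_user`, `agent_reply`); it is not how the service is implemented, and the trace field names are illustrative.

```python
# Minimal sketch of a turn-capped simulation loop. The helpers and the
# trace schema ("role", "text") are hypothetical, not SDK APIs.
def simulated_user(turn):
    return f"user message {turn}"

def agent_reply(message):
    return f"agent reply to: {message}"

def run_simulation(max_turn):
    trace = []
    for turn in range(1, max_turn + 1):  # stop once max_turn is reached
        msg = simulated_user(turn)
        trace.append({"role": "user", "text": msg})
        trace.append({"role": "agent", "text": agent_reply(msg)})
    return trace

trace = run_simulation(max_turn=5)
print(len(trace))  # 10: five turns, two entries each
```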
3. Compute metrics (AutoRaters)
Use Multi-turn AutoRaters to score the captured traces. These raters analyze the full conversation history to verify instruction adherence and tool usage.
```python
# Evaluate the traces using multi-turn metrics
eval_result = client.evals.evaluate(
    traces=traces,
    metrics=["MULTI_TURN_TASK_SUCCESS", "MULTI_TURN_TOOL_USE_QUALITY"],
)
```
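As a rough mental model of what a tool-use metric inspects (the actual AutoRaters are LLM-based and far more nuanced), the toy check below walks a conversation trace and scores the fraction of tool calls that fall inside the agent's declared tool set. The trace schema and function are hypothetical.

```python
# Toy multi-turn tool-use check. Field names ("role", "tool") are
# illustrative, not the Agent Platform trace schema.
def tool_use_quality(trace, allowed_tools):
    calls = [t["tool"] for t in trace if t.get("role") == "tool_call"]
    if not calls:
        return 1.0  # no tool calls, nothing to penalize
    valid = sum(tool in allowed_tools for tool in calls)
    return valid / len(calls)

trace = [
    {"role": "user", "text": "I want a refund."},
    {"role": "tool_call", "tool": "lookup_order"},
    {"role": "tool_call", "tool": "issue_refund"},
    {"role": "tool_call", "tool": "delete_account"},  # out-of-scope call
]
score = tool_use_quality(trace, allowed_tools={"lookup_order", "issue_refund"})
print(score)  # 2 of 3 calls are in the allowed set
```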
4. Conduct analysis (Failure Clusters)
The system automatically groups failed evaluations into Loss Clusters to identify key agent issues.
```python
# Identify the top failure patterns in the results
loss_clusters = client.evals.generate_loss_clusters(eval_result=eval_result)
```
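To build intuition for what clustering buys you (the service's grouping is semantic and far more sophisticated), the toy version below simply buckets failed cases by a failure reason and ranks the buckets by size. The records and the "reason" field are hypothetical.

```python
from collections import Counter

# Hypothetical failed-case records; "reason" is an illustrative field.
failures = [
    {"case": 1, "reason": "wrong tool"},
    {"case": 2, "reason": "missed refund policy"},
    {"case": 3, "reason": "wrong tool"},
    {"case": 4, "reason": "wrong tool"},
]

# Group by reason and rank the clusters by how many cases they contain.
clusters = Counter(f["reason"] for f in failures).most_common()
print(clusters)  # [('wrong tool', 3), ('missed refund policy', 1)]
```

The largest cluster is the highest-leverage fix, which is what the optimization step that follows targets first.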
5. Optimize the agent
Finally, use the Optimizer service to programmatically refine your agent's system instructions or tool descriptions based on the failure data.
```python
# Automatically refine the system prompt to fix identified issues
optimize_result = client.optimizer.optimize(
    targets=["system_prompt"],
    benchmark=eval_result,
    tests=eval_dataset,
)
```
What's next
- Run offline evaluations
- View evaluation results
- Learn more about the Gen AI evaluation service

