Analyze evaluation results and failure clusters

Before you begin

Before you view and analyze evaluation results, complete the following:

Run at least one evaluation as described in Evaluate your agents or Run offline evaluations.
Configure a Cloud Storage bucket for evaluation output if you run offline evaluations.
(Optional) If you use the SDK to fetch results, ensure that your environment is authenticated.

After running an evaluation, Agent Platform provides diagnostic tools to help you identify root causes of failure. You can analyze results at three levels: aggregate trends in the dashboard, semantic groups in failure clusters, and granular logic paths in individual traces.

The evaluation dashboard for online monitors

For agents with active Online Monitors, you can view aggregate performance trends in the dashboard:

In the Google Cloud console, navigate to the Agent Platform > Agents page.
In the left navigation menu, select Deployments.
Select your agent.

Go to Deployments
Click the Dashboard tab and select the Evaluation subsection.

Performance Trends: Visualize how scores for metrics like Task Success or Tool Use Quality change across different agent versions or timeframes.
Zero State: For agents without active Online Monitors, this view identifies coverage gaps and provides a call-to-action to begin evaluation.

View evaluation results with the SDK

You can programmatically access evaluation results using the Agent Platform SDK. The SDK provides built-in interactive visualizations for Colab and Jupyter notebook environments that display both aggregate summary metrics and per-case detailed results.

After running an evaluation, call .show() on the result object to render an interactive report directly in your notebook:

    from vertexai import evals, types

    # `client` and `eval_dataset` are assumed to be defined earlier in the notebook.

    # Run an evaluation
    result = client.evals.evaluate(
        dataset=eval_dataset,
        metrics=[
            types.RubricMetric.FINAL_RESPONSE_QUALITY,
            types.RubricMetric.TOOL_USE_QUALITY,
            types.RubricMetric.HALLUCINATION,
            types.RubricMetric.SAFETY,
        ],
    )

    # Visualize aggregate and per-case results in your notebook
    result.show()

The visualization includes:

Summary metrics: Aggregate scores across all eval cases, including mean score and pass rate for each metric.
Per-case results: Individual eval case scores that you can expand to inspect detailed results.
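As a rough illustration of what the summary metrics represent, the following sketch computes a mean score and a pass rate from per-case scores. The score values, the metric keys, and the 0.5 pass threshold are assumptions made for this example; they are not taken from the SDK, and the real result object may expose these numbers differently.

```python
from statistics import mean

# Hypothetical per-case scores keyed by metric name. Real values would come
# from an evaluation result, whose exact structure is not shown here.
per_case_scores = {
    "final_response_quality": [1.0, 0.5, 0.75, 0.25],
    "tool_use_quality": [1.0, 1.0, 0.25, 0.0],
}

# Assumed pass threshold, chosen for illustration only.
PASS_THRESHOLD = 0.5


def summarize(scores, threshold=PASS_THRESHOLD):
    """Return the mean score and pass rate for one metric's per-case scores."""
    passed = sum(1 for s in scores if s >= threshold)
    return {"mean_score": mean(scores), "pass_rate": passed / len(scores)}


summary = {metric: summarize(scores) for metric, scores in per_case_scores.items()}

for metric, stats in summary.items():
    print(f"{metric}: mean={stats['mean_score']:.4f}, pass_rate={stats['pass_rate']:.2f}")
```

A metric with a reasonable mean but a low pass rate usually points to a few badly failing cases, which is a cue to expand the per-case results.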

The following example shows the summary metrics from result.show():

Figure: Evaluation summary report showing mean scores and standard deviation for each metric.

You can expand individual eval cases to see per-metric scores, rubric verdicts, and rationales:

Figure: Per-case evaluation results showing metric scores and individual rubric pass or fail verdicts with explanations.

Interpret evaluation results

Predefined metrics return results in two formats depending on the metric type:

Adaptive rubric metrics automatically generate rubrics based on the agent's configuration and the user's prompt. Each rubric receives an individual Pass or Fail verdict with a natural language rationale explaining the judge LLM's reasoning. The overall score represents the passing rate, that is, the proportion of rubrics that received a Pass verdict.
Static rubric metrics use a fixed set of evaluation criteria. For example, Hallucination segments the response into atomic claims and checks each against tool usage evidence. Safety checks for PII, hate speech, dangerous content, and other policy violations. These metrics return a single numerical score (0 to 1).
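To make the adaptive rubric score concrete, here is a minimal sketch of the passing-rate calculation described above. The RubricVerdict class and the sample rubrics are hypothetical and exist only for this example; they are not SDK types.

```python
from dataclasses import dataclass


@dataclass
class RubricVerdict:
    """Hypothetical container for one rubric's verdict; not an SDK type."""
    rubric: str
    passed: bool
    rationale: str


def passing_rate(verdicts):
    """Proportion of rubrics with a Pass verdict, in the range 0 to 1."""
    if not verdicts:
        raise ValueError("at least one rubric verdict is required")
    return sum(v.passed for v in verdicts) / len(verdicts)


# Illustrative verdicts for a single eval case.
verdicts = [
    RubricVerdict("Confirms the booking date", True, "The date was restated."),
    RubricVerdict("Calls the search tool before answering", True, "Tool call found."),
    RubricVerdict("Cites the returned price", False, "No price in the response."),
    RubricVerdict("Uses a polite tone", True, "Tone was courteous."),
]

score = passing_rate(verdicts)
print(f"Overall score: {score:.2f}")  # 3 of 4 rubrics passed
```

Because the score is a proportion, a case with only a few rubrics moves in large steps: one failed rubric out of four lowers the score by 0.25.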

Identify and triage failures

After reviewing evaluation results, the next step is to identify systemic failure patterns and triage them to improve your agent. Agent Platform provides Automatic Loss Analysis, which analyzes the pass or fail signals from rubric-based metrics, classifies failures into predefined loss patterns, and groups them into semantic clusters. This helps you understand not just that your agent failed, but why and how it failed.

Access failure clusters in the console

Navigate to the Agent Platform > Agents > Evaluation page.
Select the Evaluations tab.
Click the name of a completed evaluation run to open the report.
If the evaluation detected clusters, they are displayed in the Failure Clusterssection of the report.

Generate failure clusters with the SDK

You can also generate failure clusters programmatically using the generate_loss_clusters method:

    # Generate failure clusters from evaluation results
    loss_clusters = client.evals.generate_loss_clusters(
        eval_result=result,
    )

    # Visualize the loss pattern analysis in your notebook
    loss_clusters.show()

The following example shows the loss pattern analysis from loss_clusters.show():

Figure: Loss pattern analysis report showing failure clusters grouped by category with example scenarios and rationales.

Loss pattern taxonomies

The automatic loss analysis classifies each failure into one or more predefined loss patterns. These patterns are designed to be concrete and actionable, mapping directly to specific areas of your agent that you can improve.
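Since a single failure can map to more than one loss pattern, it can help to tally patterns across failures before deciding what to fix first. The sketch below assumes you have already extracted pattern labels into plain Python dictionaries. The label strings and the record layout are placeholders for illustration, not the official taxonomy or the SDK's output format.

```python
from collections import Counter

# Hypothetical failures: each failed eval case carries one or more loss
# pattern labels. The label strings are placeholders, not the official taxonomy.
failures = [
    {"case_id": "case-01", "patterns": ["wrong_tool_arguments"]},
    {"case_id": "case-02", "patterns": ["ignored_tool_output", "wrong_tool_arguments"]},
    {"case_id": "case-03", "patterns": ["missed_instruction"]},
    {"case_id": "case-04", "patterns": ["wrong_tool_arguments"]},
]

# Count every label once per failure it appears in.
pattern_counts = Counter(p for f in failures for p in f["patterns"])

# Most frequent patterns first; these are the likeliest systemic issues.
ranked = pattern_counts.most_common()

for pattern, count in ranked:
    print(f"{pattern}: {count} failure(s)")
```

The counts sum to more than the number of failed cases whenever a case carries several labels, so read them as pattern frequency rather than as a partition of the failures.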

There are two predefined taxonomies, each aligned to a specific metric:

Agent task success taxonomy

This taxonomy is used with the Agent Multi-turn Task Success metric (multi_turn_task_success_v1). It covers high-level agent behavioral failures across hallucination, instruction following, tool calling, tool output handling, and tool quality.

Tool use quality taxonomy

This taxonomy is used with the Agent Multi-turn Tool Use Quality metric (multi_turn_tool_use_quality_v1). It focuses specifically on tool calling correctness and tool response handling.

Recommended triage workflow

Use the following workflow to systematically triage evaluation failures:

Start with summary metrics to identify the lowest-scoring metrics across your evaluation dataset.
Drill into per-case results to find specific eval cases that failed.
Generate failure clusters to identify systemic loss patterns across failures.
Drill into traces to find the exact turn or tool call where the failure occurred. In the console, navigate to Agent Platform > Agents > Deployments, select your agent, and open the Traces tab. Select a trace to see the full conversation history and the exact sequence of model inputs, tool calls, and responses.
Identify the root cause: use the loss pattern category to determine whether the issue is a prompt problem, a tool configuration problem, or a data problem.
Apply a targeted fix to the agent's system instructions, tool definitions, or few-shot examples.
Re-run the evaluation and compare scores to verify the improvement.
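The first and last steps of this workflow come down to simple comparisons over per-metric scores. The following sketch picks the lowest-scoring metric from a baseline run and then checks a re-run for improvements and regressions. The score dictionaries are made-up values, and how you obtain them from real evaluation results is left as an assumption.

```python
# Hypothetical mean scores per metric from two evaluation runs.
baseline = {"final_response_quality": 0.75, "tool_use_quality": 0.5, "safety": 1.0}
rerun = {"final_response_quality": 0.75, "tool_use_quality": 0.875, "safety": 1.0}


def lowest_metric(scores):
    """Return the name of the metric with the lowest score."""
    return min(scores, key=scores.get)


def score_deltas(before, after):
    """Return after-minus-before for every metric present in both runs."""
    return {m: after[m] - before[m] for m in before if m in after}


# Step 1: choose where to start triage.
target = lowest_metric(baseline)

# Final step: verify the fix and make sure nothing else got worse.
deltas = score_deltas(baseline, rerun)
regressions = [m for m, d in deltas.items() if d < 0]

print(f"Triage target: {target}")
print(f"Change in {target}: {deltas[target]:+.3f}")
print(f"Regressions: {regressions or 'none'}")
```

Checking every metric in the re-run, not only the one you targeted, helps catch a fix that improves tool use while degrading response quality or safety.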
