Analyze evaluation results and failure clusters

Before you begin

To view and analyze evaluation results, ensure you have done the following:

  • Run at least one evaluation, as described in Evaluate your agents or Run offline evaluations.
  • If you are running offline evaluations, configure a Cloud Storage bucket for evaluation output.
  • (Optional) If you use the SDK to fetch results, ensure your environment is authenticated.

After running an evaluation, Agent Platform provides diagnostic tools to help you identify root causes of failure. You can analyze results at three levels: aggregate trends in the dashboard, semantic groups in failure clusters, and granular logic paths in individual traces.

The evaluation dashboard for online monitors

For agents with active Online Monitors, you can view aggregate performance trends in the dashboard:

  1. In the Google Cloud console, navigate to the Agent Platform > Agents page.
  2. In the left navigation menu, select Deployments.
  3. Select your agent.

    Go to Deployments

  4. Click the Dashboard tab and select the Evaluation subsection.

The Evaluation view includes the following:

  • Performance Trends: Visualize how scores for metrics like Task Success or Tool Use Quality change across different agent versions or timeframes.
  • Zero State: For agents without active Online Monitors, this view identifies coverage gaps and provides a call-to-action to begin evaluation.
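The trend view aggregates metric scores per agent version. As an illustration of that kind of aggregation (the record shape and metric names below are hypothetical, not an Agent Platform API):

```python
from collections import defaultdict

# Hypothetical evaluation records; field names are illustrative only.
records = [
    {"version": "v1", "metric": "task_success", "score": 0.72},
    {"version": "v1", "metric": "task_success", "score": 0.68},
    {"version": "v2", "metric": "task_success", "score": 0.81},
    {"version": "v2", "metric": "tool_use_quality", "score": 0.90},
]

def mean_score_by_version(records, metric):
    """Average one metric's scores per agent version, as a trend chart would."""
    by_version = defaultdict(list)
    for record in records:
        if record["metric"] == metric:
            by_version[record["version"]].append(record["score"])
    return {version: sum(scores) / len(scores)
            for version, scores in by_version.items()}

print(mean_score_by_version(records, "task_success"))
```

A drop in the per-version mean between `v1` and `v2` is the kind of regression the Performance Trends chart surfaces.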

Identify systemic issues with failure clusters

Agent Platform uses an Automatic Loss Analysis recipe to categorize results from rubric-based AutoRaters. This feature groups failed traces into semantic clusters (for example, Hallucinated Tool Arguments), allowing you to see which behavioral issues are most widespread.

Access failure clusters

  1. Navigate to the Agent Platform > Agents > Evaluation page.
  2. Select the Evaluations tab.
  3. Click the name of a completed evaluation run to open the report.
  4. If the evaluation detected clusters, they are displayed in the Failure Clusters section of the report.

How failure clustering works

  1. Trajectory Analysis: The system identifies specific failure points within multi-turn conversation trajectories.
  2. Semantic Embedding: Failed traces are embedded and grouped based on the semantic similarity of their rationales.
  3. Actionable Insights: The system maps the distribution of these "losses" to guide you in refining your system instructions or few-shot examples.
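The grouping in step 2 can be sketched as a toy greedy clustering over embedding vectors. The embeddings, threshold, and algorithm below are made-up stand-ins for illustration; Agent Platform's actual recipe is not exposed as an API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(embeddings, threshold=0.9):
    """Greedy clustering: each trace joins the first cluster whose seed
    embedding is similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, emb in enumerate(embeddings):
        for members in clusters:
            if cosine(embeddings[members[0]], emb) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy "rationale" embeddings: the first two failures are semantically close
# (say, both hallucinated tool arguments), the third is a different issue.
embs = [(1.0, 0.1), (0.9, 0.15), (0.05, 1.0)]
print(cluster(embs))  # two clusters: [[0, 1], [2]]
```

In the real system the vectors come from embedding each failure's rationale text, so traces that fail for the same semantic reason land in the same cluster even when their wording differs.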

Interpret verdicts and rationales

For any evaluated trace, the system provides a detailed breakdown of the "Judge LLM's" reasoning.

  • Verdict: A binary Pass or Fail status for each granular rubric check.
  • Rationale: A natural language explanation of why the agent succeeded or failed. For example, the rationale might highlight that an agent called a booking tool before verifying the user's identity.
  • Loss Category: In the cluster view, each failure is highlighted against its related rubric, providing context for the error.
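A per-rubric report row like the above could be consumed programmatically as follows. The JSON shape and rubric names here are guesses for illustration, not the service's actual export schema:

```python
import json

# Hypothetical per-rubric verdicts as they might appear in an exported report.
report = json.loads("""
[
  {"rubric": "verifies_identity_first", "verdict": "Fail",
   "rationale": "Agent called the booking tool before verifying identity."},
  {"rubric": "uses_correct_tool", "verdict": "Pass",
   "rationale": "Correct tool selected for the booking request."}
]
""")

def failed_rubrics(rows):
    """Map each rubric with a Fail verdict to its rationale text."""
    return {row["rubric"]: row["rationale"]
            for row in rows if row["verdict"] == "Fail"}

print(failed_rubrics(report))
```

Collecting rationales per failed rubric this way is one route to spotting which checks fail most often across a run.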

Drill down into traces

To troubleshoot a specific failure directly from the conversation path:

  1. In the Google Cloud console, navigate to the Agent Platform > Agents page.
  2. In the left navigation menu, select Deployments and select your agent.
  3. Select the Traces tab.
  4. Select a specific row from the table to open the details flyout.
  5. Select the Evaluation tab.

The following information is available:

  • Multi-turn Context: For session evaluations, the panel displays the full conversation history, while the evaluation view shows the corresponding metrics for each turn.
  • Trace Timeline: You can click specific Trace IDs to see the exact sequence of model inputs, tool calls, and responses for that turn.
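The timeline view orders model inputs, tool calls, and responses within each turn. A minimal sketch of that grouping, using hypothetical event fields rather than the real trace format:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical trace events; real traces carry IDs, payloads, and more.
events = [
    {"turn": 1, "ts": 0.0, "kind": "model_input"},
    {"turn": 1, "ts": 0.4, "kind": "tool_call"},
    {"turn": 1, "ts": 0.9, "kind": "response"},
    {"turn": 2, "ts": 1.2, "kind": "model_input"},
    {"turn": 2, "ts": 1.8, "kind": "response"},
]

def timeline_by_turn(events):
    """Group events by turn, ordered by timestamp within each turn."""
    ordered = sorted(events, key=itemgetter("turn", "ts"))
    return {turn: [e["kind"] for e in group]
            for turn, group in groupby(ordered, key=itemgetter("turn"))}

print(timeline_by_turn(events))
```

Reading a failed turn's sequence in this order (input, then tool calls, then response) is typically how you pinpoint where the logic path diverged.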