Offline evaluation allows you to measure the performance, safety, and quality of your agents by analyzing historical data captured during development or production. You can evaluate individual Traces (single execution paths) or full Sessions (multi-turn conversation histories) against a set of predefined or custom metrics.
Traces vs. sessions
- Trace: A factual, immutable record of the agent's behavior, including model inputs, responses, and tool calls. A trace represents a single execution path.
- Session: Encompasses the entire multi-turn interaction between a user and an agent. Use sessions to evaluate context retention and conversational flow over time.
Before you begin
To ensure you have the necessary data and environment for offline evaluation, complete the following:
- Ensure you have a working Agent Runtime deployed with Cloud Trace enabled.
- Set up a Cloud Storage bucket to store evaluation results. You only need to provide this path once; it will be pre-filled for future runs.
- If you plan to use the Agent Platform SDK for evaluation, initialize the client as described in Evaluate your agents.
Telemetry requirements
Offline evaluation requires your agent to export specific OpenTelemetry signals to provide the necessary context for assessment. These requirements are identical to those for Online Monitors:
- Invoke agent span: Must include the following attributes:
  - `gen_ai.agent.name`: The identifier for the agent.
  - `gen_ai.agent.description`: A brief description of the agent's purpose.
  - `gen_ai.conversation.id`: A unique identifier for the specific conversation session.
- Inference events: The `gen_ai.client.inference.operation.details` event must capture:
  - `gen_ai.input.messages`: The prompts sent to the agent.
  - `gen_ai.output.messages`: The responses generated by the agent.
  - `gen_ai.system_instructions`: The underlying system prompts.
  - `gen_ai.tool.definitions`: Metadata about any tools available to the agent.
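As a quick sanity check, the required span attributes and event fields above can be modeled and validated in plain Python. This is an illustrative sketch only; the helper and the sample values are hypothetical and not part of any SDK:

```python
# Hypothetical helper that checks whether a span's attributes and its
# inference event carry the fields offline evaluation needs.
REQUIRED_SPAN_ATTRIBUTES = {
    "gen_ai.agent.name",
    "gen_ai.agent.description",
    "gen_ai.conversation.id",
}

REQUIRED_EVENT_FIELDS = {
    "gen_ai.input.messages",
    "gen_ai.output.messages",
    "gen_ai.system_instructions",
    "gen_ai.tool.definitions",
}

def missing_telemetry(span_attributes: dict, event_attributes: dict) -> list[str]:
    """Return the names of any required attributes or fields that are absent."""
    missing = sorted(REQUIRED_SPAN_ATTRIBUTES - span_attributes.keys())
    missing += sorted(REQUIRED_EVENT_FIELDS - event_attributes.keys())
    return missing

# Example: a span with all attributes, but an event missing tool definitions.
span = {
    "gen_ai.agent.name": "support-agent",
    "gen_ai.agent.description": "Answers billing questions",
    "gen_ai.conversation.id": "conv-123",
}
event = {
    "gen_ai.input.messages": [],
    "gen_ai.output.messages": [],
    "gen_ai.system_instructions": "",
}
print(missing_telemetry(span, event))  # -> ['gen_ai.tool.definitions']
```

A trace or session that fails such a check will evaluate with incomplete context, so it is worth verifying before running an assessment.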
If you are using the Agent Development Kit, you must enable these telemetry capabilities by setting the following environment variables:

```shell
OTEL_SEMCONV_STABILITY_OPT_IN='gen_ai_latest_experimental'
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT='EVENT_ONLY'
```
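If you launch the agent process from Python, the same two variables can be set programmatically before the instrumentation loads. A minimal sketch, with values taken from the listing above:

```python
import os

# Set the telemetry opt-ins before importing any instrumented modules;
# the values mirror the environment variables listed above.
os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"] = "gen_ai_latest_experimental"
os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "EVENT_ONLY"

print(os.environ["OTEL_SEMCONV_STABILITY_OPT_IN"])  # -> gen_ai_latest_experimental
```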
Recording media in Cloud Storage
If your agent uses multimodal data, such as images or large documents, then we recommend recording the inputs and outputs in a Cloud Storage bucket instead of embedding them directly in trace spans. Configure the following environment variables to enable this:

```shell
OTEL_INSTRUMENTATION_GENAI_UPLOAD_FORMAT='jsonl'
OTEL_INSTRUMENTATION_GENAI_COMPLETION_HOOK='upload'
OTEL_INSTRUMENTATION_GENAI_UPLOAD_BASE_PATH='gs://STORAGE_BUCKET_NAME/PATH'
```
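The upload base path must be a `gs://` Cloud Storage URI. A small hypothetical helper can assemble and sanity-check it before you export the variable; the function and the bucket/path values below are illustrative placeholders, not part of any SDK:

```python
def gcs_base_path(bucket: str, path: str) -> str:
    """Join a bare bucket name and an object prefix into a gs:// URI (illustrative)."""
    if not bucket or "/" in bucket:
        raise ValueError("bucket must be a bare bucket name, without slashes")
    return f"gs://{bucket}/{path.strip('/')}"

print(gcs_base_path("my-eval-bucket", "agent-media/"))  # -> gs://my-eval-bucket/agent-media
```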
For more information, see Collect multimodal prompts and responses.
Create an evaluation from the registry
- In the Google Cloud console, navigate to the Agent Platform > Agents > Evaluation page.
- Click New evaluation.
- Select the Traces or Sessions tab based on your assessment goal.
- Use the filter icon and time picker to filter data (for example, by Version or "Last 2 weeks") and select the specific IDs you want to evaluate.
- Click Continue.
- (Optional) In the Evaluation name field, enter a name for the assessment or use the pre-filled default.
- In the Output private data path field, enter your Cloud Storage bucket URI. After the first use, this path is pre-filled for future runs.
- By default, all four core metrics are added. You can add or remove metrics as needed.
- Click Evaluate agent.
Evaluate a single trace or session
You can trigger evaluations directly while inspecting individual execution paths:
- In the Google Cloud console, navigate to the Agent Platform > Agents page.
- In the left navigation menu, select Deployments.
- Select your agent.
- Select the Traces tab.
- Click Session view or Trace view to inspect the execution path.
- Select a specific row from the table to open the details panel.
- Select the Evaluation tab.
- If the trace or session has not been evaluated, click Evaluate to run an ad-hoc assessment.
View evaluation results
After the evaluation completes, you can analyze the results to identify performance gaps and systemic issues:
- View results for a run: In the Google Cloud console, go to the Agent Platform > Agents > Evaluation page and select the Evaluations tab. Click an evaluation name to view the detailed report.
- Drill down to traces: From a results report, click any row to navigate directly to the associated trace and inspect the reasoning (rationales) behind the scores.

For more information, see Analyze evaluation results.

