EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)
A class representing an EvalTask.
An evaluation task is defined to measure a model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of a model's generated text.
Dataset Details:
Default dataset column names:
* prompt_column_name: "prompt"
* reference_column_name: "reference"
* response_column_name: "response"
* baseline_model_response_column_name: "baseline_model_response"
Requirements for different use cases:
* Bring-your-own-response: A `response` column is required. The response
  column name can be customized by providing the `response_column_name`
  parameter (see the sketch after this list). If a pairwise metric is used
  and a baseline model is not provided, a `baseline_model_response` column
  is required; its name can be customized by providing the
  `baseline_model_response_column_name` parameter. If the `response` column
  or `baseline_model_response` column is present while the corresponding
  model is also specified, an error is raised.
* Perform model inference without a prompt template: A `prompt` column
in the evaluation dataset representing the input prompt to the
model is required and is used directly as input to the model.
* Perform model inference with a prompt template: Evaluation dataset
must contain column names corresponding to the variable names in
the prompt template. For example, if prompt template is
"Instruction: {instruction}, context: {context}", the dataset must
contain `instruction` and `context` columns.
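For example, a minimal sketch of a bring-your-own-response evaluation that maps custom response column names through the `response_column_name` and `baseline_model_response_column_name` parameters of `evaluate()`; the `model_output` and `baseline_output` column names are hypothetical placeholders:
```
import pandas as pd

from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

eval_dataset = pd.DataFrame({
    "prompt": [...],
    "reference": [...],
    "model_output": [...],       # hypothetical custom response column
    "baseline_output": [...],    # hypothetical custom baseline response column
})
eval_result = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum", MetricPromptTemplateExamples.Pairwise.SAFETY],
).evaluate(
    response_column_name="model_output",
    baseline_model_response_column_name="baseline_output",
)
```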
Metrics Details:
Descriptions of the supported metrics, their rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page:
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
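In addition to the built-in metric names, the `metrics` list accepts metric class instances such as `vertexai.evaluation.CustomMetric` (see the signature above). A minimal sketch, assuming `CustomMetric` takes a metric name and a function that receives a dataset row as a dict and returns a dict keyed by the metric name; the `word_count` metric is a hypothetical example:
```
import pandas as pd

from vertexai.evaluation import CustomMetric, EvalTask

eval_dataset = pd.DataFrame({
    "prompt": [...],
    "response": [...],
})

def word_count(instance: dict) -> dict:
    # Hypothetical custom metric: score each row by the response word count.
    return {"word_count": len(instance["response"].split())}

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[CustomMetric(name="word_count", metric_function=word_count), "exact_match"],
)
```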
Usage Examples:
1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, provide the baseline model responses in the
`baseline_model_response` column.
```
eval_dataset = pd.DataFrame({
"prompt" : [...],
"reference": [...],
"response" : [...],
"baseline_model_response": [...],
})
eval_task = EvalTask(
dataset=eval_dataset,
metrics=[
"bleu",
"rouge_l_sum",
MetricPromptTemplateExamples.Pointwise.FLUENCY,
MetricPromptTemplateExamples.Pairwise.SAFETY
],
experiment="my-experiment",
)
eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
```
2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance. The input column name to the
model is `prompt` and must be present in the dataset.
```
eval_dataset = pd.DataFrame({
"reference": [...],
"prompt" : [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
experiment="my-experiment",
).evaluate(
model=GenerativeModel("gemini-1.5-pro"),
experiment_run_name="gemini-eval-run"
)
```
3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.
```
eval_dataset = pd.DataFrame({
"context" : [...],
"instruction": [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
).evaluate(
model=GenerativeModel("gemini-1.5-pro"),
prompt_template="{instruction}. Article: {context}. Summary:",
)
```
4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.
```
from openai import OpenAI
client = OpenAI()
def custom_model_fn(input: str) -> str:
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": input}
]
)
return response.choices[0].message.content
eval_dataset = pd.DataFrame({
"prompt" : [...],
"reference": [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
experiment="my-experiment",
).evaluate(
model=custom_model_fn,
experiment_run_name="gpt-eval-run"
)
```
5. To perform pairwise metric evaluation with a model inference step, specify
the `baseline_model` input to a `PairwiseMetric` instance and the candidate
`model` input to the `EvalTask.evaluate()` function. The input column name
to both models is `prompt` and must be present in the dataset.
```
baseline_model = GenerativeModel("gemini-1.0-pro")
candidate_model = GenerativeModel("gemini-1.5-pro")
pairwise_groundedness = PairwiseMetric(
metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
"pairwise_groundedness"
),
baseline_model=baseline_model,
)
eval_dataset = pd.DataFrame({
"prompt" : [...],
})
result = EvalTask(
dataset=eval_dataset,
metrics=[pairwise_groundedness],
experiment="my-pairwise-experiment",
).evaluate(
model=candidate_model,
experiment_run_name="gemini-pairwise-eval-run",
)
```
Properties
dataset
Returns the evaluation dataset.
experiment
Returns the experiment name.
metrics
Returns the metrics.
Methods
EvalTask
EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)
Initializes an EvalTask.
display_runs
display_runs()
Displays experiment runs associated with this EvalTask.
evaluate
evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel,
            typing.Callable[[str], str],
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    baseline_model_response_column_name: typing.Optional[str] = None,
    evaluation_service_qps: typing.Optional[float] = None,
    retry_timeout: float = 600.0,
    output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult
Runs an evaluation for the EvalTask.
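A minimal sketch of calling `evaluate()` with the throughput and output options above and inspecting the result, reusing the `eval_task` from the usage examples; the `summary_metrics` and `metrics_table` attributes of `EvalResult` are assumptions based on the SDK and are not documented on this page:
```
from vertexai.generative_models import GenerativeModel

eval_result = eval_task.evaluate(
    model=GenerativeModel("gemini-1.5-pro"),
    experiment_run_name="eval-run-with-output",
    evaluation_service_qps=1.0,   # assumed to cap queries per second to the evaluation service
    retry_timeout=300.0,          # how long to keep retrying evaluation requests, in seconds
    output_file_name="eval_results.json",  # hypothetical name; used together with `output_uri_prefix`
)
# Assumed EvalResult fields: aggregate scores and the per-row results table.
print(eval_result.summary_metrics)
print(eval_result.metrics_table)
```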