Class EvalTask (1.67.1)

```
EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)
```

A class representing an EvalTask.

An evaluation task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. An evaluation task must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.

Dataset Details:

 Default dataset column names:
    * prompt_column_name: "prompt"
    * reference_column_name: "reference"
    * response_column_name: "response"
    * baseline_model_response_column_name: "baseline_model_response"

Requirements for different use cases:
  * Bring-your-own-response: A `response` column is required. Response
      column name can be customized by providing `response_column_name`
      parameter. If a pairwise metric is used and a baseline model is
      not provided, a `baseline_model_response` column is required.
      Baseline model response column name can be customized by providing
      `baseline_model_response_column_name` parameter. If the `response`
      column or `baseline_model_response` column is present and the
      corresponding model is also specified, an error will be raised.
  * Perform model inference without a prompt template: A `prompt` column
      in the evaluation dataset representing the input prompt to the
      model is required and is used directly as input to the model.
  * Perform model inference with a prompt template: Evaluation dataset
      must contain column names corresponding to the variable names in
      the prompt template. For example, if prompt template is
      "Instruction: {instruction}, context: {context}", the dataset must
      contain `instruction` and `context` columns. 
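
If an existing dataset already uses different column names, the `metric_column_mapping` constructor parameter can map the expected metric input variable names to those columns. A minimal sketch, assuming the mapping direction is metric input variable name to dataset column name (the column names below are illustrative):

```
import pandas as pd
from vertexai.evaluation import EvalTask

# Dataset whose columns do not use the default names; column names are illustrative.
eval_dataset = pd.DataFrame({
    "question": [...],       # used as the prompt
    "ground_truth": [...],   # used as the reference
    "model_output": [...],   # used as the response
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],
    # Assumed mapping: metric input variable name -> dataset column name.
    metric_column_mapping={
        "prompt": "question",
        "reference": "ground_truth",
        "response": "model_output",
    },
)
```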

Metrics Details:

 Descriptions of the supported metrics, their rating rubrics, and the required
input variables can be found on the Vertex AI public documentation page
[Evaluation methods and metrics](https://cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval).
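
In addition to the built-in computation-based metrics listed in the constructor signature, a `vertexai.evaluation.CustomMetric` can wrap a user-defined scoring function. A minimal sketch, assuming the function receives one dataset row as a dict and returns a dict keyed by the metric name:

```
import pandas as pd
from vertexai.evaluation import CustomMetric, EvalTask

def word_count(instance: dict) -> dict:
    # `instance` is one row of the evaluation dataset; the returned dict is
    # assumed to be keyed by the metric name.
    return {"word_count": float(len(instance["response"].split()))}

word_count_metric = CustomMetric(name="word_count", metric_function=word_count)

eval_task = EvalTask(
    dataset=pd.DataFrame({"prompt": [...], "response": [...]}),
    metrics=[word_count_metric],
)
```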

Usage Examples:

1. To perform bring-your-own-response (BYOR) evaluation, provide the model
responses in the `response` column in the dataset. If a pairwise metric is
used for BYOR evaluation, provide the baseline model responses in the
`baseline_model_response` column.

  ```
  eval_dataset = pd.DataFrame({
      "prompt": [...],
      "reference": [...],
      "response": [...],
      "baseline_model_response": [...],
  })
  eval_task = EvalTask(
      dataset=eval_dataset,
      metrics=[
          "bleu",
          "rouge_l_sum",
          MetricPromptTemplateExamples.Pointwise.FLUENCY,
          MetricPromptTemplateExamples.Pairwise.SAFETY,
      ],
      experiment="my-experiment",
  )
  eval_result = eval_task.evaluate(experiment_run_name="eval-experiment-run")
  ```

2. To perform evaluation with Gemini model inference, specify the `model`
parameter with a `GenerativeModel` instance.  The input column name to the
model is `prompt` and must be present in the dataset.

  ```
  eval_dataset = pd.DataFrame({
        "reference": [...],
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=["exact_match", "bleu", "rouge_1", "rouge_l_sum"],
      experiment="my-experiment",
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      experiment_run_name="gemini-eval-run"
  )
  ```

3. If a `prompt_template` is specified, the `prompt` column is not required.
Prompts can be assembled from the evaluation dataset, and all prompt
template variable names must be present in the dataset columns.

  ```
  eval_dataset = pd.DataFrame({
      "context"    : [...],
      "instruction": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SUMMARIZATION_QUALITY],
  ).evaluate(
      model=GenerativeModel("gemini-1.5-pro"),
      prompt_template="{instruction}. Article: {context}. Summary:",
  )
  ```

4. To perform evaluation with custom model inference, specify the `model`
parameter with a custom inference function. The input column name to the
custom inference function is `prompt` and must be present in the dataset.

  ```
  from openai import OpenAI
  client = OpenAI()
  def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "user", "content": input}
      ]
    )
    return response.choices[0].message.content

  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
        "reference": [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[MetricPromptTemplateExamples.Pointwise.SAFETY],
      experiment="my-experiment",
  ).evaluate(
      model=custom_model_fn,
      experiment_run_name="gpt-eval-run"
  )
  ```

5. To perform pairwise metric evaluation with a model inference step, specify
the `baseline_model` input to a `PairwiseMetric` instance and the candidate
`model` input to the `EvalTask.evaluate()` function. The input column name
to both models is `prompt` and must be present in the dataset.

  ```
  baseline_model = GenerativeModel("gemini-1.0-pro")
  candidate_model = GenerativeModel("gemini-1.5-pro")

  pairwise_groundedness = PairwiseMetric(
      metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
          "pairwise_groundedness"
      ),
      baseline_model=baseline_model,
  )
  eval_dataset = pd.DataFrame({
        "prompt"  : [...],
  })
  result = EvalTask(
      dataset=eval_dataset,
      metrics=[pairwise_groundedness],
      experiment="my-pairwise-experiment",
  ).evaluate(
      model=candidate_model,
      experiment_run_name="gemini-pairwise-eval-run",
  )
  ``` 
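
Each `evaluate()` call above returns a `vertexai.evaluation.EvalResult`. A brief sketch of inspecting it, assuming `result` is the object returned by one of the calls above and that its `summary_metrics` and `metrics_table` attributes hold the aggregate and per-row results:

```
# summary_metrics is a dict of aggregate scores; metrics_table is a pandas
# DataFrame with one row per evaluated instance.
print(result.summary_metrics)
print(result.metrics_table.head())
```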

Properties

dataset

Returns evaluation dataset.

experiment

Returns experiment name.

metrics

Returns metrics.

Methods

EvalTask

```
EvalTask(
    *,
    dataset: typing.Union[pd.DataFrame, str, typing.Dict[str, typing.Any]],
    metrics: typing.List[
        typing.Union[
            typing.Literal[
                "exact_match",
                "bleu",
                "rouge_1",
                "rouge_2",
                "rouge_l",
                "rouge_l_sum",
                "tool_call_valid",
                "tool_name_match",
                "tool_parameter_key_match",
                "tool_parameter_kv_match",
            ],
            vertexai.evaluation.CustomMetric,
            vertexai.evaluation.metrics._base._AutomaticMetric,
            vertexai.evaluation.metrics.pointwise_metric.PointwiseMetric,
            vertexai.evaluation.metrics.pairwise_metric.PairwiseMetric,
        ]
    ],
    experiment: typing.Optional[str] = None,
    metric_column_mapping: typing.Optional[typing.Dict[str, str]] = None,
    output_uri_prefix: typing.Optional[str] = ""
)
```

Initializes an EvalTask.
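
A minimal constructor sketch showing the dataset passed as a Cloud Storage path and results written under a Cloud Storage prefix. The bucket paths are placeholders, and the string form of `dataset` is assumed to accept a JSONL/CSV file URI:

```
from vertexai.evaluation import EvalTask

eval_task = EvalTask(
    dataset="gs://my-bucket/eval_dataset.jsonl",   # placeholder path (assumed JSONL/CSV URI)
    metrics=["bleu", "rouge_l_sum"],
    experiment="my-experiment",
    output_uri_prefix="gs://my-bucket/eval_output/",  # placeholder output location
)
```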

display_runs

```
display_runs()
```

Displays experiment runs associated with this EvalTask.

evaluate

```
evaluate(
    *,
    model: typing.Optional[
        typing.Union[
            vertexai.generative_models.GenerativeModel,
            typing.Callable[[str], str],
        ]
    ] = None,
    prompt_template: typing.Optional[str] = None,
    experiment_run_name: typing.Optional[str] = None,
    response_column_name: typing.Optional[str] = None,
    baseline_model_response_column_name: typing.Optional[str] = None,
    evaluation_service_qps: typing.Optional[float] = None,
    retry_timeout: float = 600.0,
    output_file_name: typing.Optional[str] = None
) -> vertexai.evaluation.EvalResult
```

Runs an evaluation for the EvalTask.
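
A hedged sketch of an `evaluate()` call that also sets the throttling, retry, and output-file parameters from the signature above. Here `eval_task` is an EvalTask built as in the earlier examples, and the values shown are illustrative:

```
from vertexai.generative_models import GenerativeModel

result = eval_task.evaluate(
    model=GenerativeModel("gemini-1.5-pro"),
    experiment_run_name="throttled-eval-run",
    evaluation_service_qps=1.0,   # cap requests per second to the evaluation service
    retry_timeout=300.0,          # stop retrying failed requests after 300 seconds
    output_file_name="eval_results.json",  # assumed to be written under output_uri_prefix
)
```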
