Summarization automatic evaluation metrics

Summarization automatic evaluation (autoevaluation) assesses the quality of AI-generated summaries on three metrics: accuracy, adherence, and completeness.

Accuracy

Accuracy measures how closely a summary aligns with the factual details of the conversation transcript. For each summary, the autoevaluation determines a correctness percentage, along with a corresponding justification. A low accuracy score means there are factual problems in the summary.

Accuracy results look like the following:

{
  "decomposition": [
    {
      "point": "The customer wants to cancel their subscription.",
      "accuracy": "This is accurate. The customer calls to get support of cancelling their subscription.",
      "is_accurate": true
    },
    {
      "point": "The customer asks about a $30 credit.",
      "accuracy": "This is inaccurate. The customer mentioned $10.",
      "is_accurate": false
    }
  ]
}
  • Each point in the preceding example is a decomposed part of the summary. The binary parameter is_accurate displays the accuracy evaluation result. The accuracy parameter provides the justification.
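
If you process these results programmatically, one way to aggregate the decomposition into a single score is to take the share of points marked accurate. The following is a minimal sketch, assuming that simple aggregation; the accuracy_percentage name is illustrative and not part of any API:

  import json

  def accuracy_percentage(result_json: str) -> float:
      """Return the share of decomposed summary points marked accurate, as a percentage."""
      points = json.loads(result_json)["decomposition"]
      if not points:
          return 0.0
      accurate = sum(1 for point in points if point["is_accurate"])
      return 100.0 * accurate / len(points)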

Adherence

Summarization autoevaluation applies a set of questions to the provided summary. It uses these questions and the conversation transcript to assess the summary's compliance with each instruction. Because autoevaluation relies on Gemini, which might not accurately verify grammatical instructions, it might not accurately assess whether a summary adheres to grammatical instructions.

A low adherence score means that the summary fails to follow the instructions provided in the summary section's definition. Only summaries that use custom sections can receive an adherence score.

For adherence, summarization autoevaluation recognizes the following two types of summary tasks:

  • Categorical summaries: Provide a categorical value defined in the instructions. For example, the instructions ask for a Sunny or Cloudy response. Autoevaluation checks whether the summary provided only Sunny or Cloudy without descriptive text.
  • Noncategorical summaries: Provide free-form text. Autoevaluation checks whether a noncategorical summary follows the instructions defined in the task description.

Adherence results look like the following:

(Categorical):
{
  "rubrics": [
    {
      "question": "Does the summary follow the instruction and return only one of the allowed categorical values?",
      "reasoning": "The summary is not a categorical value. It contains descriptive text instead of providing only one of the allowed categorical values.",
      "is_addressed": "False"
    }
  ]
}

(Noncategorical):
{
  "rubrics": [
    {
      "question": "Does the summary follow the instruction 'State the product name being returned'?",
      "reasoning": "Summary followed instruction. It correctly stated the product name, for example: 'return the \\'Stealth Bomber X5\\' gaming mouse'.",
      "is_addressed": "True"
    }
  ]
}
  • Each question is derived from the provided summary section definition. The binary parameter is_addressed displays the adherence evaluation result. The reasoning parameter provides a justification.

  • If any questions aren't aligned with your goal, the summary section definition for that goal was unclear. Use the questions to identify the issue and improve your section definitions.
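
If you track adherence programmatically, the rubrics can be reduced to a single score by counting the questions that were addressed. The following is a minimal sketch, assuming is_addressed is returned as the string "True" or "False" as in the preceding examples; the adherence_score name is illustrative:

  import json

  def adherence_score(result_json: str) -> float:
      """Return the fraction of adherence rubric questions marked as addressed."""
      rubrics = json.loads(result_json)["rubrics"]
      if not rubrics:
          return 0.0
      addressed = sum(1 for rubric in rubrics if rubric["is_addressed"] == "True")
      return addressed / len(rubrics)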

Completeness

Summarization autoevaluation applies a set of rubrics to assess the completeness of an AI-generated summary based on the instructions in the summary's section definition. A low completeness score means the summary failed to include the important information from the transcript.

Completeness results look like the following:

{
  "rubrics": [
    {
      "question": "Does the summary identify that the customer initially considered cancelling their subscription?",
      "is_addressed": "True"
    },
    {
      "question": "Does the summary identify that the customer inquired about a previously issued credit?",
      "is_addressed": "False"
    },
    {
      "question": "Does the summary mention the specific amount of the credit ($20)?",
      "is_addressed": "False"
    }
  ]
}
  • Each question is derived from the provided task description and transcript. The binary parameter is_addressed displays the evaluation result.

  • If any of the questions aren't aligned with your goal, your summary's section definition was unclear. Use the questions to identify the issue and improve your section definition.
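
When a completeness score is low, the unaddressed rubric questions point to the information the summary omitted. The following is a minimal sketch for extracting those questions, assuming the result format shown above; the missing_points name is illustrative:

  import json

  def missing_points(result_json: str) -> list[str]:
      """Return the completeness rubric questions the summary failed to address."""
      rubrics = json.loads(result_json)["rubrics"]
      return [rubric["question"] for rubric in rubrics if rubric["is_addressed"] != "True"]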
