Summarization

The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within deepeval, the original text is the input while the summary is the actual_output.

Required Arguments

To use the SummarizationMetric, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output

Example

Let's take this input and actual_output as an example:

# This is the original text to be summarized
input = """
The 'inclusion score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher inclusion score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The inclusion score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

You can use the SummarizationMetric as follows:

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the inclusion score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

metric.measure(test_case)
print(metric.score)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

There are four optional parameters when creating a SummarizationMetric; see the sketch after this list for an instantiation that sets all of them:

  • [Optional] threshold: the passing threshold, defaulted to 0.5.
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any one of langchain's chat models of type BaseChatModel. Defaulted to 'gpt-4-1106-preview'.
  • [Optional] assessment_questions: a list of closed-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you would ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If assessment_questions is not provided, we will generate a set of assessment_questions for you at evaluation time. The assessment_questions are used to calculate the inclusion_score.
  • [Optional] n: the number of questions to generate when calculating the alignment_score and inclusion_score, defaulted to 5.
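
As a rough illustration, here is a SummarizationMetric created with all four optional parameters set explicitly. The values are arbitrary placeholders rather than recommended settings:

from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.6,  # passing threshold
    model="gpt-4",  # evaluation model
    assessment_questions=[
        "Does the summary explain how the inclusion score is calculated?",
    ],
    n=3,  # number of questions to generate for the alignment and inclusion scores
)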

How is it calculated?

In deepeval, we judge summarization by taking the minimum of the two distinct scores:

  • alignment_score: determines whether the summary contains information that is hallucinated or contradicts the original text.
  • inclusion_score: determines whether the summary contains the necessary information from the original text.

These scores are calculated by generating n closed-ended questions that can only be answered with either a 'yes' or a 'no', and computing the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how deepeval's summarization metric was built.
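
To make the calculation concrete, here is a minimal sketch of how such a ratio could be computed. It is purely illustrative and does not mirror deepeval's internal implementation; the yes/no answers are hypothetical:

# Hypothetical answers to n = 5 generated closed-ended questions,
# answered once against the original text and once against the summary.
original_answers = ["yes", "yes", "no", "yes", "yes"]
summary_answers = ["yes", "yes", "no", "no", "yes"]

# Ratio of questions on which the original text and the summary agree.
matches = sum(o == s for o, s in zip(original_answers, summary_answers))
score = matches / len(original_answers)
print(score)  # 0.8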

You can access the alignment_score and inclusion_score from a SummarizationMetric as follows:

from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.alignment_score)
print(metric.inclusion_score)

Note

Since the summarization score is the minimum of the alignment_score and inclusion_score, a 0 value for either one of these scores will result in a final summarization score of 0.
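
For example, a summary that is perfectly aligned with the source but omits every key detail still receives a final score of 0:

# Illustrative values only
alignment_score = 1.0
inclusion_score = 0.0
print(min(alignment_score, inclusion_score))  # final summarization score: 0.0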