Summarization

The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text. In a summarization task within deepeval, the original text is the input while the summary is the actual_output.

Required Arguments

To use the SummarizationMetric, you'll have to provide the following arguments when creating an LLMTestCase:

  • input
  • actual_output

Example

Let's take this input and actual_output as an example:

# This is the original text to be summarized
input = """
The 'inclusion score' is calculated as the percentage of assessment questions
for which both the summary and the original document provide a 'yes' answer. This
method ensures that the summary not only includes key information from the original
text but also accurately represents it. A higher inclusion score indicates a
more comprehensive and faithful summary, signifying that the summary effectively
encapsulates the crucial points and details from the original content.
"""

# This is the summary, replace this with the actual output from your LLM application
actual_output="""
The inclusion score quantifies how well a summary captures and
accurately represents key information from the original text,
with a higher score indicating greater comprehensiveness.
"""

You can use the SummarizationMetric as follows:

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(input=input, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4",
    assessment_questions=[
        "Is the inclusion score based on a percentage of 'yes' answers?",
        "Does the score ensure the summary's accuracy with the source?",
        "Does a higher score mean a more comprehensive summary?"
    ]
)

metric.measure(test_case)
print(metric.score)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

There are four optional parameters when creating a SummarizationMetric; see the sketch after this list for an instantiation that sets all of them:

  • [Optional] threshold: the passing threshold, defaulted to 0.5.
  • [Optional] model: a string specifying which of OpenAI's GPT models to use, OR any one of langchain's chat models of type BaseChatModel. Defaulted to 'gpt-4-1106-preview'.
  • [Optional] assessment_questions: a list of closed-ended questions that can be answered with either a 'yes' or a 'no'. These are questions you would ideally want your summary to be able to answer, and they are especially helpful if you already know what a good summary for your use case looks like. If assessment_questions is not provided, we will generate a set of assessment_questions for you at evaluation time. The assessment_questions are used to calculate the inclusion_score.
  • [Optional] n: the number of questions to generate when calculating the alignment_score and inclusion_score, defaulted to 5.
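
As a rough illustration, here is a SummarizationMetric created with all four optional parameters set explicitly. The values are arbitrary placeholders rather than recommended settings:

from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.6,  # passing threshold
    model="gpt-4",  # evaluation model
    assessment_questions=[
        "Does the summary explain how the inclusion score is calculated?",
    ],
    n=3,  # number of questions to generate for the alignment and inclusion scores
)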

How is it calculated?

In deepeval, we judge summarization by taking the minimum of the two distinct scores:

  • alignment_score: determines whether the summary contains information that is hallucinated or contradicts the original text.
  • inclusion_score: determines whether the summary contains the necessary information from the original text.

These scores are calculated by generating n closed-ended questions that can only be answered with either a 'yes' or a 'no', and computing the ratio of questions for which the original text and the summary yield the same answer. Here is a great article on how deepeval's summarization metric was built.
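
To make the calculation concrete, here is a minimal sketch of how such a ratio could be computed. It is purely illustrative and does not mirror deepeval's internal implementation; the yes/no answers are hypothetical:

# Hypothetical answers to n = 5 generated closed-ended questions,
# answered once against the original text and once against the summary.
original_answers = ["yes", "yes", "no", "yes", "yes"]
summary_answers = ["yes", "yes", "no", "no", "yes"]

# Ratio of questions on which the original text and the summary agree.
matches = sum(o == s for o, s in zip(original_answers, summary_answers))
score = matches / len(original_answers)
print(score)  # 0.8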

You can access the alignment_score and inclusion_score from a SummarizationMetric as follows:

from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(...)
metric = SummarizationMetric(...)

metric.measure(test_case)
print(metric.score)
print(metric.alignment_score)
print(metric.inclusion_score)

Note

Since the summarization score is the minimum of the alignment_score and inclusion_score, a 0 value for either one of these scores will result in a final summarization score of 0.
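
For example, a summary that is perfectly aligned with the source but omits every key detail still receives a final score of 0:

# Illustrative values only
alignment_score = 1.0
inclusion_score = 0.0
print(min(alignment_score, inclusion_score))  # final summarization score: 0.0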