Analyzing Evals
Quick Summary
Confident AI keeps track of your evaluation histories in both development and deployment and allows you to:
- visualize evaluation results
- compare and select optimal hyperparameters (e.g. prompt templates, model used, etc.) for each test run
Visualize Evaluation Results
Once logged in via `deepeval login`, all evaluations executed using `deepeval test run`, `evaluate(dataset, metrics)`, or `dataset.evaluate(metrics)` will automatically have their results available on Confident.
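For instance, here is a minimal sketch of such an evaluation using a single hallucination test case (the input, output, and context values below are purely illustrative):

```python
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# A purely illustrative single-test-case dataset; use your own inputs,
# outputs, and contexts in practice.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    context=["Paris is the capital and largest city of France."],
)
dataset = EvaluationDataset(test_cases=[test_case])

# Once you're logged in via `deepeval login`, the results of this call
# become available on Confident automatically.
evaluate(dataset, [HallucinationMetric()])
```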
Compare Hyperparameters
Begin by associating hyperparameters with each test run:
test_example.py
```python
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric


def test_hallucination():
    metric = HallucinationMetric()
    test_case = LLMTestCase(...)
    assert_test(test_case, [metric])


# Although the values in this example are hardcoded,
# you should ideally pass in variables as values to keep things dynamic
@deepeval.set_hyperparameters
def hyperparameters():
    return {
        "chunk_size": 500,
        "temperature": 0,
        "model": "GPT-4",
        "prompt_template": """You are a helpful assistant, answer the following question in a non-judgemental tone.
Question:
{question}
""",
    }
```
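As the comment above suggests, one sketch of keeping these values dynamic is to reuse the same variables your application is actually configured with (the variable names below are hypothetical):

```python
import deepeval

# Hypothetical application configuration; in practice these would come
# from wherever your LLM application defines its settings.
MODEL = "GPT-4"
TEMPERATURE = 0
CHUNK_SIZE = 500
PROMPT_TEMPLATE = """You are a helpful assistant, answer the following question in a non-judgemental tone.
Question:
{question}
"""


@deepeval.set_hyperparameters
def hyperparameters():
    # Returning the shared variables keeps the logged hyperparameters
    # in sync with what the application actually ran with.
    return {
        "chunk_size": CHUNK_SIZE,
        "temperature": TEMPERATURE,
        "model": MODEL,
        "prompt_template": PROMPT_TEMPLATE,
    }
```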
note
This only works if you're running evaluations using `deepeval test run`. If you're not already using `deepeval test run` for evaluations, we highly recommend you start using it.
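For example, the test file above can be executed with `deepeval test run test_example.py`.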
That's all! All test runs will now log hyperparameters for you to compare and optimize against.