Evaluate experiment

How to write the functions to evaluate your task outputs in experiments

Evaluate experiment via UI

To run your first evaluation on experiments:

  1. Navigate to Evaluators on your experiment page and select Add Evaluator.

  2. Define your evaluator.

  3. Choose the experiments you want to evaluate from the dropdown menu.

  4. View the experiment results.

Need help writing a custom evaluator template? Use ✨Alyx to write one for you.

Evaluate experiment via Code

Here's the simplest version of an evaluation function:

def is_true(output):
    # output is the task output
    return output == True

Evaluation Inputs

The evaluator function can take the following optional arguments:

| Parameter name | Description | Example |
| --- | --- | --- |
| dataset_row | The entire row of the dataset, with every column available as a dictionary key | def eval(dataset_row): ... |
| input | The experiment run input, mapped to attributes.input.value | def eval(input): ... |
| output | The experiment run output | def eval(output): ... |
| dataset_output | The expected output, if available, mapped to attributes.output.value | def eval(dataset_output): ... |
| metadata | The dataset_row metadata, mapped to attributes.metadata | def eval(metadata): ... |
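
An evaluator can declare any combination of these parameters, and the experiment runner passes them in by name. Below is a minimal sketch; the matches_expected name and the exact-string comparison are illustrative only, not a prescribed pattern:

def matches_expected(output, dataset_output):
    # output is what the task produced; dataset_output is the expected value from the dataset
    if dataset_output is None:
        return 0.0
    return float(str(output).strip() == str(dataset_output).strip())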

Evaluation Outputs

We support several types of evaluation outputs. A label must be a string, a score must be a float between 0.0 and 1.0, and an explanation must be a string.

| Evaluator Output Type | Example | How it appears in Arize |
| --- | --- | --- |
| boolean | True | label = 'True', score = 1.0 |
| float | 1.0 | score = 1.0 |
| string | "reasonable" | label = 'reasonable' |
| tuple | (1.0, "my explanation notes") | score = 1.0, explanation = 'my explanation notes' |
| tuple | ("True", 1.0, "my explanation") | label = 'True', score = 1.0, explanation = 'my explanation' |
| EvaluationResult | EvaluationResult(score=1.0, label='reasonable', explanation='explanation', metadata={}) | score = 1.0, label = 'reasonable', explanation = 'explanation', metadata = {} |
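
As the table shows, a plain tuple is enough when you only need a score and an explanation. A minimal sketch; the has_link name and the substring check are illustrative only:

def has_link(output):
    # Returning a (score, explanation) tuple records both fields
    # without constructing an EvaluationResult
    found = "http" in str(output)
    explanation = "output contains a link" if found else "no link found in output"
    return (1.0 if found else 0.0, explanation)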

To use the EvaluationResult class, add the following import statement:

  • from arize.experimental.datasets.experiments.types import EvaluationResult

Note that at least one of label or score must be supplied (you can't have an evaluation with no result).

Here is an example of an evaluator that compares the output to a value in the dataset_row:

from arize.experimental.datasets.experiments.types import EvaluationResult
import pandas as pd

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

def is_correct(output, dataset_row):
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here"
    )

To run the experiment, pass the evaluator to run_experiment as follows:

arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID, 
    dataset_id=dataset_id,
    task=answer_question, 
    evaluators=[is_correct], #include your evaluation functions here 
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

Create an LLM Evaluator

LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs.

Arize supports a large number of LLM evaluators out of the box with LLM Classify: Arize Templates. You can also define custom LLM evaluators.
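
For example, you can wrap a prebuilt Phoenix template in an evaluator function. The sketch below assumes the HALLUCINATION_PROMPT_TEMPLATE and HALLUCINATION_PROMPT_RAILS_MAP exports from phoenix.evals and the gpt-4o-mini model name; check the template's variable names and rails against your Phoenix version before using it:

from phoenix.evals import (
    llm_classify,
    OpenAIModel,
    HALLUCINATION_PROMPT_TEMPLATE,
    HALLUCINATION_PROMPT_RAILS_MAP,
)
from arize.experimental.datasets.experiments.types import EvaluationResult
import pandas as pd

def hallucination_eval(output, dataset_row):
    # Map dataset columns onto the template variables; the hallucination template
    # expects input, reference, and output (verify against your Phoenix version)
    eval_df = llm_classify(
        dataframe=pd.DataFrame([{
            "input": dataset_row.get("attributes.input.value"),
            "reference": dataset_row.get("attributes.output.value"),
            "output": output,
        }]),
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        provide_explanation=True,
    )
    label = eval_df["label"][0]
    return EvaluationResult(
        label=label,
        # "factual" is the passing label in the prebuilt rails map
        score=1.0 if label == "factual" else 0.0,
        explanation=eval_df["explanation"][0],
    )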

Here's an example of a custom LLM evaluator that checks the model output for correctness:

CORRECTNESS_PROMPT_TEMPLATE = """
You are given an invention (input) and an inventor (output). Determine whether the inventor correctly corresponds to the invention.

[BEGIN DATA]
[Invention]: {invention}
[Output]: {output}
[END DATA]

Explain your reasoning step by step, then provide a single-word LABEL at the end: either "correct" or "incorrect".

Format:

EXPLANATION: Your reasoning about why the output is correct or incorrect
LABEL: "correct" or "incorrect"
************
"""

Run Evaluation

from phoenix.evals import llm_classify, OpenAIModel
from arize.experimental.datasets.experiments.types import EvaluationResult
import pandas as pd

def correctness_eval(output, dataset_row):
    # The invention is the input for this dataset row
    invention = dataset_row.get("attributes.input.value")
    
    eval_df = llm_classify(
        dataframe=pd.DataFrame([{"invention": invention, "output": output}]),
        template=CORRECTNESS_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    
    # Map the eval df to EvaluationResult
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = eval_df["explanation"][0]
    
    return EvaluationResult(label=label, score=score, explanation=explanation)

In this example, correctness_eval evaluates whether the output of an experiment is correct. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.

Once you define your evaluator function, you can use it in your experiment run like this:

arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[correctness_eval],
    experiment_name="test-experiment",
)

You can customize LLM evaluators to suit your experiment's needs: update the template with your instructions and the rails with your expected output labels.

Create a Code Evaluator

Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.

Custom Code Evaluators

Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.

For example, let’s say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range:

def in_bounds(output):
    return 1 <= output <= 100

By passing the in_bounds function to run_experiment, evaluations will automatically be generated for each experiment run, indicating whether the output is within the allowed range. This allows you to quickly assess the validity of your experiment’s outputs based on custom criteria.

experiment = arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=DATASET_ID,
    task=answer_question,
    evaluators=[in_bounds],
    experiment_name=experiment_name,
)

Prebuilt Phoenix Code Evaluators

You can also leverage our open-source Phoenix pre-built code evaluators.

Pre-built evaluators in phoenix.experiments.evaluators can be passed directly to the evaluators parameter when running experiments.
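
For example, prebuilt evaluators can be passed alongside custom functions in a single run. The sketch below assumes ContainsAnyKeyword and MatchesRegex are among the evaluators exposed by phoenix.experiments.evaluators; verify the names and constructor arguments against the Phoenix docs for your version:

from phoenix.experiments.evaluators import ContainsAnyKeyword, MatchesRegex

experiment = arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=DATASET_ID,
    task=answer_question,
    evaluators=[
        # Prebuilt: passes if the output contains any of the listed keywords
        ContainsAnyKeyword(keywords=["Bell", "Edison"]),
        # Prebuilt: passes if the output matches the regular expression
        MatchesRegex(pattern=r"[A-Z][a-z]+ [A-Z][a-z]+"),
        # Custom functions can be mixed in alongside prebuilt evaluators
        is_correct,
    ],
    experiment_name="prebuilt-evaluators-experiment",
)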
