Run experiment

Run experiments to test model, prompt, or agent changes. Experiments can be run via the UI or via code.

Run experiment via UI

1. Test a prompt in playground

First, create a dataset. Load the dataset you created into the prompt playground and run it to see your results. Once the run has finished, you can save it as an experiment to track your changes.

2. Run an evaluator on your playground experiments

Use evaluators to automatically measure the quality of your experiment results. Once an evaluator is defined, Arize runs it in the background. Evaluators can be either LLM Judges or code-based assessments.

3. Compare experiment results

Each prompt iteration is stored separately, and Arize makes it easy to compare experiment results against each other with Diff Mode.

You can also use Alyx to get automated insights as you compare your experiments, with the ability to both summarize results and highlight key differences across runs.

Run experiment via code

For more details, check out the SDK API reference.

1. Define your dataset

You can create a new dataset or use an existing dataset.

import pandas as pd

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})

arize_client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)
dataset_id = arize_client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)

2. Define a task

A task is any function that you want to run on a dataset. The simplest version of a task looks like the following:

from typing import Dict

def task(dataset_row: Dict):
    return dataset_row

When you create a dataset, each row is stored as a dictionary with attributes you can retrieve within your task. These attributes can include the user input, the expected output for an evaluation task, or metadata.

import openai 

def answer_question(dataset_row) -> str:
    invention = dataset_row.get("attributes.input.value")  # example: "Telephone"
    openai_client = openai.OpenAI()

    response = openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"Who invented {invention}?"}],
        max_tokens=20,
    )
    
    return response.choices[0].message.content

Task inputs

The task function can take the following optional arguments for convenience; each one automatically receives the corresponding dataset_row attribute. The easiest way to access anything you need is the dataset_row argument itself.

| Parameter | Description | Dataset Row Attribute | Example |
| --- | --- | --- | --- |
| dataset_row | the entire row of the data, including every column as a dictionary key | -- | def task_fn(dataset_row): ... |
| input | experiment run input | attributes.input.value | def task_fn(input): ... |
| expected | the expected output | attributes.output.value | def task_fn(expected): ... |
| metadata | metadata for the function | attributes.metadata | def task_fn(metadata): ... |
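
For example, a task can declare only the arguments it needs and skip the rest. The sketch below assumes the inventions dataset from step 1; summarize_row is an illustrative name, not part of the Arize SDK.

def summarize_row(input, expected) -> str:
    # Illustrative task: `input` receives attributes.input.value,
    # `expected` receives attributes.output.value.
    return f"{expected} is credited with inventing the {input.lower()}."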

3. Define an evaluator (Optional)

You can also optionally define an evaluator to assess your task outputs in experiments. These evaluators can be LLM Judges or Code Evaluators. For example, here's a simple code evaluator that checks whether the expected answer appears in the LLM output:

from arize.experimental.datasets.experiments.types import EvaluationResult

def is_correct(output, dataset_row):
    # Substring match: does the expected answer appear in the task output?
    expected = dataset_row.get("attributes.output.value")
    correct = expected in output
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation="Evaluator explanation here",
    )
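
An LLM Judge evaluator follows the same pattern: call a model inside the evaluator and map its verdict to an EvaluationResult. Below is a minimal sketch reusing the OpenAI client pattern from step 2; the prompt wording and the llm_judge name are illustrative assumptions, not part of the Arize SDK.

import openai

from arize.experimental.datasets.experiments.types import EvaluationResult

def llm_judge(output, dataset_row):
    # Illustrative LLM-judge evaluator: asks a model whether the task output
    # agrees with the expected answer stored in the dataset row.
    expected = dataset_row.get("attributes.output.value")
    prompt = (
        f"Expected answer: {expected}\n"
        f"Model answer: {output}\n"
        "Does the model answer agree with the expected answer? "
        "Reply with exactly one word: correct or incorrect."
    )
    response = openai.OpenAI().chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
    )
    verdict = response.choices[0].message.content.strip().lower()
    correct = verdict.startswith("correct")
    return EvaluationResult(
        score=int(correct),
        label="correct" if correct else "incorrect",
        explanation=f"LLM judge verdict: {verdict}",
    )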

4. Run the experiment

Then, use the run_experiment function to run the task function against your dataset, run the evaluation function against the outputs, and log the results and traces to Arize.

arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],  # include your evaluation functions here
    experiment_name="basic-experiment",
    concurrency=10,
    exit_on_error=False,
    dry_run=False,
)

run_experiment accepts several convenience parameters (combined in the sketch after this list):

  • concurrency sets how many dataset rows are processed in parallel, reducing the time to complete the experiment.

  • dry_run=True runs the experiment without logging the results to Arize.

  • exit_on_error=True stops the experiment on the first error, making it easier to debug when an experiment doesn't run correctly.
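
For quick iteration while developing a task, you might combine these parameters. This sketch assumes the same client, dataset, task, and evaluator from the steps above; only the parameter values differ from the call shown earlier.

# Debug run: results are not logged to Arize, and the first error stops the run
arize_client.run_experiment(
    space_id=ARIZE_SPACE_ID,
    dataset_id=dataset_id,
    task=answer_question,
    evaluators=[is_correct],
    experiment_name="debug-experiment",
    concurrency=1,
    exit_on_error=True,
    dry_run=True,
)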

Once your experiment has finished running, you can see your experiment results in the Arize UI.
