
What is a task?

A task connects your evaluator to a data source and defines what to score and how often. You create an evaluator once and reuse it across tasks — pointing it at different projects, datasets, or experiments. Results attach automatically and surface in your project or experiment. Most teams start with a one-time backfill on historical data to establish a baseline, then set up an ongoing task from there. Before creating a task, make sure you have traces flowing into Arize and an LLM provider configured. See AI Provider Integrations.
[Diagram: workflow from creating an evaluator in the Eval Hub, to creating a task with a target and sampling rate, to runs over tracing or experiment data, to viewing results and task logs, to investigating individual evals and traces, with a loop back to edit or improve the evaluator]

Start from real traces

Before automating, review real interactions in your tracing project to understand where things go wrong. Group failure patterns into a taxonomy — each category can map to an evaluator or filter. To capture those categories as structured labels, see Human review.
[Screenshot: an Arize AX tracing project with summary cards for traffic, span latency, tokens, and cost; a traces table with LLM rows and input and output columns; filters and a date range; and the Ask Alyx panel open on the right]
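A taxonomy can start as a plain mapping from failure category to the evaluator and span filter that will catch it. The sketch below is illustrative only; the evaluator names and filter keys are examples, not Arize identifiers:

```python
# Illustrative failure taxonomy: each category maps to the evaluator that
# scores it and the span filter that scopes it. All names are examples,
# not actual Arize evaluator or attribute identifiers.
FAILURE_TAXONOMY = {
    "hallucination":      {"evaluator": "hallucination_judge", "filter": {"span_kind": "LLM"}},
    "wrong_tool_call":    {"evaluator": "tool_call_accuracy",  "filter": {"span_kind": "TOOL"}},
    "off_topic_response": {"evaluator": "relevance_judge",     "filter": {"span_kind": "LLM"}},
    "formatting_errors":  {"evaluator": "json_validity_check", "filter": {"span_kind": "LLM"}},
}

def evaluators_for(span_kind: str) -> list:
    """List the evaluators whose filters match a given span kind."""
    return sorted(entry["evaluator"] for entry in FAILURE_TAXONOMY.values()
                  if entry["filter"]["span_kind"] == span_kind)
```

Writing the taxonomy down this way makes the later task-creation step mechanical: each entry becomes one evaluator plus one filter.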

Create a task

There are several ways to create a task and run your evaluator on traces.
[Screenshot: the New Task side panel on the Evaluators page, with a task name, project and trace source with an LLM span filter, an added span evaluator, Run Continuously on at 100% sampling, and Create Task]
Use the arize-evaluator skill to create and trigger tasks via the ax CLI without leaving your editor. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
  • “Create a continuous task to run my hallucination evaluator on my project”
  • “Trigger a backfill eval run on my project for the last 7 days”
  • “Set up a task that only evaluates LLM spans”
[Screenshot: a terminal running ax tasks create for a Hallucination Monitor continuous task, succeeding with an LLM span filter and input/output column mapping, followed by the agent explaining LLM-only span scoring]

Task configuration

Sampling rate

Typical sampling rates and when to use them:
  • 100%: Low-volume or critical applications where you want to evaluate every trace
  • 10–50%: High-volume applications balancing cost and coverage
  • 1–5%: Very high-volume applications where representative sampling is enough
Start at 10–20% and increase once you have validated your evaluator is working correctly.
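Conceptually, a sampling rate is a per-trace inclusion decision. A minimal sketch (not Arize's implementation) that hashes the trace ID so reruns make the same decision for the same trace:

```python
import hashlib

def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Deterministically decide whether a trace falls inside the sample.

    Hashing the trace ID (rather than calling random()) means the same
    trace always gets the same decision, so reruns are reproducible.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sampling_rate * 10_000

# At a 20% rate, roughly 1 in 5 traces is selected for evaluation.
selected = [t for t in (f"trace-{i}" for i in range(1_000))
            if should_evaluate(t, 0.20)]
```

At 100% every trace passes; lowering the rate keeps a stable, representative subset rather than an arbitrary one.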

Filters

Use filters to target specific subsets of your data:
  • Span kind: Only evaluate specific span types (for example LLM spans)
  • Model name: Only evaluate spans from a specific model
  • Metadata: Only evaluate spans with certain metadata tags
  • Span attributes: Filter on any span attribute
[Screenshot: the New Task panel with a target project and traces, a span-kind query for LLM spans, Add Evaluator, Run Continuously with sampling, One-Time Backfill, and Advanced options including LLM Override and Enable Tracing]
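Each filter is effectively a predicate that a span must pass before the evaluator runs. A sketch of how such a predicate composes, using plain dicts with illustrative keys rather than Arize's actual span schema:

```python
def matches_filters(span: dict, span_kind=None, model_name=None, metadata=None) -> bool:
    """Return True only if a span passes every configured filter.

    `span` is a plain dict standing in for a span record; the keys below
    ("span_kind", "attributes", "metadata") are illustrative, not the
    real Arize attribute names.
    """
    if span_kind and span.get("span_kind") != span_kind:
        return False
    if model_name and span.get("attributes", {}).get("llm.model_name") != model_name:
        return False
    for key, value in (metadata or {}).items():
        if span.get("metadata", {}).get(key) != value:
            return False
    return True

spans = [
    {"span_kind": "LLM", "attributes": {"llm.model_name": "gpt-4o"}, "metadata": {"env": "prod"}},
    {"span_kind": "CHAIN", "attributes": {}, "metadata": {"env": "prod"}},
]
llm_prod = [s for s in spans if matches_filters(s, span_kind="LLM", metadata={"env": "prod"})]
```

Because filters AND together, narrowing by span kind first (for example, LLM spans only) is usually the cheapest way to cut evaluation volume before adding finer metadata conditions.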

Run evals continuously

For tasks that use Run continuously on new data, evaluators from the Eval Hub (including pre-built LLM judge templates) run on incoming traces on a rolling schedule. When you create a task and add an evaluator, you can pick a template from the hub before mapping columns and saving. On the Evaluators page, the Running Eval Tasks tab lists every task, its target and evaluators, a snapshot of the last few runs, and View Logs when you need execution details.
[Screenshot: the Running Eval Tasks tab showing a table of task names, project or dataset targets, attached evaluators, created and last-run times, status pills for the last five runs, and View Logs actions]

Viewing results

Once a task runs, evaluation results attach automatically to your spans. Open any trace in the Tracing view and use the evaluation panel on each span to inspect labels, scores, and explanations. To check task status, view run timing, see counts of successes and errors, or troubleshoot a failed run, navigate to the Running Eval Tasks tab on the Evaluators page and open any task. From the logs you can also click View Traces to jump directly to the evaluated spans with the same filters applied.
[Screenshot: a traces table with a Span Evaluations column showing eval labels such as dietary adherence marked correct per trace, alongside latency, token, and Ask Alyx columns]

Further reading