
What is a task?

A task connects your evaluator to a data source and defines what to score and how often. You create an evaluator once and reuse it across tasks — pointing it at different projects, datasets, or experiments. Results attach automatically and surface in your project or experiment. Most teams start with a one-time backfill on historical data to establish a baseline, then set up an ongoing task from there. Before creating a task, make sure you have traces flowing into Arize and an LLM provider configured. See AI Provider Integrations.
[Diagram: workflow from creating an evaluator in the Eval Hub, to creating a task with a target and sampling rate, to runs over tracing or experiment data, to viewing results and task logs, to investigating individual evals and traces, with a loop back to edit or improve the evaluator]

Start from real traces

Before automating, review real interactions in your tracing project to understand where things go wrong. Group failure patterns into a taxonomy — each category can map to an evaluator or filter. To capture those categories as structured labels, see Human review.
[Screenshot: an Arize AX tracing project with summary cards for traffic, span latency, tokens, and cost; a traces table with LLM rows and input and output columns; filters and a date range; and the Ask Alyx panel open on the right]
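A taxonomy can start as a plain mapping from failure category to the evaluator and span filter that will catch it. The sketch below is illustrative only; the evaluator names and filter keys are examples, not Arize identifiers:

```python
# Illustrative failure taxonomy: each category maps to the evaluator that
# scores it and the span filter that scopes it. All names are examples,
# not actual Arize evaluator or attribute identifiers.
FAILURE_TAXONOMY = {
    "hallucination":      {"evaluator": "hallucination_judge", "filter": {"span_kind": "LLM"}},
    "wrong_tool_call":    {"evaluator": "tool_call_accuracy",  "filter": {"span_kind": "TOOL"}},
    "off_topic_response": {"evaluator": "relevance_judge",     "filter": {"span_kind": "LLM"}},
    "formatting_errors":  {"evaluator": "json_validity_check", "filter": {"span_kind": "LLM"}},
}

def evaluators_for(span_kind: str) -> list:
    """List the evaluators whose filters match a given span kind."""
    return sorted(entry["evaluator"] for entry in FAILURE_TAXONOMY.values()
                  if entry["filter"]["span_kind"] == span_kind)
```

Writing the taxonomy down this way makes the later task-creation step mechanical: each entry becomes one evaluator plus one filter.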

Create a task

There are several ways to create a task and run your evaluator on traces.
[Screenshot: the New Task side panel on the Evaluators page, with a task name, project and trace source with an LLM span filter, an added span evaluator, Run Continuously on at 100% sampling, and Create Task]
Use the arize-evaluator skill to create and trigger tasks via the ax CLI without leaving your editor. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
  • “Create a continuous task to run my hallucination evaluator on my project”
  • “Trigger a backfill eval run on my project for the last 7 days”
  • “Set up a task that only evaluates LLM spans”
[Screenshot: a terminal running ax tasks create for a Hallucination Monitor continuous task, succeeding with an LLM span filter and input/output column mapping, followed by the agent explaining LLM-only span scoring]

Task configuration

Sampling rate

Typical sampling rates and when to use them:
  • 100%: Low-volume or critical applications where you want to evaluate every trace
  • 10–50%: High-volume applications balancing cost and coverage
  • 1–5%: Very high-volume applications where representative sampling is enough
Start at 10–20% and increase once you have validated your evaluator is working correctly.
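Conceptually, a sampling rate is a per-trace inclusion decision. A minimal sketch (not Arize's implementation) that hashes the trace ID so reruns make the same decision for the same trace:

```python
import hashlib

def should_evaluate(trace_id: str, sampling_rate: float) -> bool:
    """Deterministically decide whether a trace falls inside the sample.

    Hashing the trace ID (rather than calling random()) means the same
    trace always gets the same decision, so reruns are reproducible.
    """
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sampling_rate * 10_000

# At a 20% rate, roughly 1 in 5 traces is selected for evaluation.
selected = [t for t in (f"trace-{i}" for i in range(1_000))
            if should_evaluate(t, 0.20)]
```

At 100% every trace passes; lowering the rate keeps a stable, representative subset rather than an arbitrary one.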

Filters

Use filters to target specific subsets of your data:
  • Span kind: Only evaluate specific span types (for example LLM spans)
  • Model name: Only evaluate spans from a specific model
  • Metadata: Only evaluate spans with certain metadata tags
  • Span attributes: Filter on any span attribute
[Screenshot: the New Task panel with a target project and traces, a span-kind query for LLM spans, Add Evaluator, Run Continuously with sampling, One-Time Backfill, and Advanced options including LLM Override and Enable Tracing]
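Each filter is effectively a predicate that a span must pass before the evaluator runs. A sketch of how such a predicate composes, using plain dicts with illustrative keys rather than Arize's actual span schema:

```python
def matches_filters(span: dict, span_kind=None, model_name=None, metadata=None) -> bool:
    """Return True only if a span passes every configured filter.

    `span` is a plain dict standing in for a span record; the keys below
    ("span_kind", "attributes", "metadata") are illustrative, not the
    real Arize attribute names.
    """
    if span_kind and span.get("span_kind") != span_kind:
        return False
    if model_name and span.get("attributes", {}).get("llm.model_name") != model_name:
        return False
    for key, value in (metadata or {}).items():
        if span.get("metadata", {}).get(key) != value:
            return False
    return True

spans = [
    {"span_kind": "LLM", "attributes": {"llm.model_name": "gpt-4o"}, "metadata": {"env": "prod"}},
    {"span_kind": "CHAIN", "attributes": {}, "metadata": {"env": "prod"}},
]
llm_prod = [s for s in spans if matches_filters(s, span_kind="LLM", metadata={"env": "prod"})]
```

Because filters AND together, narrowing by span kind first (for example, LLM spans only) is usually the cheapest way to cut evaluation volume before adding finer metadata conditions.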

Run evals continuously

For tasks that use Run continuously on new data, evaluators from the Eval Hub (including pre-built LLM judge templates) run on incoming traces on a rolling schedule. When you create a task and add an evaluator, you can pick a template from the hub before mapping columns and saving. On the Evaluators page, the Running Eval Tasks tab lists every task, its target and evaluators, a snapshot of the last few runs, and View Logs when you need execution details.
[Screenshot: the Running Eval Tasks tab showing a table of task names, project or dataset targets, attached evaluators, created and last-run times, status pills for the last five runs, and View Logs actions]

Viewing results

Once a task runs, evaluation results attach automatically to your spans. Open any trace in the Tracing view and use the evaluation panel on each span to inspect labels, scores, and explanations. To check task status, view run timing, see counts of successes and errors, or troubleshoot a failed run, navigate to the Running Eval Tasks tab on the Evaluators page and open any task. From the logs you can also click View Traces to jump directly to the evaluated spans with the same filters applied.
[Screenshot: a traces table with a Span Evaluations column showing eval labels such as dietary adherence marked correct per trace, alongside latency, token, and Ask Alyx columns]

Further reading