From human review to automated evaluation

Once you understand your failure modes through human review, the next step is to automate those checks. Evaluators let you measure quality at scale, turning subjective judgments into measurable results so you can track improvements over time and catch regressions early. Once you create an evaluator, you run it over your data using a task; task setup is covered on the next page.
[Image: Arize AX Evaluators UI showing an LLM-as-a-judge evaluator with name and span scope, judge model and prompt template comparing human ground truth to model output, aligned and not aligned choice labels with scores, optimization direction set to maximize, and version history in the sidebar]

What is an evaluator

An evaluator looks at your data and returns a structured result: some combination of a label (e.g. correct / incorrect), a numeric score, and an explanation. LLM evaluators also have an optimization direction: maximize when higher scores are better, minimize when lower scores are better. This tells Arize how to color results so you can see at a glance what is performing well and what needs attention. Evaluators are versioned, so every change is tracked. When you create an evaluator you also set its scope (span, trace, session, or experiment), which determines what unit of data it sees and where results appear.
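The structured result described above can be pictured as a simple data class. This is a hypothetical shape for illustration only; the actual field names and types in Arize may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    """Illustrative shape of an evaluator's structured output:
    some combination of label, numeric score, and explanation."""
    label: Optional[str] = None        # e.g. "correct" / "incorrect"
    score: Optional[float] = None      # numeric quality score
    explanation: Optional[str] = None  # why the evaluator decided this

result = EvalResult(
    label="correct",
    score=1.0,
    explanation="The answer matches the ground truth.",
)
```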
[Image: Diagram of an LLM-as-a-judge evaluator showing metadata including scope (span, trace, session, and experiment), prompt template with query, reference, and output variables, data injection into the template, and structured output with score, label, and explanation after Run eval]
[Image: Diagram of a code evaluator showing metadata with eval column name and scope (span, trace, session, and experiment), Python CodeEvaluator class and evaluate method, dataset_row span attribute keys as data inputs, and structured output with score, label, and explanation after Run eval]

What kind of eval do I need?

Start with what you learned in error analysis. For each failure mode, ask: is it subjective or deterministic?
  • Subjective or nuanced criteria: use an LLM-as-a-judge evaluator. Examples: helpfulness, tone, correctness, whether a response addresses the user’s concern.
  • Objective, rule-based criteria: use a code evaluator. Examples: JSON validation, regex matching, keyword presence.
Most applications use both. You can attach multiple eval tasks of different types to the same project for layered coverage.

LLM-as-a-Judge

Use an LLM to assess outputs based on a prompt and criteria you define. You can create one from wherever you are in your workflow:
  • Evaluator Hub: Create and manage evaluators to reuse across any project or experiment.
  • Tracing: Create an eval directly from a trace, span, or session when you spot something worth measuring.
  • Datasets and experiments: Set up an eval to score experiment runs against your golden dataset.
  • Prompt Playground: Test and iterate on your eval or run an eval on your prompt experiments.
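As a sketch of what a judge prompt might look like, here is a hypothetical template with `{query}`, `{reference}`, and `{output}` variables and two choice labels; the exact template syntax and rails in Arize may differ:

```python
# Hypothetical LLM-as-a-judge template; the variable names and
# choice labels are illustrative, not the exact Arize syntax.
JUDGE_TEMPLATE = """You are grading a model response.

Question: {query}
Reference answer: {reference}
Model output: {output}

Does the model output agree with the reference answer?
Respond with exactly one word: "aligned" or "not_aligned"."""

RAILS = ["aligned", "not_aligned"]  # allowed labels for the judge

# Fill the template for one row of data before sending it to the judge model.
prompt = JUDGE_TEMPLATE.format(
    query="What is the capital of France?",
    reference="Paris",
    output="The capital of France is Paris.",
)
```

Constraining the judge to a fixed set of labels like this is what makes its output easy to parse and aggregate.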
[Image: Evaluators page with Evaluator Hub tab selected, showing a table of LLM evaluators with scope, judge model, maintainer, and usage]

Setup Instructions

Set up an AI provider integration, write your eval template, map variables to your data, and save to the Evaluator Hub.
Use the Arize skills plugin in your coding agent and the arize-evaluator skill to create evaluators via the ax CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:
  • “Create a hallucination evaluator for my project”
  • “Create an evaluator from blank with correct/incorrect labels”
  • “Update the prompt on my correctness evaluator”
[Image: Terminal showing ax evaluators create for a Hallucination evaluator with Copilot and gpt-4o, success output with evaluator details, and a follow-up prompt to wire the evaluator with Backfill, Continuous, or Both]
[Image: Create Via Agent (Skills) modal with install command, API key and space ID setup, and an example prompt for your coding agent]

Code evaluators

Code evaluators run deterministic Python logic against your trace data. They are faster, cheaper, and more consistent than LLM evals for objective checks.
Use the Arize skills plugin in your coding agent and the arize-evaluator skill to create code evaluators and tasks via the ax CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:
  • “Create a code evaluator that checks if the output is valid JSON”
  • “Set up a regex evaluator that checks for a phone number in the response”
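As a sketch of the first example above, a JSON-validity code evaluator could look like the following. The class shape and `evaluate` signature follow the code evaluator diagram earlier on this page, but the exact base class, method signature, and attribute keys in Arize may differ:

```python
import json

class ValidJSONEvaluator:
    """Hypothetical code evaluator: checks whether a span's output
    parses as valid JSON. Illustrative only; not the exact Arize API."""

    eval_name = "valid_json"  # column name the results appear under

    def evaluate(self, dataset_row: dict) -> dict:
        output = dataset_row.get("attributes.output.value", "")
        try:
            json.loads(output)
            return {"label": "valid", "score": 1.0,
                    "explanation": "Output parsed as JSON."}
        except (json.JSONDecodeError, TypeError) as err:
            return {"label": "invalid", "score": 0.0,
                    "explanation": f"JSON parse failed: {err}"}

evaluator = ValidJSONEvaluator()
good = evaluator.evaluate({"attributes.output.value": '{"ok": true}'})
bad = evaluator.evaluate({"attributes.output.value": "not json"})
```

Because the logic is deterministic, the same input always produces the same label, which is exactly why code evaluators are preferred for objective checks.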

Evaluator Hub

LLM-as-a-judge evaluators are saved to the Evaluator Hub, your centralized place for managing, versioning, and reusing evaluators. The Evaluators page has two tabs:
  • Eval Hub is where evaluators are defined and managed. Create an evaluator once and attach it to any task (online monitoring, offline batch runs, or dataset experiments) without rewriting prompts or reconfiguring models. Every change is tracked with a version history and commit messages so you know what changed and why.
  • Running Tasks is where evaluation tasks execute evaluators against your data. A task connects an evaluator to a data source and runs it on a schedule or as a one-time batch.
When you attach an evaluator to a task, you map its template variables to your data source columns. This is what makes evaluators portable: the same evaluator works across datasets and projects with different schemas; just update the column mappings.
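The variable-to-column mapping can be pictured as a small dictionary from template variables to one data source's columns. This is a hypothetical illustration; the actual mapping configuration in Arize may differ:

```python
# Hypothetical mapping from evaluator template variables to the
# columns of one particular data source; names are illustrative.
COLUMN_MAPPING = {
    "query": "attributes.input.value",
    "reference": "attributes.reference.value",
    "output": "attributes.output.value",
}

def render_prompt(template: str, row: dict, mapping: dict) -> str:
    """Fill the evaluator template using this data source's columns."""
    variables = {var: row[col] for var, col in mapping.items()}
    return template.format(**variables)

row = {
    "attributes.input.value": "What is 2 + 2?",
    "attributes.reference.value": "4",
    "attributes.output.value": "2 + 2 equals 4.",
}
prompt = render_prompt(
    "Q: {query}\nRef: {reference}\nOut: {output}", row, COLUMN_MAPPING
)
```

To reuse the same evaluator on a dataset with a different schema, only `COLUMN_MAPPING` changes; the template itself stays untouched.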

Eval best practices

Binary vs Score Evals

Should I Use the Same LLM for my Eval as My Agent?

Eval Cookbooks

Further reading