From human review to automated evaluation

Once you understand your failure modes through human review, the next step is to automate those checks. Evaluators let you measure quality at scale, turning subjective judgments into measurable results so you can track improvements over time and catch regressions early. Once you create an evaluator, you run it over your data using a task; task setup is covered on the next page.
[Image: Arize AX Evaluators UI showing an LLM-as-a-judge evaluator with name and span scope, judge model and prompt template comparing human ground truth to model output, aligned and not aligned choice labels with scores, optimization direction set to maximize, and version history in the sidebar]

What is an evaluator

An evaluator looks at your data and returns a structured result: some combination of a label (e.g. correct / incorrect), a numeric score, and an explanation. LLM evaluators also have an optimization direction: maximize when higher scores are better, minimize when lower scores are better. This tells Arize how to color results so you can see at a glance what is performing well and what needs attention. Evaluators are versioned, so every change is tracked. When you create an evaluator you also set its scope (span, trace, session, or experiment), which determines what unit of data it sees and where results appear.
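The structured result described above can be pictured as a simple data class. This is a hypothetical shape for illustration only; the actual field names and types in Arize may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    """Illustrative shape of an evaluator's structured output:
    some combination of label, numeric score, and explanation."""
    label: Optional[str] = None        # e.g. "correct" / "incorrect"
    score: Optional[float] = None      # numeric quality score
    explanation: Optional[str] = None  # why the evaluator decided this

result = EvalResult(
    label="correct",
    score=1.0,
    explanation="The answer matches the ground truth.",
)
```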
[Image: Diagram of an LLM-as-a-judge evaluator showing metadata including scope (span, trace, session, and experiment), prompt template with query, reference, and output variables, data injection into the template, and structured output with score, label, and explanation after Run eval]
[Image: Diagram of a code evaluator showing metadata with eval column name and scope (span, trace, session, and experiment), Python CodeEvaluator class and evaluate method, dataset_row span attribute keys as data inputs, and structured output with score, label, and explanation after Run eval]

What kind of eval do I need?

Start with what you learned in error analysis. For each failure mode, ask: is it subjective or deterministic?
  • Subjective or nuanced criteria: use an LLM-as-a-judge evaluator. Examples: helpfulness, tone, correctness, whether a response addresses the user’s concern.
  • Objective, rule-based criteria: use a code evaluator. Examples: JSON validation, regex matching, keyword presence.
Most applications use both. You can attach multiple eval tasks of different types to the same project for layered coverage.

LLM-as-a-Judge

Use an LLM to assess outputs based on a prompt and criteria you define. You can create one from wherever you are in your workflow:
  • Evaluator Hub: Create and manage evaluators to reuse across any project or experiment.
  • Tracing: Create an eval directly from a trace, span, or session when you spot something worth measuring.
  • Datasets and experiments: Set up an eval to score experiment runs against your golden dataset.
  • Prompt Playground: Test and iterate on your eval or run an eval on your prompt experiments.
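As a sketch of what a judge prompt might look like, here is a hypothetical template with `{query}`, `{reference}`, and `{output}` variables and two choice labels; the exact template syntax and rails in Arize may differ:

```python
# Hypothetical LLM-as-a-judge template; the variable names and
# choice labels are illustrative, not the exact Arize syntax.
JUDGE_TEMPLATE = """You are grading a model response.

Question: {query}
Reference answer: {reference}
Model output: {output}

Does the model output agree with the reference answer?
Respond with exactly one word: "aligned" or "not_aligned"."""

RAILS = ["aligned", "not_aligned"]  # allowed labels for the judge

# Fill the template for one row of data before sending it to the judge model.
prompt = JUDGE_TEMPLATE.format(
    query="What is the capital of France?",
    reference="Paris",
    output="The capital of France is Paris.",
)
```

Constraining the judge to a fixed set of labels like this is what makes its output easy to parse and aggregate.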
[Image: Evaluators page with Evaluator Hub tab selected, showing a table of LLM evaluators with scope, judge model, maintainer, and usage]

Setup Instructions

Set up an AI provider integration, write your eval template, map variables to your data, and save to the Evaluator Hub.
Use the Arize skills plugin in your coding agent and the arize-evaluator skill to create evaluators via the ax CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:
  • “Create a hallucination evaluator for my project”
  • “Create an evaluator from blank with correct/incorrect labels”
  • “Update the prompt on my correctness evaluator”
[Image: Terminal showing ax evaluators create for a Hallucination evaluator with Copilot and gpt-4o, success output with evaluator details, and a follow-up prompt to wire the evaluator with Backfill, Continuous, or Both]
[Image: Create Via Agent (Skills) modal with install command, API key and space ID setup, and an example prompt for your coding agent]

Code evaluators

Code evaluators run deterministic Python logic against your trace data. They are faster, cheaper, and more consistent than LLM evals for objective checks.
Use the Arize skills plugin in your coding agent and the arize-evaluator skill to create code evaluators and tasks via the ax CLI without leaving your editor. See the skill doc for supported commands. Then ask your agent:
  • “Create a code evaluator that checks if the output is valid JSON”
  • “Set up a regex evaluator that checks for a phone number in the response”
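As a sketch of the first example above, a JSON-validity code evaluator could look like the following. The class shape and `evaluate` signature follow the code evaluator diagram earlier on this page, but the exact base class, method signature, and attribute keys in Arize may differ:

```python
import json

class ValidJSONEvaluator:
    """Hypothetical code evaluator: checks whether a span's output
    parses as valid JSON. Illustrative only; not the exact Arize API."""

    eval_name = "valid_json"  # column name the results appear under

    def evaluate(self, dataset_row: dict) -> dict:
        output = dataset_row.get("attributes.output.value", "")
        try:
            json.loads(output)
            return {"label": "valid", "score": 1.0,
                    "explanation": "Output parsed as JSON."}
        except (json.JSONDecodeError, TypeError) as err:
            return {"label": "invalid", "score": 0.0,
                    "explanation": f"JSON parse failed: {err}"}

evaluator = ValidJSONEvaluator()
good = evaluator.evaluate({"attributes.output.value": '{"ok": true}'})
bad = evaluator.evaluate({"attributes.output.value": "not json"})
```

Because the logic is deterministic, the same input always produces the same label, which is exactly why code evaluators are preferred for objective checks.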

Evaluator Hub

LLM-as-a-judge evaluators are saved to the Evaluator Hub, your centralized place for managing, versioning, and reusing evaluators. The Evaluators page has two tabs:
  • Eval Hub is where evaluators are defined and managed. Create an evaluator once and attach it to any task (online monitoring, offline batch runs, or dataset experiments) without rewriting prompts or reconfiguring models. Every change is tracked with a version history and commit messages so you know what changed and why.
  • Running Tasks is where evaluation tasks execute evaluators against your data. A task connects an evaluator to a data source and runs it on a schedule or as a one-time batch.
When you attach an evaluator to a task, you map its template variables to your data source columns. This is what makes evaluators portable: the same evaluator works across datasets and projects with different schemas; just update the column mappings.
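The variable-to-column mapping can be pictured as a small dictionary from template variables to one data source's columns. This is a hypothetical illustration; the actual mapping configuration in Arize may differ:

```python
# Hypothetical mapping from evaluator template variables to the
# columns of one particular data source; names are illustrative.
COLUMN_MAPPING = {
    "query": "attributes.input.value",
    "reference": "attributes.reference.value",
    "output": "attributes.output.value",
}

def render_prompt(template: str, row: dict, mapping: dict) -> str:
    """Fill the evaluator template using this data source's columns."""
    variables = {var: row[col] for var, col in mapping.items()}
    return template.format(**variables)

row = {
    "attributes.input.value": "What is 2 + 2?",
    "attributes.reference.value": "4",
    "attributes.output.value": "2 + 2 equals 4.",
}
prompt = render_prompt(
    "Q: {query}\nRef: {reference}\nOut: {output}", row, COLUMN_MAPPING
)
```

To reuse the same evaluator on a dataset with a different schema, only `COLUMN_MAPPING` changes; the template itself stays untouched.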

Eval best practices

Binary vs Score Evals

Should I Use the Same LLM for my Eval as My Agent?

Eval Cookbooks

Further reading