Evaluate your agent

In the previous guide, you instrumented your app and explored its traces. That works for a handful of test queries - but you can’t read every response yourself. Evaluations solve this. An evaluation is an automated check - either an LLM judging another LLM’s output, or a deterministic code check - that runs on your production data continuously. By the end of this guide, every response will be automatically scored and you’ll be able to filter to find the ones that need attention.

Evaluations overview showing trace list with evaluation score columns and filters — Evaluate trace data

This is Part 2 of the Arize AX Get Started series. You should have completed the Tracing guide first, with traces flowing into your project.

Choose how you want to work

Use Arize Skills to have your coding agent run evaluations from your editor, Alyx for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or Code to run them programmatically.

By Arize Skills

Use Arize Skills with your coding agent to create an evaluator, run it on traces as a task, and export spans to inspect failures. Install the skills plugin and follow Set up Arize with AI coding agents for authentication and CLI setup. Then, follow the flow below.

Step 1: Create eval

Skill: arize-evaluator

This skill covers only LLM-as-a-Judge evaluators. In your prompt, name the evaluator, state which template fits what you want to test (for example, tool selection, task completion, or hallucination), and tell it which project the evaluator is for and how your span columns map to the template’s inputs. For example, you might say:

Create a hallucination evaluator for my project using the hallucination template. Map the input, output, and context columns to my span attributes.

Note that templates are a starting point - most teams customize the prompt criteria to match their specific rubric. Once the evaluator is created, you can ask your agent to revise it, such as:

Update the evaluator’s criteria: label the output “hallucinated” if it makes any claim that isn’t supported by the provided context, and “factual” only if every claim can be traced back to the context.
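The revised criteria above amount to an all-or-nothing rule over the claims in a response. Here is a minimal Python sketch of that rule, assuming an earlier step (in practice, the LLM judge) has already decided whether each claim is supported by the context. The function name `label_from_claims` is hypothetical and is not part of any Arize API.

```python
from typing import Sequence


def label_from_claims(claim_supported: Sequence[bool]) -> str:
    """Apply the rubric: "factual" only if every claim is supported.

    Hypothetical helper for illustration; the judge model applies this
    rule implicitly when it follows the evaluator's criteria.
    """
    # all() is True for an empty sequence, so a response that makes
    # no claims at all is labeled "factual" under this sketch.
    return "factual" if all(claim_supported) else "hallucinated"


# One unsupported claim is enough to fail the whole response.
print(label_from_claims([True, True, False]))  # prints "hallucinated"
print(label_from_claims([True, True]))         # prints "factual"
```

A strict rule like this is easy to reason about, but it will flag responses with a single minor unsupported detail, so it may be worth stating in the criteria how the judge should treat borderline claims.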

Terminal showing Claude Code loading the arize-evaluator skill to create an evaluator using a template, with column mapping and a follow-up revision. — The skill creating an evaluator that uses a hallucination template.

Step 2: Create a task to run your evaluator

Skill: arize-evaluator

A task connects an evaluator to your project and defines cadence and sampling. See Run online evals on traces for the full UI and configuration options. For example, you might say:

Set up a task to run my evaluator continuously on incoming traces.

Coding agent terminal: choosing how an evaluator task runs (backfill, continuous on new spans, or both), then creating the evaluator and task with CLI commands — Setting up a task to run an evaluator on incoming traces.

Step 3: See evaluation results on your traces

Skill: arize-trace

After an eval task has written labels to spans, export failures for triage. See Viewing results for where scores appear in the UI. For example, you might say:

Export spans from my project where my evaluator failed this week.

By UI

Use the Arize AX UI in Evaluators and Traces to configure your project’s evals end to end.

Step 1: Understand the two types of evaluations

Arize AX supports two kinds of evaluators. LLM-as-a-Judge evaluators use an LLM to assess quality - great for subjective dimensions like helpfulness or groundedness that are hard to check with code. Code-based evaluators are deterministic Python checks, ideal for objective conditions like empty responses or keyword presence. We’ll focus on LLM-as-a-Judge for this example. For code-based evaluators, see Create evaluators.
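To make the distinction concrete, here is a rough sketch of the kind of deterministic check a code-based evaluator might perform, covering the two conditions mentioned above (empty responses and keyword presence). It is plain Python with no Arize imports. The function name, its signature, and the `label`/`score`/`explanation` return shape are assumptions for illustration; see Create evaluators for the interface Arize AX actually expects.

```python
from typing import Dict, Sequence, Union

EvalResult = Dict[str, Union[str, float]]


def check_response(output: str, required_keywords: Sequence[str] = ()) -> EvalResult:
    """Deterministic check: fail on empty output or missing keywords.

    Hypothetical helper, not an Arize API.
    """
    text = output.strip()
    if not text:
        return {"label": "fail", "score": 0.0, "explanation": "Response is empty."}

    # Case-insensitive substring match for each required keyword.
    lowered = text.lower()
    missing = [kw for kw in required_keywords if kw.lower() not in lowered]
    if missing:
        return {
            "label": "fail",
            "score": 0.0,
            "explanation": f"Missing keywords: {', '.join(missing)}",
        }

    return {
        "label": "pass",
        "score": 1.0,
        "explanation": "Response is non-empty and contains all required keywords.",
    }


print(check_response("Refunds are accepted within 30 days.", ["refund"])["label"])  # prints "pass"
print(check_response("   ")["explanation"])  # prints "Response is empty."
```

Because the check is deterministic, the same response always gets the same label, which is what makes this style a good fit for objective conditions and a poor fit for subjective ones like helpfulness.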

Step 2: Create an evaluator

In the left sidebar, click Evaluators, then New Evaluator. Select the evaluator template that best aligns with what you want to test. You might check whether the agent chose the right tool, completed the task, or returned a hallucinated response.

Evaluator template selection showing Hallucination, Relevance, and other templates — The template picker.

Give your evaluator a name.
Select your LLM provider and model (e.g., OpenAI GPT-4o).
Review the template and customize the criteria to match your rubric, or leave it as-is to get started quickly.
Click Create Evaluator.

Configuring an evaluator with LLM provider and template — Configuring an evaluator.

Step 3: Create a task to run your evaluator

An evaluator on its own is just a template. To run it on your data, create a task, an automation that applies your evaluator to incoming traces.

Click New Task and select the evaluator type you created.
Click Add Evaluator and select the evaluator you created in the previous step.
Set the data source as your project, the cadence to Run continuously on new incoming data, and sampling to 100%.
Map your span attributes to the template variables.
Click Create Task.
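The variable-mapping step above can be pictured as a lookup from template variables to span columns, followed by filling in the template. The sketch below is only an illustration of that idea in plain Python. The template text, the span column names (`attributes.input.value` and so on), and the `render_prompt` helper are all assumptions; your project’s actual column names and the built-in template wording will differ.

```python
from typing import Mapping

# Hypothetical judge template with three variables: input, context, output.
TEMPLATE = (
    "Question: {input}\n"
    "Context: {context}\n"
    "Answer: {output}\n"
    "Is the answer fully supported by the context?"
)

# Hypothetical mapping: template variable -> span column name.
VARIABLE_MAPPING = {
    "input": "attributes.input.value",
    "output": "attributes.output.value",
    "context": "attributes.retrieval.documents",
}


def render_prompt(template: str, mapping: Mapping[str, str], span: Mapping[str, str]) -> str:
    """Fill the template using the span columns named in the mapping."""
    missing = [column for column in mapping.values() if column not in span]
    if missing:
        # A mapping that points at a column the span lacks cannot be evaluated.
        raise KeyError(f"Span is missing mapped columns: {missing}")
    values = {variable: span[column] for variable, column in mapping.items()}
    return template.format(**values)


# Made-up span for illustration only.
span = {
    "attributes.input.value": "What is the capital of France?",
    "attributes.output.value": "Paris",
    "attributes.retrieval.documents": "Paris is the capital of France.",
}
print(render_prompt(TEMPLATE, VARIABLE_MAPPING, span))
```

If a mapped column is missing from a span, the judge has nothing to evaluate for that variable, which is why the sketch raises an error instead of silently substituting an empty string.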

Variable mapping panel showing template variables mapped to span attributes

Running Eval Tasks tab showing an evaluator task active — The Running Eval Tasks tab with an evaluator task active.

Step 4: See evaluation results on your traces

Wait a couple of minutes, then go back to your project. You’ll see evaluation scores on each trace. Filter by score to find failures and click into any trace to see the evaluator’s label, score, and explanation.
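If you export the evaluated data and want to apply the same filter offline, the logic is a simple match on the evaluator’s label. The sketch below assumes the export is a list of dictionaries and that each evaluator’s label lives in a column named `eval.<evaluator name>.label`. Both the column naming and the `failing_rows` helper are assumptions for illustration, and the sample rows are made up.

```python
from typing import Any, Dict, List


def failing_rows(
    rows: List[Dict[str, Any]], eval_name: str, failing_label: str
) -> List[Dict[str, Any]]:
    """Return the rows whose label for the given evaluator equals failing_label."""
    label_column = f"eval.{eval_name}.label"  # assumed column naming
    # Rows without the column (not yet evaluated) are skipped, not treated as failures.
    return [row for row in rows if row.get(label_column) == failing_label]


# Made-up rows for illustration only.
rows = [
    {"trace_id": "t1", "eval.hallucination.label": "factual"},
    {"trace_id": "t2", "eval.hallucination.label": "hallucinated"},
    {"trace_id": "t3"},  # no label yet
]

for row in failing_rows(rows, "hallucination", "hallucinated"):
    print(row["trace_id"])  # prints "t2"
```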

Traces list with evaluation score columns showing evaluator labels — A traces list with evaluation score columns.

Trace detail showing an evaluator label with judge explanation — A trace detail with the evaluator's label, score, and explanation.

By Code

Run this workflow from the Python SDK, TypeScript SDK, or ax CLI. Some features are in alpha or beta - please check individual reference pages for details.

Step	Python SDK	TypeScript SDK	CLI
Create an evaluator	Link	Link	Link
Create a task to run your evaluator	Link	Link	Link
See evaluation results on your traces	Link	Link	Link

Congratulations!

Every response your app generates is now automatically scored for quality. You’ve gone from “I think it’s working” to “I can measure exactly how well it’s working.” Instead of manually reviewing traces, you can filter to just the ones that failed, and you have an explanation of what went wrong.

Your evaluations have probably revealed a pattern: some responses may score poorly because your app did not anticipate certain failures. For example, the system prompt might say “be helpful,” but nothing tells the agent to stick to the information it has, or to say “I don’t know” when it doesn’t. That’s a prompt problem, and it’s exactly what we’ll fix next.

Next up: We’ll walk through how to improve your agent using Arize’s Prompt Playground and Experiments features.

Next: Improve Your Agent

Learn more about Evaluations