This guide is part of a sequence: it starts with built-in eval templates, then moves to customizing the judge model, then to defining your own evaluation criteria. Here you configure a judge model, select a pre-built evaluation, and run it on real data—specifically, data derived from Phoenix traces.

The goal is to go from traced application executions to structured quality signals that can be inspected, compared, and logged back to Phoenix. This guide assumes you already have tracing in place and focuses on using evals to measure correctness and behavior.

At its core, an LLM-as-a-judge evaluation combines three things:
The judge model: the LLM that produces the judgment
A prompt template or rubric: the criteria used to make that judgment
Your data: the examples being evaluated
Once you’ve defined what you want to evaluate and selected the data to run on, the next step is configuring the judge model. The choice of model and its invocation settings directly affect how criteria are interpreted and how consistent evaluation results are.

This guide walks through how to configure a judge model and run built-in eval templates using Phoenix Evals. Follow along with these code assets:
Python Tutorial
Companion Python project with runnable examples
TypeScript Tutorial
Companion TypeScript project with runnable examples
Evals need an LLM to act as the judge—the model that applies the rubric to your data. Configuring that judge is the first step. Phoenix Evals is provider-agnostic: you can run evaluations using any supported LLM provider without changing how your evaluators are written.

Across both the Python and TypeScript evals libraries, a judge model is represented as a reusable configuration object. This object describes how Phoenix connects to a model provider, including the provider name, model identifier, credentials, and any SDK-specific client configuration.

Invocation behavior (temperature, token limits, or other generation controls) is configured separately on the evaluator. This separation makes it possible to reuse the same judge model across multiple evals while tuning behavior per evaluation.

The example below illustrates this separation by configuring a judge model independently of any specific evaluator:
Python
TypeScript
from phoenix.evals.llm import LLM

llm = LLM(
    provider="openai",
    model="gpt-4o",
    client="openai",
)
import { openai } from "@ai-sdk/openai";

const base_model = openai("gpt-4o-mini");
In practice, this means you can adjust how a model is called for one eval without affecting others, while keeping provider configuration centralized. For all supported providers and configuration options, see Configuring the LLM.
Phoenix includes a set of built-in eval templates that cover common evaluation tasks such as relevance, correctness, faithfulness, summarization quality, and toxicity. These templates encode a predefined rubric, structured outputs, and defaults that work well for LLM-as-a-judge workflows. You can find all built-in templates here.

Built-in templates are a good choice when you want reliable signal quickly without designing a rubric from scratch, especially early in iteration or when establishing a baseline.

The example below shows a minimal setup using the built-in Correctness eval template with a configured judge model:
Python
TypeScript
from phoenix.evals.metrics import CorrectnessEvaluator

correctness_eval = CorrectnessEvaluator(llm=llm)
print(correctness_eval.describe())
import { createCorrectnessEvaluator } from "@arizeai/phoenix-evals";

const evaluator = createCorrectnessEvaluator({
  model: base_model as any,
});
Once defined, built-in evaluators can be run on tabular data or trace-derived examples and logged back to Phoenix like any other eval. Because they return structured outputs, results can be compared across runs and combined with other evaluations.
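Because outputs are structured rather than free text, comparing results across runs is ordinary data manipulation. The sketch below illustrates the idea with two hypothetical runs; the column names and labels are illustrative, not the library's exact output schema:

```python
import pandas as pd

# Two hypothetical eval runs over the same examples, each yielding a
# structured label per row (column names are illustrative only).
run_a = pd.DataFrame(
    {"example_id": [1, 2, 3], "label": ["correct", "incorrect", "correct"]}
)
run_b = pd.DataFrame(
    {"example_id": [1, 2, 3], "label": ["correct", "correct", "correct"]}
)

# Join the runs on example identity and measure how often labels agree.
merged = run_a.merge(run_b, on="example_id", suffixes=("_a", "_b"))
agreement = (merged["label_a"] == merged["label_b"]).mean()
print(f"label agreement across runs: {agreement:.2f}")  # -> 0.67
```

The same pattern extends to numeric scores or to combining several evaluators' outputs into one frame before analysis.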
With a judge model and evaluator defined, the next step is running evals on real application data. A common workflow is evaluating traced executions and attaching results back to spans in Phoenix. Once attached, you can inspect failures and edge cases in the UI, compare behavior across runs, and use eval results as inputs to datasets and experiments.

1. Export trace spans

Start by exporting spans from a Phoenix project into a tabular structure:
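The export call itself needs a live Phoenix instance (in Python, something like px.Client().get_spans_dataframe(project_name="default"), where the project name is illustrative), so the sketch below hand-builds a DataFrame with the shape such an export returns. The column names mirror a real span export; the rows are invented for illustration:

```python
import pandas as pd

# Stand-in for px.Client().get_spans_dataframe(project_name=...), which
# requires a running Phoenix instance: one row per span, identified by
# span id, with flattened span attributes as columns. Rows are illustrative.
spans_df = pd.DataFrame(
    {
        "context.span_id": ["a1b2c3", "d4e5f6"],
        "name": ["llm_call", "llm_call"],
        "attributes.input.value": ["What is Phoenix?", "What are evals?"],
        "attributes.output.value": [
            "Phoenix is an open-source LLM observability platform.",
            "Evals score model outputs against defined criteria.",
        ],
    }
).set_index("context.span_id")

print(spans_df.shape)  # one row per exported span
```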
Each row represents a span and includes identifiers and attributes captured during execution.

2. Prepare evaluator inputs
Next, select or transform fields from the exported spans so that they match the evaluator’s expected inputs. This often involves extracting nested attributes such as attributes.input.value and attributes.output.value. Input mappings help bridge differences between how data is stored in traces and what evaluators expect.
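Mapping span attributes onto evaluator fields is plain dataframe manipulation. The sketch below hand-builds a small spans frame and renames its columns; the target names "input" and "output" are assumptions here, so check the evaluator's describe() output for its actual required fields:

```python
import pandas as pd

# Exported spans (rows are illustrative; real exports come from Phoenix).
spans_df = pd.DataFrame(
    {
        "attributes.input.value": ["What is Phoenix?"],
        "attributes.output.value": ["Phoenix is an LLM observability platform."],
    }
)

# Map nested trace attribute columns onto the field names the evaluator
# expects. The target names "input" and "output" are assumptions; consult
# the evaluator's describe() output for its actual required fields.
eval_inputs = spans_df.rename(
    columns={
        "attributes.input.value": "input",
        "attributes.output.value": "output",
    }
)

print(list(eval_inputs.columns))  # -> ['input', 'output']
```

With inputs prepared, the evaluator itself is run over the frame (the third step of this workflow). Assuming the phoenix-evals evaluate_dataframe helper, that looks like results_df = evaluate_dataframe(dataframe=eval_inputs, evaluators=[correctness_eval]); results_df is what gets logged back to Phoenix in the final step.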
4. Log results back to Phoenix

Finally, log evaluation results back to Phoenix as span annotations. Phoenix uses span identifiers to associate eval outputs with the correct execution.
Python
TypeScript
from phoenix.client import Client
from phoenix.evals.utils import to_annotation_dataframe

evaluations = to_annotation_dataframe(dataframe=results_df)
Client().spans.log_span_annotations_dataframe(dataframe=evaluations)
Once logged, eval results appear alongside traces in the Phoenix UI, making it possible to analyze execution behavior and quality together.

With built-in evals running on traced data, you can now:
Inspect failures and edge cases
Compare behavior across runs
Use eval results as inputs to datasets and experiments
This completes the core loop from tracing → evaluation → analysis.
At this point, you’ve seen how to run evaluations using Phoenix’s built-in eval templates and attach quality signals to real application executions. This provides a fast way to measure behavior and establish baselines using predefined criteria.

In the next guides, we’ll build on this foundation by customizing different parts of the evaluation workflow. Specifically, the next page walks through how to define a custom LLM judge, including how to configure model behavior and connect to different providers or endpoints. From there, we’ll move into customizing evaluation templates and defining application-specific criteria.

Together, these guides show how to move from out-of-the-box evaluations to fully customized evals tailored to your application.