Pydantic Evals is an evaluation library that provides built-in deterministic evaluators as well as LLM-as-a-judge evaluations. It runs evaluations over datasets of test cases defined with Pydantic models. This guide shows you how to use Pydantic Evals alongside Arize Phoenix to run evaluations on traces captured from your running application.
```python
import os

# Add Phoenix API Key for tracing
PHOENIX_API_KEY = "ADD YOUR API KEY"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
```
Your Phoenix API key can be found in the Keys section of your dashboard.

Launch your local Phoenix instance:
```bash
pip install arize-phoenix
phoenix serve
```
For details on customizing a local terminal deployment, see Terminal Setup.
For more info on using Phoenix with Docker, see Docker.
Install packages:
```bash
pip install arize-phoenix
```
Launch Phoenix:
```python
import phoenix as px

px.launch_app()
```
By default, notebook instances do not have persistent storage, so your traces will disappear after the notebook is closed. See self-hosting or use one of the other deployment options to retain traces.
First, create some example traces by running your AI application. Here’s a simple example:
```python
from openai import OpenAI
import os

client = OpenAI()

inputs = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
]

def generate_trace(input):
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Only respond with the answer to the question as a single word or proper noun.",
            },
            {"role": "user", "content": input},
        ],
    )

for input in inputs:
    generate_trace(input)
```
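Note that the calls above only show up as traces in Phoenix if OpenAI has been instrumented before they run. One way to set that up, as a sketch assuming the `openinference-instrumentation-openai` package is installed (the project name is just an example), is:

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Run this before making any OpenAI calls: point the OpenTelemetry tracer
# at your Phoenix instance and instrument the OpenAI client.
tracer_provider = register(project_name="pydantic-evals-quickstart")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```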
Create a dataset of test cases using Pydantic Evals:
```python
from pydantic_evals import Case, Dataset

cases = [
    Case(
        name="capital of France",
        inputs="What is the capital of France?",
        expected_output="Paris",
    ),
    Case(
        name="author of Romeo and Juliet",
        inputs="Who wrote Romeo and Juliet?",
        expected_output="William Shakespeare",
    ),
    Case(
        name="largest planet",
        inputs="What is the largest planet in our solar system?",
        expected_output="Jupiter",
    ),
]

dataset = Dataset(cases=cases)
```
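You can attach a simple built-in check to the dataset before reaching for an LLM judge. As a sketch, assuming the `EqualsExpected` evaluator shipped with `pydantic_evals`, an exact-match evaluator can be added like this:

```python
from pydantic_evals.evaluators import EqualsExpected

# Passes only when the task output exactly equals the case's expected_output.
dataset.add_evaluator(EqualsExpected())
```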
For more sophisticated evaluation, add an LLM judge:
```python
from pydantic_evals.evaluators import LLMJudge

dataset.add_evaluator(
    LLMJudge(
        rubric="Output and Expected Output should represent the same answer, even if the text doesn't match exactly",
        include_input=True,
        model="openai:gpt-4o-mini",
    ),
)
```
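With the judge attached, you can run the dataset against a task function. The following is a minimal sketch: it assumes an async task (here called `answer_question`, an illustrative name) that calls the same model as the traced application, and uses `Dataset.evaluate_sync` to execute every case and print a report.

```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def answer_question(question: str) -> str:
    # The task under evaluation: answer with a single word or proper noun,
    # mirroring the traced application above.
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Only respond with the answer to the question as a single word or proper noun.",
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content or ""

# Run every case through the task and the attached evaluators, then print a summary.
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
```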
Once you have evaluation results uploaded to Phoenix (one way to attach them to your spans is sketched after this list), you can:

- View evaluation metrics: see overall performance across different evaluation criteria
- Analyze individual cases: drill down into specific examples that passed or failed
- Compare evaluators: understand how different evaluation methods perform
- Track improvements: monitor evaluation scores over time as you improve your application
- Debug failures: identify patterns in failed evaluations to guide improvements
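How results get into Phoenix depends on how you map evaluation outputs back to the traced spans, which is application-specific. The snippet below is a minimal sketch rather than a prescribed workflow: it assumes you can look up the span ID for each case (for example via `px.Client().get_spans_dataframe()`) and uses Phoenix's `SpanEvaluations` to attach a score and label to each span; the span IDs and values shown are placeholders.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# Hypothetical mapping from each evaluated case to the span it was traced under.
# Span IDs can be looked up with px.Client().get_spans_dataframe(); the values
# below are placeholders.
evals_df = pd.DataFrame(
    {"score": [1, 1, 0], "label": ["correct", "correct", "incorrect"]},
    index=pd.Index(["<span_id_1>", "<span_id_2>", "<span_id_3>"], name="span_id"),
)

# Attach the scores to their spans so they appear in the Phoenix UI.
px.Client().log_evaluations(SpanEvaluations(eval_name="LLMJudge", dataframe=evals_df))
```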
The Phoenix UI will display your evaluation results with detailed breakdowns, making it easy to understand your AI application’s performance and identify areas for improvement.