Anthropic

The arize-phoenix-evals library uses an LLM as a judge to grade model output for hallucinations, factuality, helpfulness, toxicity, or custom rubrics. To use Anthropic Claude as the judge, pass provider="anthropic" to the LLM(...) wrapper, build an evaluator with create_classifier(...), and run it over a DataFrame with evaluate_dataframe(...).

Prerequisites

Python 3.11+
An ANTHROPIC_API_KEY from the Anthropic Console

Install

pip install arize-phoenix-evals anthropic pandas

Configure credentials

export ANTHROPIC_API_KEY="<your-anthropic-api-key>"
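If you want the script to fail fast with a readable message rather than a 401 partway through a batch, a small pre-flight check can help. This is a minimal sketch; require_anthropic_key is a hypothetical helper, not part of arize-phoenix-evals or the Anthropic SDK.

```python
import os


def require_anthropic_key(env=None):
    """Return the Anthropic API key, or raise a clear error if it is unset.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict when testing.
    """
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set; export it before creating the eval LLM."
        )
    return key
```

Calling require_anthropic_key() at the top of your script, before constructing the LLM, surfaces a missing key immediately.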

Set up the eval LLM

# eval_setup.py
from phoenix.evals import LLM

# `LLM(provider=..., model=...)` reads the appropriate provider key
# from the environment — ANTHROPIC_API_KEY for Anthropic.
llm = LLM(provider="anthropic", model="claude-sonnet-4-6")

Use claude-haiku-4-5 for a cheaper judge if you’re evaluating large batches; the judge’s job is classification, not generation, so a smaller model is often sufficient.
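One way to encode that choice is to pick the judge model from the batch size. The sketch below is illustrative only: pick_judge_model and the 1000-row threshold are assumptions, not a library feature or a recommended cutoff.

```python
def pick_judge_model(n_rows, large_batch_threshold=1000):
    """Choose a judge model name from the number of rows to evaluate.

    Hypothetical helper: the threshold is an arbitrary placeholder; tune it
    to your own cost and accuracy requirements.
    """
    if n_rows >= large_batch_threshold:
        return "claude-haiku-4-5"   # cheaper judge for large batches
    return "claude-sonnet-4-6"      # judge used in this guide's examples
```

You would then pass the result as the model argument, for example LLM(provider="anthropic", model=pick_judge_model(len(df))).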

Run an evaluation

This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.

# example.py
import pandas as pd

from phoenix.evals import LLM, create_classifier, evaluate_dataframe

llm = LLM(provider="anthropic", model="claude-sonnet-4-6")

HALLUCINATION_PROMPT = """\
Determine whether the answer below is factually supported by the
reference. Reply with exactly one of: factual, hallucinated.

Question: {input}
Answer: {output}
Reference: {reference}
"""

evaluator = create_classifier(
    name="hallucination",
    prompt_template=HALLUCINATION_PROMPT,
    llm=llm,
    # `choices` maps each label the LLM may emit to a numeric score.
    # `direction="maximize"` (the default) means higher score is better.
    choices={"factual": 1.0, "hallucinated": 0.0},
)

df = pd.DataFrame([
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
])

results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# `hallucination_score` is a Score row (a dict-like with `score`, `label`,
# `explanation`, …) — pull the numeric out for a flat display column.
results["score"] = results["hallucination_score"].apply(lambda r: r["score"])
print(results[["input", "output", "score"]].to_string())

Expected output

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

The full returned DataFrame also includes hallucination_execution_details (status, exceptions, and timing) and the original hallucination_score column, which holds each evaluator result’s full dict (name, score, label, explanation, metadata, kind, direction). These are useful for surfacing the LLM’s reasoning, persisting eval rows back to Arize AX, or deciding which rows to retry.
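To surface the label and explanation alongside the numeric score, you can flatten each result dict. This is a minimal sketch that assumes each cell is a dict with the keys listed above, and that a failed row may hold None; flatten_score is a hypothetical helper, and the sample cell is hand-written, not real judge output.

```python
def flatten_score(score):
    """Pick the most commonly needed fields out of one evaluator result dict.

    Assumption: a row whose evaluation failed may hold None, so missing
    values fall back to None rather than raising.
    """
    score = score or {}
    return {
        "score": score.get("score"),
        "label": score.get("label"),
        "explanation": score.get("explanation"),
    }


# Hand-written stand-in for one cell of the `hallucination_score` column.
example_cell = {
    "name": "hallucination",
    "score": 0.0,
    "label": "hallucinated",
    "explanation": "The reference says Paris, not Berlin.",
    "direction": "maximize",
}
flat = flatten_score(example_cell)
```

With pandas, applying this per row and expanding the result, as in results["hallucination_score"].apply(flatten_score).apply(pd.Series), should give one column per field.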

Troubleshooting

401 from Anthropic. Verify ANTHROPIC_API_KEY is set and has access to the model in the example. Generate a new key at console.anthropic.com.
model_not_found. Anthropic occasionally retires older model aliases. Swap claude-sonnet-4-6 for a current model from the Anthropic models list.
All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
Some rows fail with timeout / rate-limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle.
Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
Using Anthropic on AWS Bedrock instead. Switch to LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6") and set AWS credentials — see the Amazon Bedrock evals doc for the full pattern.
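For the "all rows return the same label" case, two quick checks can narrow things down before you rewrite the prompt: confirm that every placeholder in the template has a matching DataFrame column, and count the labels the judge actually returned. The sketch below uses only the standard library; the helper names are hypothetical, and it assumes curly-brace placeholders like those in the example prompt.

```python
from collections import Counter
from string import Formatter


def template_fields(template):
    """Return the set of {placeholder} names used in a prompt template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, columns):
    """List placeholders in the template that have no matching column."""
    return sorted(template_fields(template) - set(columns))


def label_counts(scores):
    """Count labels across result dicts, skipping rows with no result."""
    return Counter(s.get("label") for s in scores if s)


PROMPT = "Question: {input}\nAnswer: {output}\nReference: {reference}\n"
print(missing_columns(PROMPT, ["input", "output"]))  # → ['reference']
```

In practice you would pass list(df.columns) as the columns and results["hallucination_score"] as the scores; a single label dominating label_counts suggests the judge lacks enough context to discriminate.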

Resources

Phoenix Evals Documentation

arize-phoenix-evals on PyPI

Phoenix Evals Source

Anthropic Tracing (instrument app calls)