All eval templates are tested against golden data shipped as part of the LLM eval library's benchmark datasets, and target precision of 70-90% and F1 of 70-85%.
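As a rough illustration of how those targets are measured, the sketch below scores a binary eval template's outputs against golden labels and reports precision and F1. The label values and example records are hypothetical, not taken from the library.

```python
# Minimal sketch: scoring an eval template's binary labels against golden data.
# Label names ("factual" / "hallucinated") and the example records are illustrative.

def precision_and_f1(predicted, golden, positive="factual"):
    """Compute precision and F1 for the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(predicted, golden))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, golden))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, golden))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, f1

golden = ["factual", "hallucinated", "factual", "factual"]
predicted = ["factual", "factual", "factual", "hallucinated"]
precision, f1 = precision_and_f1(predicted, golden)
print(f"precision={precision:.2f}, f1={f1:.2f}")  # precision=0.67, f1=0.67
```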

1. Faithfulness Eval

Evaluates whether a response is faithfully grounded in the supplied context (a minimal usage sketch appears after this list).

Tested on: HaluEval QA Dataset, HaluEval RAG Dataset

2. Code Metrics

3. Q&A Eval

Question answering over private data.

Tested on: WikiQA

4. Retrieval Eval

Relevance of individually retrieved documents in a RAG pipeline.

Tested on: MS MARCO, WikiQA

5. Summarization Eval

Quality of generated summaries.

Tested on: Gigaword, CNNDM, XSum

6. Code Generation Eval

Correctness and readability of generated code.

Tested on: WikiSQL, HumanEval, CodeXGLUE

7. Toxicity Eval

9. Reference Link

10. User Frustration

12. Agent Function Calling
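For illustration, here is a minimal sketch of how an eval such as the Faithfulness Eval (item 1) can be run as an LLM-as-judge binary classifier over (question, context, answer) records. The prompt wording, the `llm_complete` callable, and the output labels are assumptions for this sketch, not the library's actual API or template.

```python
# Minimal sketch of an LLM-as-judge faithfulness check. It assumes a caller-supplied
# llm_complete(prompt) -> str function; the prompt text and labels are illustrative.

FAITHFULNESS_PROMPT = """You are checking whether an answer is faithful to the context.
Context: {context}
Question: {question}
Answer: {answer}
Reply with exactly one word: "factual" if the answer is fully supported by the context,
otherwise "hallucinated"."""

def faithfulness_label(llm_complete, question: str, context: str, answer: str) -> str:
    """Return 'factual' or 'hallucinated' for one (question, context, answer) record."""
    prompt = FAITHFULNESS_PROMPT.format(context=context, question=question, answer=answer)
    reply = llm_complete(prompt).strip().lower()
    return "factual" if reply.startswith("factual") else "hallucinated"

# Labels produced this way can be scored against HaluEval-style golden labels
# using the precision/F1 helper shown earlier.
```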