Log Evaluation Results

This guide shows how to send LLM evaluation results stored in dataframes to Phoenix.

An evaluation must have a name (e.g. "Q&A Correctness"), and its DataFrame must contain identifiers for the subject of evaluation, e.g. a span or a document (more on that below), plus values in at least one of the score, label, or explanation columns. See Evaluations for more information.
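
Only one of the three value columns is required. As a minimal sketch with pandas (the span ID below is a placeholder, not a real identifier):

import pandas as pd

# A minimal valid evaluation: an identifier (span_id) plus one value column (score).
# The span ID is a placeholder for illustration.
eval_df = pd.DataFrame(
    {"score": [1]},
    index=pd.Index(["5B8EF798A381"], name="span_id"),
)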

Connect to Phoenix

Initialize the Phoenix client to connect to your Phoenix instance:

from phoenix.client import Client

# Initialize client - automatically reads from environment variables:
# PHOENIX_BASE_URL and PHOENIX_API_KEY (if using Phoenix Cloud)
client = Client()

# Or explicitly configure for your Phoenix instance:
# client = Client(base_url="https://your-phoenix-instance.com", api_key="your-api-key")
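
If you prefer to configure through the environment, you can set the variables before constructing the client. A minimal sketch, using placeholder values:

import os

# Placeholder values; substitute your own instance URL and key.
os.environ["PHOENIX_BASE_URL"] = "https://your-phoenix-instance.com"
os.environ["PHOENIX_API_KEY"] = "your-api-key"  # only needed if your instance requires auth

from phoenix.client import Client

client = Client()  # reads the variables set above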

Span Evaluations

A dataframe of span evaluations would look similar to the table below. It must contain span_id as an index or as a column. Once ingested, Phoenix uses the span_id to associate the evaluation with its target span.

span_id      | label     | score | explanation
------------ | --------- | ----- | -----------------------
5B8EF798A381 | correct   | 1     | "this is correct ..."
E19B7EC3GG02 | incorrect | 0     | "this is incorrect ..."

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the annotation_name= parameter. In this case we name it "Q&A Correctness".

client.spans.log_span_annotations_dataframe(
    dataframe=qa_correctness_eval_df,
    annotation_name="Q&A Correctness",
    annotator_kind="LLM",
)

Document Evaluations

A dataframe of document evaluations would look something like the table below. It must contain span_id and document_position as either indices or columns. document_position is the document's (zero-based) index in the span's list of retrieved documents. Once ingested, Phoenix uses the span_id and document_position to associate the evaluation with its target span and document.

span_id      | document_position | label      | score | explanation
------------ | ----------------- | ---------- | ----- | -------------
5B8EF798A381 | 0                 | relevant   | 1     | "this is ..."
5B8EF798A381 | 1                 | irrelevant | 0     | "this is ..."
E19B7EC3GG02 | 0                 | relevant   | 1     | "this is ..."

The evaluations dataframe can be sent to Phoenix as follows. Note that the name of the evaluation must be supplied through the annotation_name= parameter. In this case we name it "Relevance".

client.spans.log_document_annotations_dataframe(
    dataframe=document_relevance_eval_df,
    annotation_name="Relevance",
    annotator_kind="LLM",
)

Logging Multiple Evaluation DataFrames

With the client, multiple sets of evaluations can be logged by making a separate function call for each.

client.spans.log_span_annotations_dataframe(
    dataframe=qa_correctness_eval_df,
    annotation_name="Q&A Correctness",
    annotator_kind="LLM",
)
client.spans.log_document_annotations_dataframe(
    dataframe=document_relevance_eval_df,
    annotation_name="Relevance",
    annotator_kind="LLM",
)
client.spans.log_span_annotations_dataframe(
    dataframe=hallucination_eval_df,
    annotation_name="Hallucination",
    annotator_kind="LLM",
)
# ... continue with additional evaluations as needed
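
If you have several span-level evaluation dataframes, you can also batch the calls in a loop; a small sketch, assuming each dataframe is indexed by span_id as in the examples above:

# Map each evaluation name to its dataframe, then log them all.
span_eval_dfs = {
    "Q&A Correctness": qa_correctness_eval_df,
    "Hallucination": hallucination_eval_df,
}
for name, df in span_eval_dfs.items():
    client.spans.log_span_annotations_dataframe(
        dataframe=df,
        annotation_name=name,
        annotator_kind="LLM",
    )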
