TypeScript evaluation library for LLM applications. This package is vendor agnostic and can be used independently of any framework or platform.
Note: This package is in alpha and subject to change.

Installation

npm install @arizeai/phoenix-evals

Usage

Creating Custom Classifiers

Create custom evaluators for tasks like faithfulness detection, relevance scoring, or any binary/multi-class classification:
import { createClassifier } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const promptTemplate = `
In this task, you will be presented with a query, some context, and a response. The response is
generated to answer the query based on the context. The response may contain false information. You
must use the context to determine whether the response contains false information, that is, whether
the response is unfaithful to the facts. Your objective is to determine whether the response is
factual and faithful to the context. An 'unfaithful' response is one that is not based on the
context or that assumes information not available in the context. Your answer should be a single
word, either "faithful" or "unfaithful", and it should not include any other text or characters.

    [BEGIN DATA]
    ************
    [Query]: {{input}}
    ************
    [Context]: {{context}}
    ************
    [Response]: {{output}}
    ************
    [END DATA]

Is the response above faithful or unfaithful based on the query and context?
`;

// Create the classifier
const evaluator = await createClassifier({
  model,
  choices: { faithful: 1, unfaithful: 0 },
  promptTemplate,
});

// Use the classifier
const result = await evaluator({
  output: "Arize is not open source.",
  input: "Is Arize Phoenix Open Source?",
  context: "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
});

console.log(result);
// Output: { label: "unfaithful", score: 0 }
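
Because the evaluator is awaited above, it returns a promise, so you can score many records with standard Promise utilities. A minimal sketch reusing the same field names; the records array here is illustrative:
// Score several records concurrently with the same classifier
const records = [
  {
    input: "Is Arize Phoenix Open Source?",
    context: "Arize Phoenix is open source.",
    output: "Yes, Phoenix is open source.",
  },
  // ... more records
];

const results = await Promise.all(records.map((record) => evaluator(record)));
console.log(results.map((r) => r.label));
// e.g. ["faithful", ...]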

Pre-Built Evaluators

The library includes several pre-built evaluators for common evaluation tasks. These evaluators come with optimized prompts and can be used directly with any AI SDK model.
import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { anthropic } from "@ai-sdk/anthropic";

const model = openai("gpt-4o-mini");
// or use any other AI SDK provider
// const model = anthropic("claude-3-haiku-20240307");

// Faithfulness Detection
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model,
});

// Use the evaluator
const result = await faithfulnessEvaluator({
  input: "What is the capital of France?",
  context: "France is a country in Europe. Paris is its capital city.",
  output: "The capital of France is London.",
});

console.log(result);
// Output: { label: "unfaithful", score: 0, explanation: "..." }
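
Because each result carries a label and a numeric score, evaluations are easy to gate on or aggregate. A small illustration using results shaped like the one above:
// Flag responses the evaluator judged unfaithful (illustrative threshold logic)
if (result.score === 0) {
  console.warn(`Unfaithful response detected: ${result.explanation}`);
}

// Compute a simple pass rate over a batch of results (the evaluations array is illustrative)
const evaluations = [result];
const passRate =
  evaluations.filter((e) => e.label === "faithful").length / evaluations.length;
console.log(`Faithfulness pass rate: ${passRate * 100}%`);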

Experimentation with Phoenix

This package works seamlessly with @arizeai/phoenix-client to enable experimentation workflows. You can create datasets, run experiments, and trace evaluation calls for analysis and debugging.

Running Experiments

npm install @arizeai/phoenix-client

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { createDataset } from "@arizeai/phoenix-client/datasets";
import { asEvaluator, runExperiment } from "@arizeai/phoenix-client/experiments";

// Create your evaluator
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o-mini"),
});

// Create a dataset for your experiment
const dataset = await createDataset({
  name: "faithfulness-eval",
  description: "Evaluate the faithfulness of the model",
  examples: [
    {
      input: {
        question: "Is Phoenix Open-Source?",
        context: "Phoenix is Open-Source.",
      },
    },
    // ... more examples
  ],
});

// Define your experimental task
const task = async (example) => {
  // Your AI system's response to the question (stubbed here; see the sketch after this example)
  return "Phoenix is not Open-Source";
};

// Create a custom evaluator to validate results
const faithfulnessCheck = asEvaluator({
  name: "faithfulness",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Use the faithfulness evaluator from phoenix-evals
    const result = await faithfulnessEvaluator({
      input: input.question,
      context: input.context,
      output: output,
    });

    return result; // Return the evaluation result
  },
});

// Run the experiment with automatic tracing
await runExperiment({
  experimentName: "faithfulness-eval",
  experimentDescription: "Evaluate the faithfulness of the model",
  dataset: dataset,
  task,
  evaluators: [faithfulnessCheck],
});
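
The task above returns a fixed string for illustration. In a real experiment it would call your application or a model. A minimal sketch using the AI SDK's generateText, assuming the input field names from the dataset above; the prompt construction is illustrative:
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical task that answers each question using only its context
const llmTask = async (example) => {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    prompt: `Answer the question using only the context below.\n\nContext: ${example.input.context}\n\nQuestion: ${example.input.question}`,
  });
  return text;
};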