Why use Code Evals?

When your evaluation criteria are deterministic and clear, code-based evaluators provide a consistent and efficient way to assess results. They are useful when you need to check for objective conditions, such as whether a keyword appears, a URL is valid, or a format follows a rule. Arize offers off-the-shelf code evaluators for common evaluation tasks. When you need more control, you can create custom evaluators that align with your unique business logic or quality criteria.
Code evaluators are defined inline when creating a task. Reusable code evaluators in the Eval Hub are coming soon.

Create a Code Eval

To create a code evaluator, choose Code Evaluator when creating a new Evaluation Task. The evaluator can then be created in three steps:
  1. Name the task and define the data it will run on.
    • Sampling Rate (%): the percentage of data the task should run on (0–100). Sampling is applied at the highest evaluator scope in the task (session > trace > span); lower-level evaluators run on all matching data within the sampled set.
    • Task Filters: specify the data this task will run on. Filters match spans, or traces/sessions that contain matching spans.
    • When running on historical data, the maximum number of items is based on the highest eval scope.
  2. Provide a unique Eval Column Name for the evaluator in plaintext. Ensure the name is distinct from other evaluators across all tasks. Here, you can also set Evaluator Scope and Filters.
  3. Define any required parameters for the selected Code Evaluator.
Arize Managed Code Evals

Arize manages a set of off-the-shelf code evaluators on your behalf. Simply select the evaluator name from a drop-down and the evaluator code will be provided. Users can customize the evaluators by specifying the arguments that should be passed in as parameters. Currently, we support all of the evaluators below and new evaluators can be added upon request.
Matches Regex: checks whether the text matches a specified regex pattern.
  • span attribute: which span attribute to evaluate
  • pattern: the regex pattern matched against the span attribute value
JSON Parseable: checks whether the LLM data is a valid JSON-parsable string.
  • span attribute: which span attribute to evaluate
Contains any Keyword: checks whether any of the specified keywords are present in the LLM data.
  • span attribute: which span attribute to evaluate
  • keywords: a list of keyword strings to search for in the span attribute; if any keyword matches, the evaluator flags the data as a match
Contains all Keywords: checks whether all of the specified keywords are present in the LLM data.
  • span attribute: which span attribute to evaluate
  • keywords: a list of keyword strings to search for in the span attribute; every keyword must be present for the evaluator to flag the data as a match
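The checks behind these managed evaluators amount to a few lines of standard-library Python each. The sketch below illustrates the logic (the function names are ours, not the Arize implementation):

```python
import json
import re

def matches_regex(value: str, pattern: str) -> bool:
    # True when the regex pattern matches anywhere in the attribute value
    return re.search(pattern, value) is not None

def json_parseable(value: str) -> bool:
    # True when the attribute value is a valid JSON string
    try:
        json.loads(value)
        return True
    except (TypeError, json.JSONDecodeError):
        return False

def contains_any_keyword(value: str, keywords: list[str]) -> bool:
    # Match if at least one keyword appears in the value
    return any(k in value for k in keywords)

def contains_all_keywords(value: str, keywords: list[str]) -> bool:
    # Match only if every keyword appears in the value
    return all(k in value for k in keywords)
```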

Custom Code Evaluators

Custom Code Evaluators are only available in Arize AX Enterprise. Request a demo here.
Custom Code Evals allow you to define your own evaluation logic in Python (with JavaScript coming soon) to score and label LLM traces based on span attributes. This is ideal for use cases that require highly customized and deterministic rules—such as business logic validation, structured output parsing, or expected keyword presence. Once you select CustomArizeEvaluator from the “Select an Eval” drop-down, you’ll define the logic in the right-hand panel of the task creation interface.
Writing a Custom Code Evaluator

A custom code evaluator is a Python class that extends CodeEvaluator and implements a single evaluate method. Below is a complete example followed by a step-by-step breakdown.

Complete Example

This evaluator checks whether the LLM output contains the word “hello”:
# Note: This example uses Python SDK v7
from typing import Any, Mapping, Optional
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    CodeEvaluator,
    JSONSerializable,
)

class ContainsHelloEvaluator(CodeEvaluator):
    def evaluate(
        self,
        *,
        dataset_row: Optional[Mapping[str, JSONSerializable]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        output = dataset_row.get("attributes.output.value") if dataset_row else None
        text = str(output or "").lower()

        if "hello" in text:
            return EvaluationResult(
                label="pass",
                score=1.0,
                explanation="Output contains 'hello'"
            )

        return EvaluationResult(
            label="fail",
            score=0.0,
            explanation="Output does not contain 'hello'"
        )

Step-by-Step Breakdown

Step 1: Imports
Every custom evaluator needs these three imports from the Arize SDK:
# Note: This example uses Python SDK v7
from typing import Any, Mapping, Optional
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,  # The object you return with your eval results
    CodeEvaluator,     # The base class your evaluator extends
    JSONSerializable,  # Type alias used in the method signature
)
You can also import any of the following supported packages:
numpy, pandas, scipy, pyarrow, pydantic, jellyfish
If you need an additional package, contact the customer support team.

Step 2: Define your evaluator class
Create a class that extends CodeEvaluator and implements the evaluate method. The method signature must match exactly:
class MyEvaluator(CodeEvaluator):
    def evaluate(
        self,
        *,
        dataset_row: Optional[Mapping[str, JSONSerializable]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
  • The * forces all arguments to be keyword-only.
  • dataset_row is a dictionary containing the span attributes for the row being evaluated.
  • The method must return an EvaluationResult.
Step 3: Read span data from dataset_row
The dataset_row dictionary contains span attributes. Common keys include:
  • attributes.output.value: the LLM output text
  • attributes.input.value: the LLM input text
  • attributes.llm.token_count.total: the total token count
Access values using .get() to handle missing keys gracefully:
output = dataset_row.get("attributes.output.value") if dataset_row else None
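As an illustration, the snippet below builds a sample dataset_row by hand (the attribute values are made up) and reads keys defensively the same way an evaluator body would:

```python
# A sample row for illustration; real rows are supplied by Arize at runtime.
row = {
    "attributes.input.value": "What is the capital of France?",
    "attributes.output.value": "Paris",
    "attributes.llm.token_count.total": 42,
}

# .get() returns None for missing keys instead of raising KeyError,
# and a default can stand in for absent attributes.
output = row.get("attributes.output.value") or ""
tokens = row.get("attributes.llm.token_count.total", 0)

print(output, tokens)
```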
Step 4: Return an EvaluationResult
Your evaluate method must return an EvaluationResult with these fields:
  • label (str): a category label (e.g., "pass", "fail", "error")
  • score (Optional[float]): a numeric score (e.g., 1.0 for pass, 0.0 for fail), or None
  • explanation (str): a human-readable explanation of the result
return EvaluationResult(
    label="pass",
    score=1.0,
    explanation="Output meets the criteria"
)
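A common pattern is to reserve a third label, such as "error", for rows the evaluator cannot score. The sketch below is hypothetical (the token budget and function name are invented) and uses a plain dataclass as a stand-in for the SDK's EvaluationResult so it runs without Arize installed:

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for Arize's EvaluationResult so this sketch runs without the SDK.
@dataclass
class EvaluationResult:
    label: str
    score: Optional[float]
    explanation: str

def evaluate_token_budget(row: dict, max_tokens: int = 500) -> EvaluationResult:
    tokens = row.get("attributes.llm.token_count.total")
    if tokens is None:
        # Reserve a distinct label for rows that cannot be scored
        return EvaluationResult("error", None, "Token count missing from span")
    if tokens <= max_tokens:
        return EvaluationResult("pass", 1.0, f"{tokens} tokens is within budget")
    return EvaluationResult("fail", 0.0, f"{tokens} tokens exceeds {max_tokens}")
```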

Testing Locally

While it’s possible to write the code directly in the UI, it’s typically easier to iterate in a Python script or Colab notebook. Use the Test in Code button in the task creation interface to get starter code that includes your span attributes, evaluator class, and imports.
Once you’re seeing the desired results locally, copy-paste your updated evaluator class, span attributes, and imports into the UI. You’re now ready to kick off your task!
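A minimal local harness can look like this: hand-write a few sample rows, run your evaluation logic over them, and check the labels before pasting the code into the UI. The logic below mirrors the "hello" example but is simplified to a plain function so the snippet runs without the Arize SDK:

```python
# Minimal local harness: feed hand-written rows to the evaluation logic.
# The rows are illustrative; copy real attribute values from Test in Code.
def contains_hello(row: dict) -> str:
    text = str(row.get("attributes.output.value") or "").lower()
    return "pass" if "hello" in text else "fail"

sample_rows = [
    {"attributes.output.value": "Hello there!"},
    {"attributes.output.value": "Goodbye."},
    {},  # a missing attribute should not crash the evaluator
]

labels = [contains_hello(row) for row in sample_rows]
print(labels)  # → ['pass', 'fail', 'fail']
```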