When your dataset contains labeled ground truth and your experiments produce categorical predictions, you can measure experiment quality with standard binary classification metrics directly in the Arize UI. Select a positive class, and Arize computes metrics by comparing each predicted value against the ground truth — no evaluator code required.

Prerequisites

Before configuring classification metrics, make sure:
  • Your dataset has a column with categorical ground truth labels (e.g., expected_category, true_label).
  • Your experiments produce an output column with predicted labels (e.g., output, predicted_label).
  • Column values are clean categorical strings. Metrics are computed by exact match against the selected positive class.
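Because metrics are computed by exact string match, stray whitespace or inconsistent casing in labels will register as mismatches. A minimal sketch of cleaning labels before upload, assuming row dictionaries and the illustrative column names used above:

```python
# Sketch: normalize label strings so exact-match comparison behaves
# as expected. Column names here are illustrative, not required.

def normalize_labels(rows, columns=("expected_category", "output")):
    """Strip whitespace and lowercase the given label columns in place."""
    for row in rows:
        for col in columns:
            value = row.get(col)
            if isinstance(value, str):
                row[col] = value.strip().lower()
    return rows

rows = [{"expected_category": " Spam ", "output": "SPAM"}]
print(normalize_labels(rows))
# [{'expected_category': 'spam', 'output': 'spam'}]
```

Run a pass like this over both the dataset and the experiment outputs so the two columns agree on casing and whitespace.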

Configure metrics settings

1. Open Metrics Settings. Navigate to the experiments page for your dataset and click Metrics Settings.
2. Select Ground Truth Column. Choose the column from your dataset that contains the true labels. The dropdown shows all columns available in the dataset version.
3. Select Predicted Column. Choose the column from your experiment outputs that contains the predicted labels. The dropdown shows columns available across the selected experiments.
4. Select Positive Class. Pick the value that represents the positive class for binary metric computation. The dropdown is populated with distinct values from the ground truth column you selected. Note that changing the ground truth column resets the positive class selection.
5. Click Done. Metrics are computed automatically for every experiment on the dataset.

Metrics computed

Arize computes the following binary classification metrics using the selected positive class. Rows where either the ground truth or predicted value is null are excluded.
Metric    | Formula                     | Description
Accuracy  | (TP + TN) / Total           | Fraction of predictions that match the ground truth
Precision | TP / (TP + FP)              | Of all positive predictions, how many are correct
Recall    | TP / (TP + FN)              | Of all actual positives, how many were predicted
F1        | 2 · TP / (2 · TP + FP + FN) | Harmonic mean of Precision and Recall
Where:
  • TP (True Positive) — predicted and ground truth both match the positive class
  • TN (True Negative) — neither predicted nor ground truth match the positive class
  • FP (False Positive) — predicted matches the positive class, ground truth does not
  • FN (False Negative) — ground truth matches the positive class, predicted does not
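The definitions above can be reproduced locally if you want to sanity-check the numbers Arize reports. A minimal sketch, assuming two parallel lists of labels where None marks a missing value (excluded, as Arize excludes null rows):

```python
# Sketch: binary classification metrics from a selected positive class.
# Rows where either label is None are excluded before counting.

def binary_metrics(ground_truth, predicted, positive_class):
    pairs = [(g, p) for g, p in zip(ground_truth, predicted)
             if g is not None and p is not None]
    tp = sum(1 for g, p in pairs if g == positive_class and p == positive_class)
    tn = sum(1 for g, p in pairs if g != positive_class and p != positive_class)
    fp = sum(1 for g, p in pairs if g != positive_class and p == positive_class)
    fn = sum(1 for g, p in pairs if g == positive_class and p != positive_class)
    total = len(pairs)
    return {
        "accuracy":  (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall":    tp / (tp + fn) if (tp + fn) else 0.0,
        "f1":        2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0,
    }

truth = ["spam", "ham", "spam", "ham", None]
preds = ["spam", "spam", "ham", "ham", "spam"]
print(binary_metrics(truth, preds, positive_class="spam"))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Note that the F1 formula here is the same simplification shown in the table: 2 · TP / (2 · TP + FP + FN) equals the harmonic mean of precision and recall whenever both are defined.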

Viewing results

Once configured, classification metrics appear per experiment in the experiments view. Use the charting view to visualize how metrics compare across experiments, or inspect values in the table headers for a detailed breakdown.
Settings persist per dataset version, so you only need to configure them once. You can compare the same metrics across multiple experiments simultaneously.