When your dataset contains labeled ground truth and your experiments produce categorical predictions, you can measure experiment quality with standard binary classification metrics directly in the Arize UI. Select a positive class, and Arize computes metrics by comparing each predicted value against the ground truth — no evaluator code required.

Prerequisites

Before configuring classification metrics, make sure:
  • Your dataset has a column with categorical ground truth labels (e.g., expected_category, true_label).
  • Your experiments produce an output column with predicted labels (e.g., output, predicted_label).
  • Column values are clean categorical strings. Metrics are computed by exact match against the selected positive class.
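Because metrics are computed by exact string match, stray whitespace or inconsistent casing in labels will register as mismatches. A minimal sketch of cleaning labels before upload, assuming row dictionaries and the illustrative column names used above:

```python
# Sketch: normalize label strings so exact-match comparison behaves
# as expected. Column names here are illustrative, not required.

def normalize_labels(rows, columns=("expected_category", "output")):
    """Strip whitespace and lowercase the given label columns in place."""
    for row in rows:
        for col in columns:
            value = row.get(col)
            if isinstance(value, str):
                row[col] = value.strip().lower()
    return rows

rows = [{"expected_category": " Spam ", "output": "SPAM"}]
print(normalize_labels(rows))
# [{'expected_category': 'spam', 'output': 'spam'}]
```

Run a pass like this over both the dataset and the experiment outputs so the two columns agree on casing and whitespace.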

Configure metrics settings

1. Open Metrics Settings. Navigate to the experiments page for your dataset and click Metrics Settings.
2. Select Ground Truth Column. Choose the column from your dataset that contains the true labels. The dropdown shows all columns available in the dataset version.
3. Select Predicted Column. Choose the column from your experiment outputs that contains the predicted labels. The dropdown shows columns available across the selected experiments.
4. Select Positive Class. Pick the value that represents the positive class for binary metric computation. The dropdown is populated with distinct values from the ground truth column you selected. Note that changing the ground truth column resets the positive class selection.
5. Click Done. Metrics are computed automatically for every experiment on the dataset.

Metrics computed

Arize computes the following binary classification metrics using the selected positive class. Rows where either the ground truth or predicted value is null are excluded.
Metric    | Formula                     | Description
Accuracy  | (TP + TN) / Total           | Fraction of predictions that match the ground truth
Precision | TP / (TP + FP)              | Of all positive predictions, how many are correct
Recall    | TP / (TP + FN)              | Of all actual positives, how many were predicted
F1        | 2 · TP / (2 · TP + FP + FN) | Harmonic mean of Precision and Recall
Where:
  • TP (True Positive) — predicted and ground truth both match the positive class
  • TN (True Negative) — neither predicted nor ground truth match the positive class
  • FP (False Positive) — predicted matches the positive class, ground truth does not
  • FN (False Negative) — ground truth matches the positive class, predicted does not
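The definitions above can be reproduced locally if you want to sanity-check the numbers Arize reports. A minimal sketch, assuming two parallel lists of labels where None marks a missing value (excluded, as Arize excludes null rows):

```python
# Sketch: binary classification metrics from a selected positive class.
# Rows where either label is None are excluded before counting.

def binary_metrics(ground_truth, predicted, positive_class):
    pairs = [(g, p) for g, p in zip(ground_truth, predicted)
             if g is not None and p is not None]
    tp = sum(1 for g, p in pairs if g == positive_class and p == positive_class)
    tn = sum(1 for g, p in pairs if g != positive_class and p != positive_class)
    fp = sum(1 for g, p in pairs if g != positive_class and p == positive_class)
    fn = sum(1 for g, p in pairs if g == positive_class and p != positive_class)
    total = len(pairs)
    return {
        "accuracy":  (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall":    tp / (tp + fn) if (tp + fn) else 0.0,
        "f1":        2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0,
    }

truth = ["spam", "ham", "spam", "ham", None]
preds = ["spam", "spam", "ham", "ham", "spam"]
print(binary_metrics(truth, preds, positive_class="spam"))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Note that the F1 formula here is the same simplification shown in the table: 2 · TP / (2 · TP + FP + FN) equals the harmonic mean of precision and recall whenever both are defined.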

Viewing results

Once configured, classification metrics appear per experiment in the experiments view. Use the charting view to visualize how metrics compare across experiments, or inspect values in the table headers for a detailed breakdown.
Settings persist per dataset version, so you only need to configure them once. You can compare the same metrics across multiple experiments simultaneously.