You can see your traces and understand agent behavior. But how do you measure what matters to your specific app? The built-in metrics (latency, token count, error rate) are a start, but they won’t tell you your cost per query, what percentage of responses are hallucinated, or how many distinct sessions you’re handling. Custom metrics let you define exactly what you need to track. Each one is a single query in Arize Query Language (AQL) — a SQL-like language — that returns one number, which then shows up in your dashboards and monitors.

How to set up custom metrics

Open your project, click the ... menu in the top right, and select Custom Metrics. From there, click New Custom Metric. You have three ways to create one: ask Alyx to recommend a metric, describe one in plain English for Alyx to translate into AQL, or write the AQL yourself in the SQL Query box.
Ask for a recommendation if you're not sure what to track:
  • “What metrics should I set up for this project?”
  • “Suggest a quality metric based on the evals I’m running”
Describe a metric in plain English if you already know what you want:
  • “Count responses flagged as hallucinated”
  • “Percentage of retriever spans where Relevance is ‘relevant’”
  • “Total token cost divided by number of distinct sessions”
  • “A weighted score: 25% QA correctness plus 75% relevance”
Alyx drops the AQL into the SQL Query box. Edit, name, save.
[Screenshot: Creating a custom metric with Alyx — the natural language prompt on the right generates AQL in the SQL Query box]

Common patterns

Wondering what to put in the SQL Query box? Most custom metrics fall into one of five shapes. Each section below has a template you can adapt plus a concrete example. Every example starts with a -- description comment, which is the format Alyx generates and a useful habit for readability.

1. Count of an eval label

How many times did a specific label occur? Pair COUNT(*) with a FILTER (WHERE ...) clause.
SELECT COUNT(*) FILTER (WHERE "eval.<NAME>.label" = '<LABEL>')
FROM model
Example — count of hallucinated responses
-- Count of responses flagged as hallucinated
SELECT COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated')
FROM model

2. Rate (% of total)

What percentage of traffic has a given eval label, status, or attribute? Divide a filtered count by an unfiltered count.
SELECT
    COUNT(*) FILTER (WHERE <condition>) /
    COUNT(*) FILTER (WHERE <denominator scope>) * 100
FROM model
Example — hallucination rate
-- Hallucination rate: % of evaluated responses flagged as hallucinated
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model
Filter the denominator to != null so spans without an eval don’t inflate your total.
Example — error rate for a specific tool
-- Error rate for the fetch_inventory tool
SELECT
    COUNT(*) FILTER (WHERE "status_code" = 'ERROR') /
    COUNT(*) * 100
FROM model
WHERE "attributes.tool.name" = 'fetch_inventory'

3. Weighted composite score

Combine multiple eval scores into one number — useful when quality depends on several dimensions (e.g., correctness and relevance).
SELECT
    <weight_1> * AVG("eval.<NAME_1>.score") +
    <weight_2> * AVG("eval.<NAME_2>.score")
FROM model
Example — 25% QA correctness + 75% relevance
-- Weighted quality: 25% QA correctness + 75% relevance
SELECT
    0.25 * AVG("eval.QA_Correctness.score") +
    0.75 * AVG("eval.Relevance.score")
FROM model
Weights should sum to 1. Scores must be numeric — use .score, not .label.

4. Cost and usage

Turn token counts into dollars, count distinct sessions, and roll them up together or separately.
Total cost
-- Total token cost across all LLM calls
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
)
FROM model
Replace 2.5 and 10 with your model's per-million input/output token rates.
Distinct sessions
-- Count of distinct sessions
SELECT APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
Cost per session
-- Total token cost divided by number of distinct sessions
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
) / APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
Evaluation cost estimate — estimate how much it costs to run an eval over your traces
-- Eval cost: inputs + outputs + eval prompt template + eval output
SELECT
    (SUM("attributes.llm.token_count.prompt" + "attributes.llm.token_count.completion" + 229) * 0.00000025) +
    (COUNT(*) * (104 * 0.00001))
FROM model
Replace 229 with your eval prompt template’s token length, 104 with the estimated output token length, and the rates with your eval model’s per-token input/output costs.

5. Eval quality against human labels

When you have both auto-evals and human annotations, measure how well the eval matches the human ground truth.
SELECT
    PRECISION(
        predicted = "eval.<EVAL_NAME>.label",
        actual = "annotation.<ANNOTATION_NAME>.label",
        pos_class = '<LABEL>'
    )
FROM model
Example — precision of a QA correctness eval
-- Precision of QA correctness eval vs. human annotation
SELECT
    PRECISION(
        predicted = "eval.QA_Correctness_Eval.label",
        actual = "annotation.Judge Correctness.label",
        pos_class = 'correct'
    )
FROM model
RECALL(...), F1(...), TP(...), FP(...) follow the same signature. See Metric functions below for the full catalog.

AQL syntax

The patterns above cover most cases. When you need to adapt or extend them, this is the full AQL vocabulary. A metric is a single query that aggregates rows from model (the span table). You pick your dimensions, choose an aggregate or metric function, scope which rows count with WHERE or FILTER, and use CASE or operators inside expressions. Each subsection below walks one of those pieces.

Query shape

Every metric starts with this template. SELECT must return one value — there is no GROUP BY in AQL. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate.
-- Template
SELECT
  < AGG_FUNC | METRIC_FUNC >(exprs)
  [FILTER (WHERE exprs)]
  [CASE WHEN ... END]
  [ <OPERATOR> { < AGG_FUNC | ARIZE_METRIC >(exprs) | constant }] …
FROM model
[WHERE (exprs)]
-- Example — hallucination rate
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model

Constants, expressions, and comments

The general building blocks you use inside a query.
  • Constants — numbers (int or float) or strings in single quotes ('active').
  • Expressions — math/boolean operators combining dimensions and constants; no aggregate inside. Can nest: A * (B + C).
  • Comments — single line with -- (CMD+/) or multi-line with /* */.
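
Putting these pieces together, a query can mix constants and a nested expression inside an aggregate. This sketch averages two eval scores with an equal-weight constant (the eval names and 0.5 weight are illustrative):
-- Illustrative: equal-weight average of two eval scores
SELECT AVG(0.5 * ("eval.QA_Correctness.score" + "eval.Relevance.score"))
FROM model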

Dimensions

A dimension is any span property, feature, tag, prediction, or actual — string or numeric. Wrap any column with dots or spaces in double quotes. Here are the ones you’ll reference most on trace data:
| Column | What it captures |
| --- | --- |
| "status_code" | OK or ERROR on each span |
| "attributes.openinference.span.kind" | LLM, TOOL, RETRIEVER, CHAIN, AGENT, EMBEDDING |
| "attributes.llm.token_count.prompt" / .completion | Token counts per LLM span |
| "attributes.llm.model_name" | Model used (e.g. gpt-4o-mini) |
| "attributes.tool.name" | Name of the tool called on a TOOL span |
| "attributes.session.id" | Session identifier |
| "eval.<name>.label" / .score | Eval results |
| "annotation.<name>.label" | Human labels |

Aggregation functions

Reduce many rows to one number. Every metric must have at least one aggregation or metric function.
| Function | Description | Type |
| --- | --- | --- |
| COUNT(*) | Counts the number of rows | n/a |
| APPROX_COUNT_DISTINCT(exprs) | Counts the unique values of exprs | String |
| SUM(exprs) | Sums the value across rows | Numeric |
| AVG(exprs) | Averages the value across rows | Numeric |
| APPROX_QUANTILE(exprs, quantile=<decimal>) | Approximate quantile; quantile must be between 0 and 1 inclusive | Numeric |
| MIN(exprs) | Minimum across rows | Numeric |
| MAX(exprs) | Maximum across rows | Numeric |
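
As one example, APPROX_QUANTILE gives you a percentile without exporting data — here a p95 over completion token counts:
-- p95 completion token count per LLM span
SELECT APPROX_QUANTILE("attributes.llm.token_count.completion", quantile=0.95)
FROM model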

Metric functions

Beyond basic aggregates, AQL ships pre-built statistical metrics. All take keyword arguments. Omit pos_class= to use the model’s configured positive class. See the concrete example in Common patterns #5.
| Function | Signature | Returns |
| --- | --- | --- |
| PRECISION | PRECISION(actual=<col>, predicted=<col>, pos_class='<label>') | Precision for the positive class |
| RECALL | RECALL(actual=<col>, predicted=<col>, pos_class='<label>') | Recall for the positive class |
| F1 | F1(actual=<col>, predicted=<col>, pos_class='<label>') | F1 score (harmonic mean of precision and recall) |
| F_BETA | F_BETA(actual=<col>, predicted=<col>, pos_class='<label>', beta=<n>) | F-score weighted by beta (default 1). beta=0 → precision; beta→∞ → recall |
| TP / FP / TN / FN | Same signature as PRECISION | True / false positive / negative rate |
| ACCURACY | ACCURACY(actual=<col>, predicted=<col>) | Overall accuracy |
| LOG_LOSS | LOG_LOSS(actual=<string col>, predicted=<numeric col>, pos_class='<label>') | Log loss |
| AUC | AUC(actual=<col>, predicted=<col>) | ROC AUC |
| MAX_PRECISION | MAX_PRECISION(actual=<col>, predicted=<col>, pos_class='<label>', group_by_column='<col>') | Groups by group_by_column, keeps highest-score row per group, computes precision |
The regression metrics below all share the signature FUNC(actual=<col>, predicted=<col>).
| Function | Returns |
| --- | --- |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MSE | Mean squared error |
| RMSE | Root mean squared error |
NDCG(ranking_relevance=<col>, prediction_group_id=<col>, rank=<col>, omit_zero_relevance=True, k=10)
Normalized Discounted Cumulative Gain. omit_zero_relevance controls whether 0-relevance rows affect averaging.
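
As a sketch of the classification functions beyond precision, ACCURACY drops pos_class because it scores all classes at once (the column names reuse the eval and annotation from pattern 5 and are illustrative):
-- Overall agreement between the eval and the human annotation
SELECT ACCURACY(
    actual = "annotation.Judge Correctness.label",
    predicted = "eval.QA_Correctness_Eval.label"
)
FROM model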

Filters — WHERE and FILTER

Scope which rows the aggregate sees. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate. Identical syntax inside. Subqueries are not supported in either.
-- WHERE: every row the query sees must match
SELECT COUNT(*)
FROM model
WHERE "attributes.openinference.span.kind" = 'LLM'

-- FILTER: the filter only applies to the preceding COUNT
SELECT
    COUNT(*) FILTER (WHERE "eval.Relevance.label" = 'relevant') /
    COUNT(*) * 100
FROM model

Conditionals — CASE

Classify rows into buckets before aggregating. Conditions are evaluated in order; the first true branch wins. ELSE supplies a default; with no ELSE and no matching branch, the CASE returns NULL.
-- Example — count errors by bucketing spans into 1/0
SELECT SUM(CASE WHEN "status_code" = 'ERROR' THEN 1 ELSE 0 END)
FROM model
-- Searched CASE
CASE
  WHEN exprs THEN result1
  [ WHEN exprs THEN result2 ]
  [ ELSE default ]
END

-- Simple CASE — equality against a common expression
CASE expression
  WHEN value THEN result
  [WHEN ...]
  [ELSE result]
END
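
A simple CASE can, for instance, turn a string label into a number you can average (the eval name and label values are illustrative):
-- % of results labeled relevant, via simple CASE
SELECT AVG(CASE "eval.Relevance.label" WHEN 'relevant' THEN 1.0 ELSE 0.0 END) * 100
FROM model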

Operators

Math and comparison building blocks used inside expressions. Numeric — apply to numeric dimensions only.
| Operator | Description |
| --- | --- |
| A + B / A - B / A * B / A / B | Arithmetic (integer division is cast to FLOAT) |
| ABS(A) | Absolute value |
| CEIL / FLOOR | Round up / down to the nearest integer |
| COS, SIN, TAN, TANH | Trigonometry |
| LN, LOG / LOG10 | Natural log (base e) / log base 10 |
| SQRT / CBRT | Square root / cube root |
| GREATEST / LEAST | Greatest / least of the arguments; NULL if any argument is NULL |
Comparison — apply to strings and numerics.
| Operator | Description |
| --- | --- |
| A = B / A <> B / A != B | Equal / not equal |
| A > B / A >= B / A < B / A <= B | Ordered comparisons |
| A IS NULL / A IS NOT NULL | Null checks |
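
Operators compose freely inside aggregates. As one sketch, ABS measures the per-span gap between two eval scores before averaging (the eval names are illustrative):
-- Average absolute gap between two eval scores
SELECT AVG(ABS("eval.QA_Correctness.score" - "eval.Relevance.score"))
FROM model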

Next step

You’ve defined what to measure. Now put those metrics on a dashboard so you can see them at a glance:

Next: Set Up Dashboards