You can see your traces and understand agent behavior. But how do you measure what matters to your specific app? The built-in metrics (latency, token count, error rate) are a start, but they won’t tell you your cost per query, what percentage of responses are hallucinated, or how many distinct sessions you’re handling. Custom metrics let you define exactly what you need to track. Each one is a single query in Arize Query Language (AQL) — a SQL-like language — that returns one number, which then shows up in your dashboards and monitors.

How to set up custom metrics

Open your project, click the ... menu in the top right, and select Custom Metrics. From there, click New Custom Metric. You have three ways to create one: ask Alyx to recommend a metric, describe one in plain English for Alyx to translate into AQL, or write the AQL yourself in the SQL Query box.
Ask for a recommendation if you're not sure what to track:
  • “What metrics should I set up for this project?”
  • “Suggest a quality metric based on the evals I’m running”
Describe a metric in plain English if you already know what you want:
  • “Count responses flagged as hallucinated”
  • “Percentage of retriever spans where Relevance is ‘relevant’”
  • “Total token cost divided by number of distinct sessions”
  • “A weighted score: 25% QA correctness plus 75% relevance”
Alyx drops the AQL into the SQL Query box. Edit, name, save.
[Screenshot: Creating a custom metric with Alyx — the natural language prompt on the right generates AQL in the SQL Query box]

Common patterns

Wondering what to put in the SQL Query box? Most custom metrics fall into one of five shapes. Each section below has a template you can adapt plus a concrete example. Every example starts with a -- description comment, which is the format Alyx generates and a useful habit for readability.

1. Count of an eval label

How many times did a specific label occur? Pair COUNT(*) with a FILTER (WHERE ...) clause.
SELECT COUNT(*) FILTER (WHERE "eval.<NAME>.label" = '<LABEL>')
FROM model
Example — count of hallucinated responses
-- Count of responses flagged as hallucinated
SELECT COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated')
FROM model

2. Rate (% of total)

What percentage of traffic has a given eval label, status, or attribute? Divide a filtered count by an unfiltered count.
SELECT
    COUNT(*) FILTER (WHERE <condition>) /
    COUNT(*) FILTER (WHERE <denominator scope>) * 100
FROM model
Example — hallucination rate
-- Hallucination rate: % of evaluated responses flagged as hallucinated
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model
Filter the denominator to != null so spans without an eval don’t inflate your total.
Example — error rate for a specific tool
-- Error rate for the fetch_inventory tool
SELECT
    COUNT(*) FILTER (WHERE "status_code" = 'ERROR') /
    COUNT(*) * 100
FROM model
WHERE "attributes.tool.name" = 'fetch_inventory'

3. Weighted composite score

Combine multiple eval scores into one number — useful when quality depends on several dimensions (e.g., correctness and relevance).
SELECT
    <weight_1> * AVG("eval.<NAME_1>.score") +
    <weight_2> * AVG("eval.<NAME_2>.score")
FROM model
Example — 25% QA correctness + 75% relevance
-- Weighted quality: 25% QA correctness + 75% relevance
SELECT
    0.25 * AVG("eval.QA_Correctness.score") +
    0.75 * AVG("eval.Relevance.score")
FROM model
Weights should sum to 1. Scores must be numeric — use .score, not .label.

4. Cost and usage

Turn token counts into dollars, count distinct sessions, and roll them up together or separately.
Total cost
-- Total token cost across all LLM calls
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
)
FROM model
Replace 2.5 and 10 with your model's per-million input/output token rates.
Distinct sessions
-- Count of distinct sessions
SELECT APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
Cost per session
-- Total token cost divided by number of distinct sessions
SELECT SUM(
    "attributes.llm.token_count.prompt" * 2.5 / 1000000 +
    "attributes.llm.token_count.completion" * 10 / 1000000
) / APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
Evaluation cost estimate — estimate how much it costs to run an eval over your traces
-- Eval cost: inputs + outputs + eval prompt template + eval output
SELECT
    (SUM("attributes.llm.token_count.prompt" + "attributes.llm.token_count.completion" + 229) * 0.00000025) +
    (COUNT(*) * (104 * 0.00001))
FROM model
Replace 229 with your eval prompt template’s token length, 104 with the estimated output token length, and the rates with your eval model’s per-token input/output costs.

5. Eval quality against human labels

When you have both auto-evals and human annotations, measure how well the eval matches the human ground truth.
SELECT
    PRECISION(
        predicted = "eval.<EVAL_NAME>.label",
        actual = "annotation.<ANNOTATION_NAME>.label",
        pos_class = '<LABEL>'
    )
FROM model
Example — precision of a QA correctness eval
-- Precision of QA correctness eval vs. human annotation
SELECT
    PRECISION(
        predicted = "eval.QA_Correctness_Eval.label",
        actual = "annotation.Judge Correctness.label",
        pos_class = 'correct'
    )
FROM model
RECALL(...), F1(...), TP(...), FP(...) follow the same signature. See Metric functions below for the full catalog.

AQL syntax

The patterns above cover most cases. When you need to adapt or extend them, this is the full AQL vocabulary. A metric is a single query that aggregates rows from model (the span table). You pick your dimensions, choose an aggregate or metric function, scope which rows count with WHERE or FILTER, and use CASE or operators inside expressions. Each subsection below walks one of those pieces.

Query shape

Every metric starts with this template. SELECT must return one value — there is no GROUP BY in AQL. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate.
-- Template
SELECT
  < AGG_FUNC | METRIC_FUNC >(exprs)
  [FILTER (WHERE exprs)]
  [CASE WHEN ... END]
  [ <OPERATOR> { < AGG_FUNC | ARIZE_METRIC >(exprs) | constant }] …
FROM model
[WHERE (exprs)]
-- Example — hallucination rate
SELECT
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" = 'hallucinated') /
    COUNT(*) FILTER (WHERE "eval.Hallucination.label" != null) * 100
FROM model

Constants, expressions, and comments

The general building blocks you use inside a query.
  • Constants — numbers (int or float) or strings in single quotes ('active').
  • Expressions — math/boolean operators combining dimensions and constants; no aggregate inside. Can nest: A * (B + C).
  • Comments — single line with -- (CMD+/) or multi-line with /* */.
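
Putting these pieces together, a query can mix constants and a nested expression inside an aggregate. This sketch averages two eval scores with an equal-weight constant (the eval names and 0.5 weight are illustrative):
-- Illustrative: equal-weight average of two eval scores
SELECT AVG(0.5 * ("eval.QA_Correctness.score" + "eval.Relevance.score"))
FROM model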

Dimensions

A dimension is any span property, feature, tag, prediction, or actual — string or numeric. Wrap any column with dots or spaces in double quotes. Here are the ones you’ll reference most on trace data:
| Column | What it captures |
| --- | --- |
| "status_code" | OK or ERROR on each span |
| "attributes.openinference.span.kind" | LLM, TOOL, RETRIEVER, CHAIN, AGENT, EMBEDDING |
| "attributes.llm.token_count.prompt" / .completion | Token counts per LLM span |
| "attributes.llm.model_name" | Model used (e.g. gpt-4o-mini) |
| "attributes.tool.name" | Name of the tool called on a TOOL span |
| "attributes.session.id" | Session identifier |
| "eval.<name>.label" / .score | Eval results |
| "annotation.<name>.label" | Human labels |

Aggregation functions

Reduce many rows to one number. Every metric must have at least one aggregation or metric function.
| Function | Description | Type |
| --- | --- | --- |
| COUNT(*) | Counts the number of rows | n/a |
| APPROX_COUNT_DISTINCT(exprs) | Counts the unique values of exprs | String |
| SUM(exprs) | Sums the value across rows | Numeric |
| AVG(exprs) | Averages the value across rows | Numeric |
| APPROX_QUANTILE(exprs, quantile=<decimal>) | Approximate quantile; quantile must be between 0 and 1 inclusive | Numeric |
| MIN(exprs) | Minimum across rows | Numeric |
| MAX(exprs) | Maximum across rows | Numeric |
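
As one example, APPROX_QUANTILE gives you a percentile without exporting data — here a p95 over completion token counts:
-- p95 completion token count per LLM span
SELECT APPROX_QUANTILE("attributes.llm.token_count.completion", quantile=0.95)
FROM model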

Metric functions

Beyond basic aggregates, AQL ships pre-built statistical metrics. All take keyword arguments. Omit pos_class= to use the model’s configured positive class. See the concrete example in Common patterns #5.
| Function | Signature | Returns |
| --- | --- | --- |
| PRECISION | PRECISION(actual=<col>, predicted=<col>, pos_class='<label>') | Precision for the positive class |
| RECALL | RECALL(actual=<col>, predicted=<col>, pos_class='<label>') | Recall for the positive class |
| F1 | F1(actual=<col>, predicted=<col>, pos_class='<label>') | F1 score (harmonic mean of precision and recall) |
| F_BETA | F_BETA(actual=<col>, predicted=<col>, pos_class='<label>', beta=<n>) | F-score weighted by beta (default 1). beta=0 → precision; beta→∞ → recall |
| TP / FP / TN / FN | Same signature as PRECISION | True / false positive / negative rate |
| ACCURACY | ACCURACY(actual=<col>, predicted=<col>) | Overall accuracy |
| LOG_LOSS | LOG_LOSS(actual=<string col>, predicted=<numeric col>, pos_class='<label>') | Log loss |
| AUC | AUC(actual=<col>, predicted=<col>) | ROC AUC |
| MAX_PRECISION | MAX_PRECISION(actual=<col>, predicted=<col>, pos_class='<label>', group_by_column='<col>') | Groups by group_by_column, keeps highest-score row per group, computes precision |
The regression metrics below all share the signature FUNC(actual=<col>, predicted=<col>).
| Function | Returns |
| --- | --- |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MSE | Mean squared error |
| RMSE | Root mean squared error |
NDCG(ranking_relevance=<col>, prediction_group_id=<col>, rank=<col>, omit_zero_relevance=True, k=10)
Normalized Discounted Cumulative Gain. omit_zero_relevance controls whether 0-relevance rows affect averaging.
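
As a sketch of the classification functions beyond precision, ACCURACY drops pos_class because it scores all classes at once (the column names reuse the eval and annotation from pattern 5 and are illustrative):
-- Overall agreement between the eval and the human annotation
SELECT ACCURACY(
    actual = "annotation.Judge Correctness.label",
    predicted = "eval.QA_Correctness_Eval.label"
)
FROM model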

Filters — WHERE and FILTER

Scope which rows the aggregate sees. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate. Identical syntax inside. Subqueries are not supported in either.
-- WHERE: every row the query sees must match
SELECT COUNT(*)
FROM model
WHERE "attributes.openinference.span.kind" = 'LLM'

-- FILTER: the filter only applies to the preceding COUNT
SELECT
    COUNT(*) FILTER (WHERE "eval.Relevance.label" = 'relevant') /
    COUNT(*) * 100
FROM model

Conditionals — CASE

Classify rows into buckets before aggregating. Conditions are evaluated in order; the first true branch wins. ELSE supplies a default; with no ELSE and no matching branch, the CASE returns NULL.
-- Example — count errors by bucketing spans into 1/0
SELECT SUM(CASE WHEN "status_code" = 'ERROR' THEN 1 ELSE 0 END)
FROM model
-- Searched CASE
CASE
  WHEN exprs THEN result1
  [ WHEN exprs THEN result2 ]
  [ ELSE default ]
END

-- Simple CASE — equality against a common expression
CASE expression
  WHEN value THEN result
  [WHEN ...]
  [ELSE result]
END
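
A simple CASE can, for instance, turn a string label into a number you can average (the eval name and label values are illustrative):
-- % of results labeled relevant, via simple CASE
SELECT AVG(CASE "eval.Relevance.label" WHEN 'relevant' THEN 1.0 ELSE 0.0 END) * 100
FROM model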

Operators

Math and comparison building blocks used inside expressions. Numeric — apply to numeric dimensions only.
| Operator | Description |
| --- | --- |
| A + B / A - B / A * B / A / B | Arithmetic (integer division is cast to FLOAT) |
| ABS(A) | Absolute value |
| CEIL / FLOOR | Round up / down to the nearest integer |
| COS, SIN, TAN, TANH | Trigonometry |
| LN, LOG / LOG10 | Natural log (base e) / log base 10 |
| SQRT / CBRT | Square root / cube root |
| GREATEST / LEAST | Greatest / least of the arguments; NULL if any argument is NULL |
Comparison — apply to strings and numerics.
| Operator | Description |
| --- | --- |
| A = B / A <> B / A != B | Equal / not equal |
| A > B / A >= B / A < B / A <= B | Ordered comparisons |
| A IS NULL / A IS NOT NULL | Null checks |
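
Operators compose freely inside aggregates. As one sketch, ABS measures the per-span gap between two eval scores before averaging (the eval names are illustrative):
-- Average absolute gap between two eval scores
SELECT AVG(ABS("eval.QA_Correctness.score" - "eval.Relevance.score"))
FROM model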

Next step

You’ve defined what to measure. Now put those metrics on a dashboard so you can see them at a glance:

Next: Set Up Dashboards