How to set up custom metrics
Open your project, click the ... menu in the top right, and select Custom Metrics. From there, you have three ways to create one:
- Alyx
- UI
- Code
Click New Custom Metric. On the right side of the page, Alyx can either recommend a metric or translate one you describe into AQL. Ask for a recommendation if you’re not sure what to track, or describe the metric you want:
- “What metrics should I set up for this project?”
- “Suggest a quality metric based on the evals I’m running”
- “Count responses flagged as hallucinated”
- “Percentage of retriever spans where Relevance is ‘relevant’”
- “Total token cost divided by number of distinct sessions”
- “A weighted score: 25% QA correctness plus 75% relevance”

Common patterns
Wondering what to put in the SQL Query box? Most custom metrics fall into one of five shapes. Each section below has a template you can adapt plus a concrete example. Every example starts with a -- description comment, which is the format Alyx generates and a useful habit for readability.
1. Count of an eval label
How many times did a specific label occur? Pair COUNT(*) with a FILTER (WHERE ...) clause.
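A minimal sketch of this shape, assuming a hypothetical eval named hallucination with a 'hallucinated' label:

```sql
-- count of responses flagged as hallucinated
SELECT COUNT(*) FILTER (WHERE "eval.hallucination.label" = 'hallucinated')
FROM model
```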
2. Rate (% of total)
What percentage of traffic has a given eval label, status, or attribute? Divide a filtered count by an unfiltered count.
3. Weighted composite score
Combine multiple eval scores into one number — useful when quality depends on several dimensions (e.g., correctness and relevance). Use each eval's .score, not its .label.
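A sketch of this shape, assuming two hypothetical evals named correctness and relevance, using the 25/75 weighting from the Alyx example above:

```sql
-- weighted composite: 25% correctness plus 75% relevance
SELECT AVG(0.25 * "eval.correctness.score" + 0.75 * "eval.relevance.score")
FROM model
```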
4. Cost and usage
Turn token counts into dollars, count distinct sessions, and roll them up together or separately.
Total cost
Replace 2.5 and 10 with your model’s per-million input/output token rates.
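A sketch of the total-cost template, with 2.5 and 10 standing in for dollar rates per million input/output tokens:

```sql
-- total dollar cost across all LLM spans (rates are placeholders)
SELECT SUM("attributes.llm.token_count.prompt") / 1000000 * 2.5
     + SUM("attributes.llm.token_count.completion") / 1000000 * 10
FROM model
```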
Distinct sessions
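A sketch of this one, using the session dimension listed under Dimensions below:

```sql
-- number of unique sessions
SELECT APPROX_COUNT_DISTINCT("attributes.session.id")
FROM model
```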
A related template estimates the cost of running an eval itself: replace 229 with your eval prompt template’s token length, 104 with the estimated output token length, and the rates with your eval model’s per-token input/output costs.
5. Eval quality against human labels
When you have both auto-evals and human annotations, measure how well the eval matches the human ground truth. PRECISION(...), RECALL(...), F1(...), TP(...), and FP(...) all share the same signature. See Metric functions below for the full catalog.
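A sketch using PRECISION, assuming a hypothetical eval and annotation both named hallucination:

```sql
-- how often the eval's 'hallucinated' calls agree with human labels
SELECT PRECISION(actual="annotation.hallucination.label",
                 predicted="eval.hallucination.label",
                 pos_class='hallucinated')
FROM model
```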
AQL syntax
The patterns above cover most cases. When you need to adapt or extend them, this is the full AQL vocabulary. A metric is a single query that aggregates rows from model (the span table). You pick your dimensions, choose an aggregate or metric function, scope which rows count with WHERE or FILTER, and use CASE or operators inside expressions. Each subsection below walks through one of those pieces.
Query shape
Every metric starts with the same template. SELECT must return one value — there is no GROUP BY in AQL. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate.
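In outline, with a concrete minimal example using dimensions from the table below:

```sql
-- shape: SELECT <one aggregate or metric function> FROM model [WHERE ...]
SELECT COUNT(*)
FROM model
WHERE "status_code" = 'ERROR'
```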
Constants, expressions, and comments
The general building blocks you use inside a query.
- Constants — numbers (int or float) or strings in single quotes ('active').
- Expressions — math/boolean operators combining dimensions and constants; no aggregates inside. Can nest: A * (B + C).
- Comments — -- single line (CMD+/) or /* multi-line */.
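All three building blocks in one sketch:

```sql
/* average total tokens per span:
   an expression combining two numeric dimensions */
SELECT AVG("attributes.llm.token_count.prompt"
         + "attributes.llm.token_count.completion") -- no aggregate inside the expression
FROM model
```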
Dimensions
A dimension is any span property, feature, tag, prediction, or actual — string or numeric. Wrap any column with dots or spaces in double quotes. Here are the ones you’ll reference most on trace data:

| Column | What it captures |
|---|---|
| "status_code" | OK or ERROR on each span |
| "attributes.openinference.span.kind" | LLM, TOOL, RETRIEVER, CHAIN, AGENT, EMBEDDING |
| "attributes.llm.token_count.prompt" / .completion | Token counts per LLM span |
| "attributes.llm.model_name" | Model used (e.g. gpt-4o-mini) |
| "attributes.tool.name" | Name of the tool called on a TOOL span |
| "attributes.session.id" | Session identifier |
| "eval.<name>.label" / .score | Eval results |
| "annotation.<name>.label" | Human labels |
Aggregation functions
Reduce many rows to one number. Every metric must have at least one aggregation or metric function.

| Function | Description | Type |
|---|---|---|
| COUNT(*) | Counts the number of rows | n/a |
| APPROX_COUNT_DISTINCT(exprs) | Counts the unique values of exprs | String |
| SUM(exprs) | Sums the value across rows | Numeric |
| AVG(exprs) | Averages the value across rows | Numeric |
| APPROX_QUANTILE(exprs, quantile=<decimal>) | Approximate quantile; quantile must be between 0 and 1 inclusive | Numeric |
| MIN(exprs) | Minimum across rows | Numeric |
| MAX(exprs) | Maximum across rows | Numeric |
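For example, a latency-style tail metric over token usage, sketched with the quantile keyword from the table above:

```sql
-- p95 of completion tokens per LLM span
SELECT APPROX_QUANTILE("attributes.llm.token_count.completion", quantile=0.95)
FROM model
```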
Metric functions
Beyond basic aggregates, AQL ships pre-built statistical metrics. All take keyword arguments. Omit pos_class= to use the model’s configured positive class. See the concrete example in Common patterns #5.
Classification
| Function | Signature | Returns |
|---|---|---|
| PRECISION | PRECISION(actual=<col>, predicted=<col>, pos_class='<label>') | Precision for the positive class |
| RECALL | RECALL(actual=<col>, predicted=<col>, pos_class='<label>') | Recall for the positive class |
| F1 | F1(actual=<col>, predicted=<col>, pos_class='<label>') | F1 score (harmonic mean of precision and recall) |
| F_BETA | F_BETA(actual=<col>, predicted=<col>, pos_class='<label>', beta=<n>) | F-score weighted by beta (default 1). beta=0 → precision; beta→∞ → recall |
| TP / FP / TN / FN | Same signature as PRECISION | True / false positive / negative rate |
| ACCURACY | ACCURACY(actual=<col>, predicted=<col>) | Overall accuracy |
| LOG_LOSS | LOG_LOSS(actual=<string col>, predicted=<numeric col>, pos_class='<label>') | Log loss |
| AUC | AUC(actual=<col>, predicted=<col>) | ROC AUC |
| MAX_PRECISION | MAX_PRECISION(actual=<col>, predicted=<col>, pos_class='<label>', group_by_column='<col>') | Groups by group_by_column, keeps highest-score row per group, computes precision |
Regression
All four share the signature FUNC(actual=<col>, predicted=<col>).

| Function | Returns |
|---|---|
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MSE | Mean squared error |
| RMSE | Root mean squared error |
Ranking
omit_zero_relevance controls whether 0-relevance rows affect averaging.
Filters — WHERE and FILTER
Scope which rows the aggregate sees. WHERE scopes the whole query; FILTER (WHERE ...) scopes just the preceding aggregate. Identical syntax inside. Subqueries are not supported in either.
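Both clauses in one sketch — this is also the rate pattern from Common patterns #2:

```sql
-- error rate among LLM spans:
-- WHERE scopes both aggregates; FILTER scopes only the numerator
SELECT COUNT(*) FILTER (WHERE "status_code" = 'ERROR') / COUNT(*)
FROM model
WHERE "attributes.openinference.span.kind" = 'LLM'
```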
Conditionals — CASE
Classify rows into buckets before aggregating. Expressions are evaluated in order; the first true branch wins. ELSE gives a default — without one, a row with no match returns NULL.
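A sketch that buckets spans by prompt size, with 1000 as an arbitrary threshold; averaging the 1.0/0.0 buckets yields the share of long-prompt spans:

```sql
-- fraction of spans with prompts over 1000 tokens; first true branch wins
SELECT AVG(CASE
  WHEN "attributes.llm.token_count.prompt" > 1000 THEN 1.0
  ELSE 0.0
END)
FROM model
```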
Operators
Math and comparison building blocks used inside expressions.
Numeric — apply to numeric dimensions only.

| Operator | Description |
|---|---|
| A + B / A - B / A * B / A / B | Arithmetic (integer division cast to FLOAT) |
| ABS(A) | Absolute value |
| CEIL / FLOOR | Round up / down to nearest integer |
| COS, SIN, TAN, TANH | Trigonometry |
| LN, LOG / LOG10 | Natural log (base e) / log base 10 |
| SQRT / CBRT | Square root / cube root |
| GREATEST / LEAST | Greatest / least of the arguments; NULL if any argument is NULL |
Comparison:

| Operator | Description |
|---|---|
| A = B / A <> B / A != B | Equal / not equal |
| A > B / A >= B / A < B / A <= B | Ordered comparisons |
| A IS NULL / A IS NOT NULL | Null checks |