GroundednessEvaluator and writes the scores back via client.spans.update_evaluations(...); Flow 2 uploads a small dataset, runs an Arize AX experiment with the same evaluator wrapped as an experiment evaluator, and surfaces the scores in Datasets+Experiments.
Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.
Prerequisites
- Python 3.11+
- An ARIZE_SPACE_ID and ARIZE_API_KEY from your Arize AX space settings
- An OPENAI_API_KEY from OpenAI Platform (used to call both the model under trace and the judge model for GroundednessEvaluator)
Launch Arize AX
If you don’t already have an Arize AX account, sign up at arize.com and grab your ARIZE_SPACE_ID and ARIZE_API_KEY from Settings → Space Settings.
Install
Configure credentials
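Before going further, it can help to confirm that the three credentials listed under Prerequisites are present in the environment. The sketch below is one way to do that. The helper name missing_credentials is hypothetical, and reading the values from environment variables is an assumption about how you store them.

```python
import os

# Variable names taken from the Prerequisites list in this guide.
REQUIRED_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; pass a plain dict to check other sources.
    (Hypothetical helper, not part of any SDK.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Set these before continuing:", ", ".join(missing))
else:
    print("All credentials found.")
```

Checking up front keeps a missing key from surfacing later as an authentication error in the middle of a flow.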
Define evaluators
The shared setup: a Microsoft GroundednessEvaluator backed by gpt-4.1-mini, the canonical 2-row hallucination dataset that both flows score, and an Arize SDK client. The judge model is pinned to gpt-4.1-mini because azure-ai-evaluation still sends the legacy max_tokens parameter, which the GPT-5 and o-series families reject. gpt-4.1-mini accepts max_tokens natively and is more deterministic at temperature 0 than gpt-4o-mini.
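As a rough illustration of the judge configuration described above, the sketch below builds a model_config dictionary for plain OpenAI. Only the model name (gpt-4.1-mini) and the base_url value come from this guide. The "type", "api_key", and "model" key names are assumptions modeled on the Azure variant listed under Troubleshooting, and build_judge_model_config is a hypothetical helper, so check the azure-ai-evaluation documentation before relying on it.

```python
import os

JUDGE_MODEL = "gpt-4.1-mini"                   # pinned: still accepts max_tokens
OPENAI_BASE_URL = "https://api.openai.com/v1"  # must be set explicitly


def build_judge_model_config(api_key: str) -> dict:
    """Build a model_config dict for a plain-OpenAI judge (hypothetical helper)."""
    return {
        "type": "openai",  # assumption: mirrors "type": "azure" in the Azure variant
        "api_key": api_key,
        "model": JUDGE_MODEL,
        "base_url": OPENAI_BASE_URL,
    }


# Falls back to an empty string here so the sketch runs without credentials.
model_config = build_judge_model_config(os.environ.get("OPENAI_API_KEY", ""))
```

The resulting dictionary is what would be handed to GroundednessEvaluator as its model configuration.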
Flow 1 — Evaluate existing traces
Source the spans
Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.
Run the evaluators
GroundednessEvaluator.__call__ is synchronous, so call it once per span, passing the question / answer / reference triple pulled from that span.
The raw groundedness value is a 1–5 score from the judge LLM, which can drift by one point between runs at the extremes. This guide grades on the deterministic groundedness_result (pass / fail against the configured threshold of 3) and normalizes it to 1.0 / 0.0 so the score column is stable across runs. If you want the raw 1–5 number, swap in float(result["groundedness"]).
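The normalization described above can be sketched as a small pure function. It assumes the evaluator result is a dictionary carrying the groundedness and groundedness_result keys named in this guide. The helper names are hypothetical, and the grounded / ungrounded labels follow the ones shown in the Verify step.

```python
def normalize_groundedness(result: dict) -> tuple[float, str]:
    """Map the pass / fail verdict to a stable (score, label) pair."""
    passed = str(result["groundedness_result"]).lower() == "pass"
    return (1.0, "grounded") if passed else (0.0, "ungrounded")


def raw_groundedness(result: dict) -> float:
    """Alternative: keep the judge's raw 1-5 score instead of the verdict."""
    return float(result["groundedness"])


# Example with a hand-written result dict (not real evaluator output).
score, label = normalize_groundedness(
    {"groundedness": 4, "groundedness_result": "pass"}
)
```

Grading on the verdict means a raw score that moves between 4 and 5 across runs still produces the same 1.0.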
Log evaluations to Arize AX
Expected output
Verify in Arize AX
Open the project named microsoft-tracing-example-<timestamp> (the value printed above) in your Arize AX space. Each ChatCompletion span now carries a groundedness annotation column showing the normalized 0/1 score and the grounded / ungrounded label.
Flow 2 — Run an experiment
Create a dataset
Define the task
Wrap the evaluators
GroundednessEvaluator.__call__ is already safe to invoke from inside an asyncio loop (the library wraps its async core with async_run_allowing_running_loop), so the experiment evaluator is a plain def, not async def. Return an EvaluationResult with score, label, and explanation populated — leaving any of those as None triggers unsupported cast from null to <type>: reserved column cannot be coerced to canonical type at upload time.
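One way to guard against the null-cast error described above is to validate the three fields before constructing the result. The sketch below is a plain-Python check. complete_eval_fields is a hypothetical helper, and it returns a dictionary rather than importing EvaluationResult, since the exact import path depends on your Arize SDK version.

```python
def complete_eval_fields(score, label, explanation) -> dict:
    """Return score / label / explanation, failing fast if any is None."""
    fields = {"score": score, "label": label, "explanation": explanation}
    missing = [name for name, value in fields.items() if value is None]
    if missing:
        raise ValueError(f"evaluation fields must not be None: {missing}")
    return {
        "score": float(score),
        "label": str(label),
        "explanation": str(explanation),
    }


# Example values; in the experiment evaluator these come from the judge result.
fields = complete_eval_fields(1.0, "grounded", "Response is supported by the context.")
# Assumed usage inside the evaluator: return EvaluationResult(**fields)
```

Raising locally gives a clearer message than the cast error reported at upload time.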
Run the experiment
Expected output
Verify in Arize AX
Open the Datasets + Experiments tab in Arize AX. The dataset microsoft-experiment-example-ds-<timestamp> and the experiment microsoft-experiment-example-<timestamp> (names printed above) appear with one run per dataset row, each carrying the groundedness score and label columns.
Troubleshooting
- OpenAIConnection.__init__() missing 1 required positional argument: 'base_url'. The Azure AI Evaluation library requires an explicit base_url in the model_config even for plain OpenAI. Set it to https://api.openai.com/v1 as shown in the Define evaluators block.
- Unsupported parameter: 'max_tokens' is not supported with this model. azure-ai-evaluation sends OpenAI requests with the legacy max_tokens parameter that GPT-5 and o-series models reject. Pin the judge to a model that still accepts max_tokens (gpt-4.1-mini, gpt-4o-mini, gpt-4o).
- column "eval.groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type. Your experiment evaluator returned a bare float or a dict that didn’t fill all three of score / label / explanation. Return a fully populated EvaluationResult(...).
- Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that ARIZE_SPACE_ID + ARIZE_API_KEY are right and that you’re connecting to the correct region’s OTLP endpoint (otlp.arize.com for US, otlp.eu.arize.com for EU).
- Using Azure OpenAI for the judge model. Swap the model_config for {"type": "azure", "api_key": "...", "azure_endpoint": "https://<resource>.openai.azure.com", "azure_deployment": "<deployment-name>", "api_version": "2024-10-21"}. The rest of the guide is unchanged.
- Using safety evaluators (HateUnfairness, Violence, etc.) instead. Those require an Azure AI Foundry project and use AzureAIProject(subscription_id=..., resource_group_name=..., project_name=...) as the evaluator’s second arg instead of model_config. See Microsoft’s safety eval docs.
- Using a different score scale. Microsoft’s LLM-judged evaluators (Groundedness, Relevance, Coherence, Fluency, Similarity, Retrieval) all return scores on a 1–5 scale. To project onto the 0–1 range for downstream tooling, normalize before assigning: score = (raw - 1) / 4.
- Experiment re-runs collide. Both names embed TIMESTAMP = int(time.time()) so a single re-run produces unique names. If you re-execute the same combined.py quickly, regenerate TIMESTAMP first or call arize.experiments.delete(...) / arize.datasets.delete(...) on the prior run’s names.
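The linear rescaling from the score-scale item above, score = (raw - 1) / 4, can be sketched as follows. The function name normalize_1_to_5 and the range check are illustrative additions, not part of either library.

```python
def normalize_1_to_5(raw: float) -> float:
    """Rescale a 1-5 judge score linearly onto the 0-1 range."""
    if not 1 <= raw <= 5:
        raise ValueError(f"expected a score between 1 and 5, got {raw}")
    return (raw - 1) / 4


# The endpoints and midpoint of the scale map to 0.0, 0.5, and 1.0.
low, mid, high = normalize_1_to_5(1), normalize_1_to_5(3), normalize_1_to_5(5)
```

Unlike the pass / fail normalization used in Flow 1, this keeps the intermediate values, so it inherits the run-to-run drift of the raw score.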