AnswerAccuracy (does the response match a reference), ContextRelevance (does the retrieved context cover the question), and ResponseGroundedness (is the response actually supported by the context). They’re tuned to match NVIDIA’s published RAG quality benchmarks while reusing Ragas’s prompting and execution machinery.
This guide shows both ways to wire them into Arize AX: Flow 1 grades existing Arize AX traces with ResponseGroundedness and writes the scores back via client.spans.update_evaluations(...); Flow 2 uploads a small dataset, runs an Arize AX experiment with the same evaluator wrapped as an experiment evaluator, and surfaces the scores in Datasets+Experiments. For the sibling Ragas integration (the standard metrics like Faithfulness and AnswerRelevancy), see the Ragas evaluation guide.
Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.
Prerequisites
- Python 3.11+
- An
ARIZE_SPACE_IDandARIZE_API_KEYfrom your Arize AX space settings - An
OPENAI_API_KEYfrom OpenAI Platform (used as both the model under trace and the judge model for NVIDIA’s metrics)
Launch Arize AX
If you don’t already have an Arize AX account, sign up at arize.com and grab yourARIZE_SPACE_ID and ARIZE_API_KEY from Settings → Space Settings.
Install
Configure credentials
Define evaluators
The shared setup: NVIDIA’s v2ResponseGroundedness metric (from ragas.metrics.collections) backed by gpt-5-mini via Ragas’s llm_factory, the canonical 2-row hallucination dataset both flows score, and an Arize SDK client. temperature=1.0 is passed explicitly because gpt-5 only supports the default temperature.
Flow 1 — Evaluate existing traces
Source the spans
Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.Run the evaluators
The v2 metric exposesascore(response=..., retrieved_contexts=[...]) directly — no SingleTurnSample wrapper. It’s async, so wrap each call in asyncio.run(...) from sync code. Flow 2 below uses the same metric from inside an async def evaluator wrapper.
Log evaluations to Arize AX
Expected output
Verify in Arize AX
Open the project namednv-ragas-tracing-example-<timestamp> (the value printed above) in your Arize AX space. Each ChatCompletion span now carries an nv_response_groundedness annotation column showing the 0/1 score and the grounded / ungrounded label.
Flow 2 — Run an experiment
Create a dataset
Define the task
Wrap the evaluators
Experiment evaluators run inside anasyncio loop, so the wrapper is async def and awaits response_groundedness.ascore(...) directly. Return an EvaluationResult with score, label, and explanation populated — leaving any of those as None triggers unsupported cast from null to <type>: reserved column cannot be coerced to canonical type at upload time.
Run the experiment
Expected output
Verify in Arize AX
Open the Datasets + Experiments tab in Arize AX. The datasetnv-ragas-experiment-example-ds-<timestamp> and the experiment nv-ragas-experiment-example-<timestamp> (names printed above) appear with one run per dataset row, each carrying the nv_response_groundedness score and label columns.
Troubleshooting
Skipping a sample by assigning it nan score. The judge call failed (rate limit, model error, etc.) and Ragas swallowed the exception. Check the warning lines just above this message in stderr for the actual error.column "eval.nv_response_groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type. Your experiment evaluator returned a bare float instead of a fully-populatedEvaluationResult(score=..., label=..., explanation=...). Arize AX’s Flight server rejects null reserved columns.- Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that
ARIZE_SPACE_ID+ARIZE_API_KEYare right and that you’re connecting to the correct region’s OTLP endpoint (otlp.arize.comfor US,otlp.eu.arize.comfor EU). - Using
AnswerAccuracyorContextRelevanceinstead. SwapResponseGroundednessforAnswerAccuracy(requiresreferencefield on the sample) orContextRelevance(requiresuser_input+retrieved_contexts). The wiring is otherwise identical. See Ragas NVIDIA metrics docs for the per-metric required fields. - Using the official NVIDIA RAG-Eval suite directly (not via Ragas). The standalone
nvidia-rag-evalpackage exists but is paywalled behind NVIDIA AI Enterprise. The Ragas wrappers used here are the open, community-supported path to the same metric definitions. - Experiment re-runs collide. Both names embed
TIMESTAMP = int(time.time())so a single re-run produces unique names. If you re-execute the samecombined.pyquickly, regenerateTIMESTAMPfirst or callarize.experiments.delete(...)/arize.datasets.delete(...)on the prior run’s names.