Improve Your Agent

In the previous guide, the groundedness evaluator revealed a pattern: the chatbot makes claims that are not in the policy documents. The root cause is the system prompt: it says “be helpful” but doesn’t enforce grounding, so the LLM fills gaps with plausible-sounding information that may not match your actual policies. Rather than guessing at a fix and redeploying, AX gives you a better workflow: start from a real failure, fix it in Playground using the exact inputs that went wrong, then validate across a full dataset before shipping.

This is Part 3 of the Arize AX Get Started series. You should have completed the Evaluations guide first, with evaluation scores visible on your traces.

Step 1: Find a low-scoring trace

Go to your skyserve-chatbot project and filter or sort your traces by the groundedness evaluation score. Find a trace that failed — one where the chatbot made up information not in the policy documents.

Traces filtered by groundedness evaluation showing hallucinated traces

Click into the trace to see the details. Note what the chatbot said that was wrong, and check the retrieved context — the policy document was probably correct, but the LLM added information that wasn’t there.
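If you prefer to inspect exported trace data offline, the same "find the worst traces" filter can be sketched in a few lines of Python. The record layout here (`trace_id`, `groundedness_label`, `groundedness_score`) and the `hallucinated` label are assumptions made for illustration, not the AX export schema.

```python
def lowest_scoring(traces, label_key="groundedness_label",
                   score_key="groundedness_score", limit=5):
    """Return up to `limit` failing traces, lowest score first."""
    # Keep only traces the evaluator flagged (label name is hypothetical).
    failing = [t for t in traces if t.get(label_key) == "hallucinated"]
    # Sort ascending so the worst-scoring traces come first.
    return sorted(failing, key=lambda t: t.get(score_key, 0.0))[:limit]


# Made-up sample records, only to show the shape of the data.
sample_traces = [
    {"trace_id": "t1", "groundedness_label": "grounded", "groundedness_score": 1.0},
    {"trace_id": "t2", "groundedness_label": "hallucinated", "groundedness_score": 0.0},
    {"trace_id": "t3", "groundedness_label": "hallucinated", "groundedness_score": 0.2},
]

worst = lowest_scoring(sample_traces)
print([t["trace_id"] for t in worst])
```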

Step 2: Replay in Prompt Playground

Click Open in Playground on the span. AX automatically populates the system prompt, user message, and model settings that produced the bad answer. You’re now looking at the exact inputs that went wrong. No guessing, no manual setup.

Trace detail for an LLM ChatCompletion span showing trace tree, span evaluations, Open in Playground, and Input Output tab with model and system prompt

Prompt Playground auto-populated from a trace with system prompt, user message, and model

Step 3: Improve the system prompt

The original is too loose:

You are SkyServe Airlines' customer service assistant.
Answer the customer's question based on the provided policy documents.
Be friendly and helpful.

Tighten it with explicit grounding rules:

You are SkyServe Airlines' customer service assistant.

IMPORTANT RULES:
- ONLY answer based on the policy documents provided below.
- If the answer is not in the documents, say:
  "I don't have specific information about that. Please contact our
  support team at 1-800-SKYSERVE for assistance."
- Never make up policies, fees, or conditions not explicitly stated.
- When quoting fees or rules, reference which policy they come from.
- Be friendly and concise.

Click Run to re-generate the response with your updated prompt. You should see a more grounded answer — one that sticks to what the policy documents actually say.

Playground with improved system prompt and new grounded response
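For reference, here is a minimal sketch of how the tightened prompt could be combined with retrieved policy text outside the Playground. The helper name `build_messages` and the sample policy snippet are hypothetical; the user-message layout mirrors the one used in Step 9 of this guide.

```python
GROUNDED_SYSTEM_PROMPT = """You are SkyServe Airlines' customer service assistant.

IMPORTANT RULES:
- ONLY answer based on the policy documents provided below.
- If the answer is not in the documents, say:
  "I don't have specific information about that. Please contact our
  support team at 1-800-SKYSERVE for assistance."
- Never make up policies, fees, or conditions not explicitly stated.
- When quoting fees or rules, reference which policy they come from.
- Be friendly and concise."""


def build_messages(system_prompt, policy_docs, question):
    """Assemble chat messages: rules in the system turn, documents in the user turn."""
    context = "\n\n".join(policy_docs)
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"Policy documents:\n{context}\n\nCustomer question: {question}",
        },
    ]


# Hypothetical policy snippet, used only to show the message shape.
docs = ["Baggage policy: a second checked bag costs $45 on all fare types."]
messages = build_messages(
    GROUNDED_SYSTEM_PROMPT, docs, "How much does a second checked bag cost?"
)
print(messages[1]["content"])
```

Keeping the rules in the system message and the documents in the user message means the same prompt text can be reused unchanged for every request.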

Step 4: Test against more traces

Find a few more traces, both passing and failing, and replay with the new prompt. Confirm failing traces improve and passing traces still pass.

Playground testing the improved prompt against a different trace input

The prompt looks better on these traces. But prompt changes can have unintended side effects, and a few spot-checks are not enough to be sure. You need to test against a representative set of queries and measure the difference. That is what the next steps are for.
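To keep track of spot-checks while you replay traces, a small bookkeeping helper can be enough. The sketch below assumes you record, by hand, whether each trace passed the groundedness check before and after the prompt change; the trace IDs and pass/fail values are invented for illustration.

```python
def compare_spot_checks(before, after):
    """Classify traces by how their pass/fail status changed.

    `before` and `after` map trace id -> True if the response was grounded.
    Only traces present in both mappings are compared.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
        "still_failing": [t for t in shared if not before[t] and not after[t]],
    }


# Invented example: two failing and two passing traces replayed with the new prompt.
before = {"t1": False, "t2": True, "t3": False, "t4": True}
after = {"t1": True, "t2": True, "t3": False, "t4": False}

report = compare_spot_checks(before, after)
print(report)
```

Any entry under `regressed` is a passing trace that the new prompt broke, which is exactly the side effect a handful of spot-checks can miss.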

Step 5: Create a dataset

A dataset is a collection of test cases you will run through your chatbot. Good datasets include common questions, edge cases, and known production failures. Download the sample CSV from the companion notebook or create your own:

input	expected_output
Can I get a refund on my non-refundable ticket?	No cash refund, but a travel credit is issued minus a $75 change fee. Credits expire in 12 months.
How much does a second checked bag cost?	$45 on all fare types.
I’m a Platinum member. Can I change my Basic fare for free?	Yes, Platinum members get free changes on all fares.
My flight was delayed 3 hours. What compensation do I get?	A $50 travel voucher for future SkyServe flights.

Navigate to Datasets, click + New Dataset, upload your CSV, and name it skyserve-test-cases. You can also build datasets directly from traces by selecting specific spans, which is a great way to turn real production failures into a regression suite.

New Dataset dialog showing CSV upload with preview of test cases
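If you create the CSV yourself, Python's standard `csv` module handles the quoting for you, which matters because some expected outputs contain commas. This is a minimal sketch using the four sample rows above; the function name `write_dataset` is hypothetical, and the file name in the comment simply mirrors the dataset name.

```python
import csv
import io

TEST_CASES = [
    ("Can I get a refund on my non-refundable ticket?",
     "No cash refund, but a travel credit is issued minus a $75 change fee. "
     "Credits expire in 12 months."),
    ("How much does a second checked bag cost?",
     "$45 on all fare types."),
    ("I’m a Platinum member. Can I change my Basic fare for free?",
     "Yes, Platinum members get free changes on all fares."),
    ("My flight was delayed 3 hours. What compensation do I get?",
     "A $50 travel voucher for future SkyServe flights."),
]


def write_dataset(rows, fileobj):
    """Write (input, expected_output) pairs as CSV with a header row."""
    writer = csv.writer(fileobj)
    writer.writerow(["input", "expected_output"])
    writer.writerows(rows)


# Write to an in-memory buffer here; to write a file instead, use
# open("skyserve-test-cases.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_dataset(TEST_CASES, buffer)
print(buffer.getvalue())
```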

Step 6: Run both prompts as experiments

Open your dataset in Playground. Run your original prompt — this will function as your baseline experiment. Then run your improved prompt against the same dataset: paste the refined system prompt from the Playground, or load skyserve-support from Prompt Hub once you have saved it there. You now have two experiments on the same inputs, one for each prompt version.

Experiments tab showing baseline-original-prompt experiment

Playground with improved skyserve-support prompt and dataset ready to run
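Conceptually, each experiment is the same dataset run through a different prompt. The sketch below illustrates that idea in plain Python; it is not how AX runs experiments. The `generate` callable is a stand-in you would replace with a real model call, the improved prompt is abbreviated, and the second experiment name is hypothetical.

```python
ORIGINAL_PROMPT = (
    "You are SkyServe Airlines' customer service assistant.\n"
    "Answer the customer's question based on the provided policy documents.\n"
    "Be friendly and helpful."
)
# Abbreviated version of the improved prompt from Step 3.
IMPROVED_PROMPT = (
    "You are SkyServe Airlines' customer service assistant.\n"
    "ONLY answer based on the policy documents provided below."
)


def run_experiment(name, system_prompt, dataset, generate):
    """Run every dataset row through `generate` and collect the outputs."""
    return [
        {
            "experiment": name,
            "input": row["input"],
            "expected_output": row["expected_output"],
            "output": generate(system_prompt, row["input"]),
        }
        for row in dataset
    ]


def fake_generate(system_prompt, user_input):
    # Stand-in for a model call: it only records which prompt style was used.
    style = "strict" if "ONLY answer" in system_prompt else "loose"
    return f"[{style}] response to: {user_input}"


dataset = [
    {"input": "How much does a second checked bag cost?",
     "expected_output": "$45 on all fare types."},
]
baseline = run_experiment("baseline-original-prompt", ORIGINAL_PROMPT, dataset, fake_generate)
improved = run_experiment("improved-grounded-prompt", IMPROVED_PROMPT, dataset, fake_generate)
print(baseline[0]["output"])
print(improved[0]["output"])
```

Because both runs share the same inputs, any difference in the outputs can be attributed to the prompt change.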

Step 7: Evaluate and compare

To compare the experiments objectively, add an evaluator that scores the results.

1. Navigate to your dataset’s experiments view.
2. Click Add Evaluator.
3. Select your groundedness-check evaluator from the Eval Hub (the same one you created in the Evaluations guide).

You can also add a Helpfulness evaluator — select it from the pre-built templates to measure whether the new prompt’s answers are still useful.

Add Evaluator flow showing available evaluators from the hub

Once complete, use Compare or Diff Mode to see results side by side. You should see groundedness improve while helpfulness stays the same. Click into individual responses to see exactly what changed.

Compare Experiments view showing two experiments side by side

Experiment comparison detail showing old vs new response for a single input
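The comparison boils down to two questions per metric: did the average move, and which individual rows got worse? Here is a minimal sketch of that arithmetic, assuming each evaluator returns 1 for pass and 0 for fail. The scores are toy values chosen to show a regression, not results from this guide.

```python
def summarize(baseline, improved, metrics=("groundedness", "helpfulness")):
    """Compare two experiments row by row (rows must be in the same order)."""
    summary = {}
    for metric in metrics:
        base = [row[metric] for row in baseline]
        new = [row[metric] for row in improved]
        summary[metric] = {
            "baseline": sum(base) / len(base),
            "improved": sum(new) / len(new),
            # Indices of rows where the new prompt scored lower than the old one.
            "regressed_rows": [i for i, (b, n) in enumerate(zip(base, new)) if n < b],
        }
    return summary


# Toy scores for four dataset rows (1 = pass, 0 = fail).
baseline_scores = [
    {"groundedness": 0, "helpfulness": 1},
    {"groundedness": 1, "helpfulness": 1},
    {"groundedness": 0, "helpfulness": 1},
    {"groundedness": 1, "helpfulness": 1},
]
improved_scores = [
    {"groundedness": 1, "helpfulness": 1},
    {"groundedness": 1, "helpfulness": 1},
    {"groundedness": 1, "helpfulness": 0},
    {"groundedness": 1, "helpfulness": 1},
]

result = summarize(baseline_scores, improved_scores)
print(result)
```

In this toy data, groundedness improves on every failing row, but row 2 loses helpfulness. That row is the one to load back into Playground.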

If you see a regression, load it in Playground and iterate. This is the development loop: trace → evaluate → improve → experiment → repeat.

Step 8: Save to Prompt Hub

Once you are happy with the improved prompt, save it to Prompt Hub for version control.

1. Click Save to Prompt Hub in the Playground.
2. Give it a name: skyserve-support.
3. Add a description: “Customer service prompt with grounding instructions”.
4. Add a version description: “Added explicit grounding rules to prevent hallucination”.

Save to Prompt Hub dialog with name, description, and version description

Prompt Hub showing skyserve-support version history and prompt template

Your prompt is now versioned and saved. You can see the full version history, compare versions, and roll back if needed. Your team can see what changed and why.

Step 9: Use the prompt in your app

To close the loop, pull the prompt from Prompt Hub in your application code. This way, your app uses the latest saved version each time it pulls the prompt, so updating a prompt does not require a code deploy. First, install the Prompt Hub package:

pip install "arize[PromptHub]"

Then pull and use the prompt:

from arize.experimental.prompt_hub import ArizePromptClient
from openai import OpenAI

prompt_client = ArizePromptClient(
    space_id="YOUR_SPACE_ID",
    api_key="YOUR_API_KEY",
)

# Pull the latest version of your prompt
prompt = prompt_client.get_prompt(name="skyserve-support")

# Placeholders: in your app, these come from your retriever and the user's request
context = "<retrieved policy documents>"
question = "How much does a second checked bag cost?"

# Use the pulled system prompt in your OpenAI call
oai = OpenAI()
response = oai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": prompt.messages[0]["content"]},
        {"role": "user", "content": f"Policy documents:\n{context}\n\nCustomer question: {question}"},
    ],
)
print(response.choices[0].message.content)

Now whenever you update the prompt in Prompt Hub, your app picks up the change the next time it pulls the prompt.
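Pulling the prompt on every request adds a network call to each response. One possible design, sketched below, is to cache the prompt for a short time and keep serving the last known version if a refresh fails. The `CachedPrompt` class is hypothetical and not part of the Arize SDK; `fetch` is any zero-argument callable you supply, for example a function that wraps the `get_prompt` call shown above and returns the system prompt text.

```python
import time


class CachedPrompt:
    """Cache a prompt for `ttl_seconds`, falling back to the last good value."""

    def __init__(self, fetch, fallback, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch          # zero-argument callable returning prompt text
        self._value = fallback       # served until the first successful fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._fetched_at = None      # time of the last successful fetch

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                # Refresh failed: keep the previous value and retry on the next call.
                pass
        return self._value


# Demo with a stand-in fetch function instead of a real Prompt Hub call.
versions = iter(["prompt v1", "prompt v2"])
cached = CachedPrompt(lambda: next(versions), fallback="bundled default", ttl_seconds=60)
print(cached.get())  # first call fetches
print(cached.get())  # second call is served from the cache
```

With this approach, a prompt update reaches the app within one TTL window rather than instantly, which is the trade-off for fewer network calls.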

Congratulations!

You’ve completed the full development loop:

Traced your app to see what’s happening inside it.
Evaluated responses automatically to measure quality.
Improved your prompt using real failure data in the Playground.
Proved the improvement works across a representative dataset with experiments.

You now have a repeatable, data-driven process for improving your LLM application. Instead of guessing or hoping, you can measure quality and demonstrate improvement. Next up: connect the AI coding agent you use every day (Cursor, Claude Code, Copilot, and others) to Arize AX. You can paste a single setup URL, install skills, or wire up MCP so that tracing, evals, and experiments stay part of how you build.
