Datasets

Version-controlled examples for running your experiments

Datasets are the backbone of effective LLM experimentation, providing structured collections of examples for evaluation and iteration. They let you test models consistently across real-world scenarios and edge cases, quickly identify regressions, and track measurable improvements.

In Arize, datasets are fully integrated, allowing you to run experiments in the UI or programmatically via the SDK.

Golden Datasets: Compare Against Ideal Outputs

A golden dataset provides a consistent, trusted "ground truth" for LLM outputs. By hand-labeling ideal responses, you create a stable benchmark for objectively measuring and comparing the performance of different models and prompt versions over time.
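As a rough illustration of comparing prompt versions against a golden dataset, the sketch below scores two sets of outputs by normalized exact match. The example data, the version names, and the `normalize` and `accuracy` helpers are all hypothetical and are not part of the Arize SDK; real evaluations often need a more forgiving metric than exact match.

```python
# Hypothetical golden dataset: each example pairs an input with a hand-labeled ideal output.
golden = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]

# Hypothetical outputs from two prompt versions, in the same order as `golden`.
outputs_by_version = {
    "prompt_v1": ["4", "paris", "warm"],
    "prompt_v2": ["4", "Paris", "cold"],
}


def normalize(text: str) -> str:
    """Lowercase and trim so trivial formatting differences are not counted as errors."""
    return text.strip().lower()


def accuracy(outputs: list, examples: list) -> float:
    """Fraction of outputs that match the golden label after normalization."""
    if len(outputs) != len(examples):
        raise ValueError("outputs and examples must have the same length")
    matches = sum(
        normalize(out) == normalize(ex["expected"])
        for out, ex in zip(outputs, examples)
    )
    return matches / len(examples)


# One score per prompt version, all measured against the same golden labels.
scores = {name: accuracy(outs, golden) for name, outs in outputs_by_version.items()}
print(scores)
```

Because every version is scored against the same fixed labels, a change in score reflects a change in the model or prompt rather than a change in the test set.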

Regression Datasets: Focus on Areas for Improvement

A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes and improvements persist over time and that later changes don’t reintroduce old bugs. Examples are often pulled from user feedback or from logs that show problematic behavior.
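One way to assemble such a dataset is to filter logged interactions for signs of failure. The sketch below assumes a simple log record shape; the field names (`user_feedback`, `failed_output`) and the failure criteria are illustrative assumptions, not an Arize schema.

```python
# Hypothetical log records; the field names are assumptions for illustration.
logs = [
    {"input": "cancel my order", "output": "Sure, order placed!", "user_feedback": "thumbs_down"},
    {"input": "what is your refund policy", "output": "Refunds within 30 days.", "user_feedback": "thumbs_up"},
    {"input": "cancel my order", "output": "Order placed again!", "user_feedback": "thumbs_down"},
    {"input": "track my package", "output": "", "user_feedback": None},
]


def is_problematic(record: dict) -> bool:
    """Flag records with negative feedback or an empty response."""
    return record["user_feedback"] == "thumbs_down" or not record["output"].strip()


def build_regression_dataset(records: list) -> list:
    """Keep one example per distinct failing input, preserving first-seen order."""
    seen = set()
    dataset = []
    for record in records:
        if is_problematic(record) and record["input"] not in seen:
            seen.add(record["input"])
            dataset.append({"input": record["input"], "failed_output": record["output"]})
    return dataset


regression_dataset = build_regression_dataset(logs)
for example in regression_dataset:
    print(example)
```

Deduplicating by input keeps the dataset small, so each past failure is re-tested once per experiment run.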


Flexible Dataset Format

Arize supports flexible dataset formats so you can structure data in the way that best fits your LLM application:

1. Key-Value Pairs: Flexible for multi-input/multi-output tasks such as function calls, agents, or classification, ensuring complex workflows can be tested consistently.

Input: What is Paul Graham known for?
Context: "Paul Graham is an investor, entrepreneur, and computer scientist known for..."
Output: "Paul Graham is known for co-founding Y Combinator..."

2. Prompt-Completion (String Pairs): Simple format for validating single-turn completions, making it easy to measure correctness against expected outputs.

Input: "do you have to have two license plates in ontario"
Output: "True"

3. Messages or Chat Format: Purpose-built for conversational agents, allowing you to evaluate multi-turn interactions in context.

Input:
{"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]}
Output:
{"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]}
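To make the three formats concrete, the sketch below writes each example from this section as a plain Python dictionary and adds a small helper that guesses which format a row uses. The lowercase key names and the `is_chat_payload` and `classify_example` helpers are assumptions for illustration; they are not how Arize detects or stores formats.

```python
# 1. Key-value pairs: any number of named fields per example.
#    Text elided with "..." in the example above stays elided here.
key_value_example = {
    "input": "What is Paul Graham known for?",
    "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for...",
    "output": "Paul Graham is known for co-founding Y Combinator...",
}

# 2. Prompt-completion: a single input string and a single output string.
string_pair_example = {
    "input": "do you have to have two license plates in ontario",
    "output": "True",  # kept as a string, as in the example above
}

# 3. Chat format: input and output each hold a list of role/content messages.
chat_example = {
    "input": {"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]},
    "output": {"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]},
}


def is_chat_payload(value) -> bool:
    """True if value looks like {"messages": [{"role": str, "content": str}, ...]}."""
    if not isinstance(value, dict):
        return False
    messages = value.get("messages")
    return (
        isinstance(messages, list)
        and len(messages) > 0
        and all(
            isinstance(m, dict)
            and isinstance(m.get("role"), str)
            and isinstance(m.get("content"), str)
            for m in messages
        )
    )


def classify_example(example: dict) -> str:
    """Heuristic guess at which of the three formats an example follows."""
    if is_chat_payload(example.get("input")) and is_chat_payload(example.get("output")):
        return "chat"
    if set(example) == {"input", "output"} and all(isinstance(v, str) for v in example.values()):
        return "string_pair"
    return "key_value"


formats = [
    classify_example(e)
    for e in (key_value_example, string_pair_example, chat_example)
]
print(formats)
```

A check like this can be run over a dataset before an experiment to catch rows that mix formats, for instance a chat row whose output is a bare string.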

Learn More

Video tutorial

Learn more about evals

Read our evaluation concepts page
