Datasets
Version-controlled examples to run your experiments
Datasets are the backbone of effective LLM experimentation, providing structured collections of examples for evaluation and iteration. They let you test models consistently across real-world scenarios and edge cases, quickly identify regressions, and track measurable improvements.
In Arize, datasets are fully integrated, allowing you to run experiments in the UI or programmatically via the SDK.
Golden Datasets: Compare Against Ideal Outputs
A golden dataset provides a consistent and trusted "ground truth" for LLM outputs. By meticulously hand-labeling ideal responses, you create a stable benchmark against which you can objectively measure and compare the performance of different models and prompt versions over time.
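As a rough illustration of comparing outputs against a golden dataset, the sketch below scores two prompt versions by normalized exact match. It is a minimal sketch, not the Arize evaluation API: the field names `input` and `ideal_output`, the helper `exact_match_rate`, and the second example row are all assumptions made for illustration, and exact match is only one of many possible comparison methods.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_rate(golden: list[dict], predictions: dict[str, str]) -> float:
    """Fraction of golden examples whose prediction matches the hand-labeled ideal output."""
    if not golden:
        return 0.0
    hits = sum(
        normalize(predictions.get(example["input"], "")) == normalize(example["ideal_output"])
        for example in golden
    )
    return hits / len(golden)


# Hypothetical golden dataset: each row pairs an input with its hand-labeled ideal output.
golden_dataset = [
    {"input": "do you have to have two license plates in ontario", "ideal_output": "True"},
    {"input": "What is 2 + 2?", "ideal_output": "4"},
]

# Hypothetical outputs produced by two prompt versions, keyed by input.
prompt_v1_outputs = {
    "do you have to have two license plates in ontario": "true ",
    "What is 2 + 2?": "5",
}
prompt_v2_outputs = {
    "do you have to have two license plates in ontario": "True",
    "What is 2 + 2?": "4",
}

score_v1 = exact_match_rate(golden_dataset, prompt_v1_outputs)
score_v2 = exact_match_rate(golden_dataset, prompt_v2_outputs)
print(f"prompt v1: {score_v1:.2f}, prompt v2: {score_v2:.2f}")
```

Because the golden dataset stays fixed, the two scores are directly comparable across prompt versions. For free-form answers, a semantic or LLM-based comparison would likely be more appropriate than exact match.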
Regression Datasets: Focus on Areas for Improvement
A regression dataset captures examples where your application previously failed or performed poorly. These datasets are crucial for ensuring that fixes and improvements persist over time and that previously resolved bugs are not reintroduced. Examples are often pulled from user feedback or from logs that show problematic behavior.
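One way to assemble such a dataset is to filter logged interactions for negative feedback or low evaluation scores. The sketch below assumes a hypothetical log schema (`input`, `output`, `user_feedback`, `eval_score`) and a hypothetical helper `build_regression_dataset`; none of these names come from Arize, and the log rows are made up for illustration.

```python
def build_regression_dataset(logs: list[dict], score_threshold: float = 0.5) -> list[dict]:
    """Keep logged interactions that users flagged or that scored below the threshold."""
    seen_inputs = set()
    dataset = []
    for record in logs:
        flagged = record.get("user_feedback") == "thumbs_down"
        # Records without a score are treated as passing (assumed default of 1.0).
        low_score = record.get("eval_score", 1.0) < score_threshold
        if not (flagged or low_score):
            continue
        if record["input"] in seen_inputs:
            continue  # keep only the first failure seen for each distinct input
        seen_inputs.add(record["input"])
        dataset.append({"input": record["input"], "failed_output": record["output"]})
    return dataset


# Hypothetical application logs.
logs = [
    {"input": "Cancel my order", "output": "I cannot help with that.", "user_feedback": "thumbs_down"},
    {"input": "Track my order", "output": "Your order ships tomorrow.", "eval_score": 0.9},
    {"input": "Refund policy?", "output": "", "eval_score": 0.1},
    {"input": "Cancel my order", "output": "Please contact support.", "eval_score": 0.2},
]

regression_dataset = build_regression_dataset(logs)
for example in regression_dataset:
    print(example)
```

Storing the failed output alongside the input makes it easier to confirm later that a fix actually changed the behavior that caused the failure.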
Flexible Dataset Format
Arize supports flexible dataset formats so you can structure data in the way that best fits your LLM application:
1. Key-Value Pairs: Flexible for multi-input/multi-output tasks such as function calls, agents, or classification, ensuring complex workflows can be tested consistently.
"What is Paul Graham known for?"
"Paul Graham is an investor, entrepreneur, and computer scientist known for..."
"Paul Graham is known for co-founding Y Combinator..."
2. Prompt-Completion (String Pairs): Simple format for validating single-turn completions, making it easy to measure correctness against expected outputs.
Input:
"do you have to have two license plates in ontario"
Output:
"True"
3. Messages or Chat Format: Purpose-built for conversational agents, allowing you to evaluate multi-turn interactions in context.
Input:
{"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]}
Output:
{"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]}
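To make the three formats concrete, the sketch below writes one example of each as a plain Python record and tells them apart by the shape of the input. The wrapper keys `input` and `output`, the field names `question`, `context`, and `answer`, and the helper `describe_format` are illustrative assumptions, not a schema required by Arize.

```python
import json

# 1. Key-value pairs: several named inputs and outputs (field names are assumed).
key_value_example = {
    "input": {
        "question": "What is Paul Graham known for?",
        "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for...",
    },
    "output": {"answer": "Paul Graham is known for co-founding Y Combinator..."},
}

# 2. Prompt-completion: a single input string paired with a single output string.
string_pair_example = {
    "input": "do you have to have two license plates in ontario",
    "output": "True",
}

# 3. Chat format: lists of role/content messages for input and output.
chat_example = {
    "input": {"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]},
    "output": {"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]},
}


def describe_format(example: dict) -> str:
    """Classify an example by the shape of its input (hypothetical helper)."""
    example_input = example["input"]
    if isinstance(example_input, str):
        return "prompt-completion"
    if isinstance(example_input, dict) and "messages" in example_input:
        return "chat"
    return "key-value"


for example in (key_value_example, string_pair_example, chat_example):
    print(describe_format(example), json.dumps(example))
```

Keeping every example in one format per dataset generally makes it simpler to write a single task function and evaluator for an experiment.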