Create a dataset

There are four ways to load data into a dataset:

Create a dataset from CSV

You can upload a CSV file as a dataset in Arize. The columns in the file can then be accessed in experiments or in the prompt playground.
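
If you don't have a CSV on hand, here is a minimal sketch of producing one with Python's standard library. The column names (`input`, `output`) and the file name are hypothetical and only for illustration, and the sketch assumes the first row of the file holds the column names.

```python
import csv

# Hypothetical column names; use whatever columns your experiments need.
rows = [
    {"input": "Telephone", "output": "Alexander Graham Bell"},
    {"input": "Light Bulb", "output": "Thomas Edison"},
]

# Hypothetical file name. newline="" is what the csv module expects when writing.
with open("inventions_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()    # first row: column names
    writer.writerows(rows)  # one line per example
```

The resulting file can then be uploaded through the Arize UI.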


Create a dataset from your spans

Arize supports adding spans from your projects to datasets. Trace data from an application with errors or poor eval results can become test cases for ongoing development. You can use our tracing filters or ✨AI search to curate your dataset.


Create a dataset with code

If you'd like to create your datasets programmatically, you can use our clients to create, update, and delete datasets.

To start, let's install the packages we need:

pip install "arize[Datasets]" pandas

You can get your API key by navigating to the "Settings" page.
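
Rather than pasting the key into your script, you may prefer to read it from the environment. The following is a minimal sketch, not part of the Arize client: the helper `load_arize_credentials` and the environment variable names `ARIZE_API_KEY` and `ARIZE_SPACE_ID` are assumptions chosen to match the variable names used in the snippets on this page.

```python
import os


def load_arize_credentials(env=os.environ):
    """Return (api_key, space_id) read from environment-style mapping `env`.

    Hypothetical helper: the variable names below are assumptions,
    not names required by Arize.
    """
    api_key = env.get("ARIZE_API_KEY")
    space_id = env.get("ARIZE_SPACE_ID")
    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (("ARIZE_API_KEY", api_key), ("ARIZE_SPACE_ID", space_id))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return api_key, space_id
```

Calling `ARIZE_API_KEY, ARIZE_SPACE_ID = load_arize_credentials()` at the top of a script fails early with a clear message if either value is not set.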

Let's set up the Arize Dataset Client to create or update a dataset. See the API reference for details.

from arize.experimental.datasets import ArizeDatasetsClient

# ARIZE_API_KEY must hold the API key from the "Settings" page
client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)

You can create many different kinds of datasets. The examples below are sorted by complexity.

This is a simple dataset with just string values for the columns.

import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE

# Example dataset
inventions_dataset = pd.DataFrame({
    "attributes.input.value": ["Telephone", "Light Bulb"],
    "attributes.output.value": ["Alexander Graham Bell", "Thomas Edison"],
})


# ARIZE_SPACE_ID must hold the ID of your Arize space
dataset_id = client.create_dataset(
    space_id=ARIZE_SPACE_ID,
    dataset_name="test_invention_dataset",
    dataset_type=GENERATIVE,
    data=inventions_dataset,
)

Create a synthetic dataset

In some cases, the data you have might not be enough to cover all the scenarios you want to test. This is where you can use LLMs to generate examples for you. Here's an example prompt you can try; the generated data can then be uploaded to Arize.

Create a CSV of 20 test cases with the following columns:

Input: The name of the invention (ex: "Telephone").
Prompt Variables: A JSON string containing metadata about the invention, such as:
    "invention_name": The name of the invention
    "year": Year the invention was created or patented
    "country": Country of origin
    "category": Field or type of invention (ex: "Communication", "Medicine", "Transportation")
    "source_url": URL to a reliable source or article about the invention
Output: The name of the inventor (ex: "Alexander Graham Bell")

Example preview of generated CSV:

"Telephone","{""invention_name"": ""Telephone"", ""year"": 1876, ""country"": ""United States"", ""category"": ""Communication"", ""source_url"": ""https://en.wikipedia.org/wiki/Telephone""}","Alexander Graham Bell"
"Light Bulb","{""invention_name"": ""Light Bulb"", ""year"": 1879, ""country"": ""United States"", ""category"": ""Electricity"", ""source_url"": ""https://en.wikipedia.org/wiki/Incandescent_light_bulb""}","Thomas Edison"
"Airplane","{""invention_name"": ""Airplane"", ""year"": 1903, ""country"": ""United States"", ""category"": ""Transportation"", ""source_url"": ""https://en.wikipedia.org/wiki/Wright_brothers""}","Wright Brothers"
"Printing Press","{""invention_name"": ""Printing Press"", ""year"": 1440, ""country"": ""Germany"", ""category"": ""Communication"", ""source_url"": ""https://en.wikipedia.org/wiki/Printing_press""}","Johannes Gutenberg"
...

Coming soon, you'll be able to do this directly in the Arize platform based on your traces and prompts, but in the interim, you can upload this data with code or CSV.
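
Since LLM output can be malformed, it may be worth checking the generated rows before uploading them. Below is a minimal sketch using only the standard library. It assumes the generated CSV has no header row, as in the preview above; the column names in `COLUMNS` and the helper `parse_generated_rows` are hypothetical.

```python
import csv
import io
import json

# Hypothetical column names for the three generated fields.
COLUMNS = ["input", "prompt_variables", "output"]
# Metadata keys requested in the prompt above.
REQUIRED_KEYS = {"invention_name", "year", "country", "category", "source_url"}

# Two rows copied from the preview above.
SAMPLE_CSV = '''"Telephone","{""invention_name"": ""Telephone"", ""year"": 1876, ""country"": ""United States"", ""category"": ""Communication"", ""source_url"": ""https://en.wikipedia.org/wiki/Telephone""}","Alexander Graham Bell"
"Light Bulb","{""invention_name"": ""Light Bulb"", ""year"": 1879, ""country"": ""United States"", ""category"": ""Electricity"", ""source_url"": ""https://en.wikipedia.org/wiki/Incandescent_light_bulb""}","Thomas Edison"
'''


def parse_generated_rows(text):
    """Parse header-less CSV text and check each row's JSON metadata."""
    rows = []
    for line_no, fields in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(f"row {line_no}: expected {len(COLUMNS)} fields, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        variables = json.loads(row["prompt_variables"])  # raises on invalid JSON
        missing = REQUIRED_KEYS - set(variables)
        if missing:
            raise ValueError(f"row {line_no}: missing keys {sorted(missing)}")
        rows.append(row)
    return rows


rows = parse_generated_rows(SAMPLE_CSV)
```

The validated `rows` list can be turned into a DataFrame with `pd.DataFrame(rows)` if you want to upload it with code, renaming the columns to whatever your dataset expects.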
