Model Comparison for an Email Text Extraction Service

In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that accurately extracts key details from your emails. You will:

  • Upload a dataset of examples containing emails to Arize

  • Define an experiment task that extracts and formats the key details from those emails

  • Define an evaluator that measures Jaro-Winkler similarity

  • Run experiments to iterate on your prompt template and to compare the outputs produced by different LLMs


Notebook Walkthrough

We will go through key code snippets on this page. To follow the full tutorial, check out the notebook.

Experiments in Arize

Experiments are made up of three elements: a dataset, a task, and an evaluator. The dataset is a collection of inputs and expected outputs that we'll evaluate against. The task is an operation performed on each input. Finally, the evaluator compares the task's result against the expected output.

For this example, here's what each looks like:

  • Dataset - a dataframe of emails to analyze, along with the expected outputs for each one

  • Task - a LangChain extraction chain that pulls the key details out of each input email. The result of this task is then compared against the expected output

  • Eval - a Jaro-Winkler similarity calculation comparing the task's output with the expected output
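
To make this division of labor concrete, here is a rough sketch of the control flow an experiment boils down to. This is purely illustrative and not Arize's actual implementation; run_experiment handles this loop for you later in the tutorial.

# Conceptual sketch only: Arize's run_experiment performs this loop internally.
def run_experiment_sketch(dataset, task, evaluator):
    results = []
    for row in dataset:                  # each row holds inputs + expected outputs
        output = task(row)               # run the operation under test
        score = evaluator(row, output)   # compare the result to the expected output
        results.append({"row": row, "output": output, "score": score})
    return results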

Download JSON Data

We've prepared some example emails and their expected responses that we can use to evaluate our two models. Let's download those and save them to a temporary file.

import tempfile

import pandas as pd
from langchain_benchmarks import download_public_dataset, registry

dataset_name = "Email Extraction"

with tempfile.NamedTemporaryFile(suffix=".json") as f:
    download_public_dataset(registry[dataset_name].dataset_id, path=f.name)
    df = pd.read_json(f.name)[["inputs", "outputs"]]
df = df.sample(10, random_state=42)  # keep a small, reproducible sample
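
If you want to see what a single example looks like before uploading it, you can inspect one of the sampled rows. The exact structure of the inputs and outputs columns comes from the registry's Email Extraction task, so treat the snippet below as an optional sanity check:

# Optional: inspect one sampled example to see the raw email payload and the
# expected extraction it should map to.
example_row = df.iloc[0]
print(example_row["inputs"])
print(example_row["outputs"])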

Upload Dataset to Arize

from datetime import datetime, timezone

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

arize_client = ArizeDatasetsClient(api_key=API_KEY)

# df_formatted is the sampled dataframe prepared for upload in the notebook.
dataset_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=f"{dataset_name}{datetime.now(timezone.utc)}",
    dataset_type=GENERATIVE,
    data=df_formatted,
)
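
create_dataset returns the ID of the newly created dataset; we keep a reference to it because run_experiment below needs it. A quick confirmation that the upload succeeded:

# The returned ID identifies the uploaded dataset in your Arize space.
print(f"Created dataset with ID: {dataset_id}")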

Set Up LangChain

Now we'll set up our LangChain extraction chain. It is a straightforward chain that calls the specified model, constrains it to the registry task's extraction schema, and parses the response into JSON.

model = "gpt-4o"  # Using gpt-4o for the first experiment

llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)
output_parser = JsonOutputFunctionsParser()
extraction_chain = registry[dataset_name].instructions | llm | output_parser
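
If you're curious what the model is being asked to produce, the registry task's schema (a Pydantic model) exposes its JSON schema. The exact field names come from langchain-benchmarks' Email Extraction task, so the snippet below is just an illustrative peek:

# Peek at the extraction schema the LLM is constrained to fill in.
email_schema = registry[dataset_name].schema.schema()
print(email_schema["title"])
print(list(email_schema["properties"].keys()))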

Define Task Function

def task(example) -> dict:
    # Each dataset row exposes the email payload under "inputs".
    email_input = example.get("inputs")
    return extraction_chain.invoke(email_input)
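
Before handing the task to Arize, a quick local smoke test on one of the downloaded rows can catch issues early. Here we reuse the sampled dataframe from the download step, which has the same inputs column the dataset rows expose:

# Optional smoke test: run the task on a single example and inspect the parsed output.
sample_row = df.iloc[0].to_dict()
print(task(sample_row))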

Define Evaluator

Next, we need to define our evaluation function. Here we'll use a Jaro-Winkler similarity function that scores how similar the output and expected text are. Jaro-Winkler similarity is a string-similarity measure (an edit-distance-style metric) that ranges from 0 for completely dissimilar strings to 1 for an exact match.

import json
import jarowinkler

def jarowinkler_similarity(dataset_row, output) -> float:
    # Serialize both outputs deterministically and score their similarity (1.0 = identical).
    return jarowinkler.jarowinkler_similarity(
        json.dumps(dataset_row["outputs"], sort_keys=True),
        json.dumps(output, sort_keys=True),
    )
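
To build intuition for the metric, you can score a couple of hand-picked strings. The exact values depend on the inputs, but near-identical strings land close to 1.0 while unrelated strings score noticeably lower:

# Illustrative only: compare a near-match and a non-match.
print(jarowinkler.jarowinkler_similarity("Invoice #1234 due March 3", "Invoice #1243 due March 3"))
print(jarowinkler.jarowinkler_similarity("Invoice #1234 due March 3", "Team offsite moved to Friday"))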

Run Experiment

Now we're ready to run our experiment. We'll specify our space id, dataset id, task, evaluator, and experiment name in order to generate and evaluate responses.

experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[jarowinkler_similarity],
    experiment_name="email_text_extraction_gpt-4o_experiment",
)

Re-run with GPT-3.5 Turbo and Compare Results

To compare results with another model, we only need to rebuild the extraction chain with the new model. Our dataset, task function, and evaluator stay the same.

model = "gpt-3.5-turbo"

llm = ChatOpenAI(model=model).bind_functions(
    functions=[registry[dataset_name].schema],
    function_call=registry[dataset_name].schema.schema()["title"],
)
extraction_chain = registry[dataset_name].instructions | llm | output_parser
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[jarowinkler_similarity],
    experiment_name="email_text_extraction_gpt-35-turbo_experiment",
)

View Results

Now, if you check your Arize experiments, you can compare Jaro-Winkler scores on a per-query basis and view aggregate model performance results.

The first screenshot below shows a comparison between the average Jaro-Winkler scores for the two experiments we ran.

The second screenshot shows a detailed view of each row's individual Jaro-Winkler score for both experiments. The experiment with GPT-4o is on the left (experiment #1) and the experiment with GPT-3.5 Turbo is on the right (experiment #2). The higher the Jaro-Winkler similarity score, the closer the output is to the expected value.

You should see that GPT-4o outperforms its older cousin.
