Building effective text-to-SQL systems requires rigorous evaluation and systematic experimentation. In this tutorial, we’ll walk through the complete evaluation-driven development process, starting from scratch without pre-existing datasets of questions or expected responses.
Use this file to discover all available pages before exploring further.
We’ll use a movie database containing recent titles, ratings, box office performance, and metadata to demonstrate how to build, evaluate, and systematically improve a text-to-SQL system using Phoenix’s experimentation framework. Think of Phoenix as your scientific laboratory, meticulously recording every experiment to help you build better AI systems.
Let’s first start a phoenix server to act as our evaluation dashboard and experiment tracker. This will be our central hub for observing, measuring, and improving our text-to-SQL system.Note: this step is not necessary if you already have a Phoenix server running.
import phoenix as pxfrom phoenix.client import Clientpx_client = Client()px.launch_app().view()
Let’s also setup tracing for OpenAI. Tracing is crucial for evaluation-driven development - it allows Phoenix to observe every step of our text-to-SQL pipeline, capturing inputs, outputs, and metrics like latency and cost that we’ll use to systematically improve our system.
from phoenix.otel import registertracer_provider = register( endpoint="http://localhost:6006/v1/traces", auto_instrument=True, verbose=False) # Instruments all OpenAI callstracer = tracer_provider.get_tracer(__name__)
Let’s make sure we can run async code in the notebook.
import nest_asyncionest_asyncio.apply()
Lastly, let’s make sure we have our OpenAI API key set up.
import osfrom getpass import getpassif not os.getenv("OPENAI_API_KEY"): os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")
We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data.
import duckdbfrom datasets import load_datasetdata = load_dataset("wykonos/movies")["train"]conn = duckdb.connect(database=":memory:", read_only=False)conn.register("movies", data.to_pandas())records = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")for record in records: print(record)
Let’s start by implementing a simple text2sql logic.
import osimport openaiclient = openai.AsyncClient()columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")# We will use GPT-4o to startTASK_MODEL = "gpt-4o"CONFIG = {"model": TASK_MODEL}system_prompt = ( "You are a SQL expert, and you are given a single table named movies with the following columns:\n" f'{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n' "Write a SQL query corresponding to the user's request. Return just the query text, " "with no formatting (backticks, markdown, etc.).")@tracer.chainasync def generate_query(input): response = await client.chat.completions.create( model=TASK_MODEL, temperature=0, messages=[ { "role": "system", "content": system_prompt, }, { "role": "user", "content": input, }, ], ) return response.choices[0].message.content
query = await generate_query("what was the most popular movie?")print(query)
Awesome, looks like the we are producing SQL! let’s try running the query and see if we get the expected results.
Effective AI evaluation rests on three fundamental pillars:
Data: Curated examples that represent real-world use cases
Task: The actual function or workflow being evaluated
Evaluators: Quantitative measures of performance
Let’s start by creating our data - a set of movie-related questions that we want our text-to-SQL system to handle correctly.
questions = [ "Which Brad Pitt movie received the highest rating?", "What is the top grossing Marvel movie?", "What foreign-language fantasy movie was the most popular?", "what are the best sci-fi movies of 2017?", "What anime topped the box office in the 2010s?", "Recommend a romcom that stars Paul Rudd.",]
Let’s store the data above as a versioned dataset in phoenix.
import pandas as pdds = px_client.datasets.create_dataset( name="movie-example-questions", dataframe=pd.DataFrame([{"question": question} for question in questions]), input_keys=["question"], output_keys=[],)# If you have already uploaded the dataset, you can fetch it using the following line# ds = px_client.datasets.get_dataset(dataset="movie-example-questions")
Next, we’ll define the task. The task is to generate SQL queries from natural language questions.
Finally, we’ll define the evaluation scores. We’ll use the following simple functions to see if the generated SQL queries are correct. Note that has_results is a good metric here because we know that all the questions we added to the dataset can be answered via SQL.
# Test if there are no sql execution errorsdef no_error(output): return 1.0 if output.get("error") is None else 0.0# Test if the query has resultsdef has_results(output): results = output.get("results") has_results = results is not None and len(results) > 0 return 1.0 if has_results else 0.0
Now let’s run the evaluation experiment.
import phoenix as pxfrom phoenix.client.experiments import run_experiment# Define the task to run text2sql on the input questiondef task(input): return text2sql(input["question"])experiment = run_experiment( dataset=ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG)
Great! Let’s see how our baseline model performed on the movie questions. We can analyze both successful queries and any failures to understand where improvements are needed.
Now that we ran the initial evaluation, let’s analyze what might be causing any failures.From looking at the query where there are no results, genre-related queries might fail because the model doesn’t know how genres are stored (e.g., “Sci-Fi” vs “Science Fiction”)These types of issues would probably be improved by showing a sample of the data to the model (few-shot examples) since the data will show the LLM what is queryable.Let’s try to improve the prompt with few-shot examples and see if we can get better results.
samples = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")example_row = "\n".join( f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}" for column in columns)column_header = " | ".join(column["column_name"] for column in columns)few_shot_examples = "\n".join( " | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples)system_prompt = ( "You are a SQL expert, and you are given a single table named `movies` with the following columns:\n\n" "Column | Type | Example\n" "-------|------|--------\n" f"{example_row}\n" "\n" "Examples:\n" f"{column_header}\n" f"{few_shot_examples}\n" "\n" "Write a DuckDB SQL query corresponding to the user's request. " "Return just the query text, with no formatting (backticks, markdown, etc.).")async def generate_query(input): response = await client.chat.completions.create( model=TASK_MODEL, temperature=0, messages=[ { "role": "system", "content": system_prompt, }, { "role": "user", "content": input, }, ], ) return response.choices[0].message.contentprint(await generate_query("what are the best sci-fi movies in the 2000s?"))
Looking much better! Finally, let’s add a scoring function that compares the results, if they exist, with the expected results.
Amazing. It looks like the LLM is generating a valid query for all questions. Let’s try out using LLM as a judge to see how well it can assess the results.
import jsonfrom openai import OpenAIfrom phoenix.client.experiments import evaluate_experimentfrom phoenix.client.experiments import create_evaluatorfrom phoenix.client.resources.experiments.types import ExperimentEvaluation as EvaluationResultopenai_client = OpenAI()judge_instructions = """You are a judge that determines if a given question can be answered with the provided SQL query and results.Make sure to ensure that the SQL query maps to the question accurately.Provide the label `correct` if the SQL query and results accurately answer the question.Provide the label `invalid` if the SQL query does not map to the question or is not valid."""@create_evaluator(name="qa_correctness", kind="llm")def qa_correctness(input, output): question = input.get("question") query = output.get("query") results = output.get("results") response = openai_client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": judge_instructions}, { "role": "user", "content": f"Question: {question}\nSQL Query: {query}\nSQL Results: {results}", }, ], tool_choice="required", tools=[ { "type": "function", "function": { "name": "qa_correctness", "description": "Determine if the SQL query and results accurately answer the question.", "parameters": { "type": "object", "properties": { "explanation": { "type": "string", "description": "Explain why the label is correct or invalid.", }, "label": {"type": "string", "enum": ["correct", "invalid"]}, }, }, }, } ], ) if response.choices[0].message.tool_calls is None: raise ValueError("No tool call found in response") args = json.loads(response.choices[0].message.tool_calls[0].function.arguments) label = args["label"] explanation = args["explanation"] score = 1 if label == "correct" else 0 return EvaluationResult(score=score, label=label, explanation=explanation)evaluate_experiment(experiment=experiment, evaluators=[qa_correctness])
The LLM judge’s scoring closely matches our manual evaluation, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement.The LLM judge also shows an advantage in nuanced understanding - for example, it correctly identifies that ‘Anime’ and ‘Animation’ are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation.We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in Golden Dataset for regression testing as well.
Let’s generate some training data by having the model describe existing SQL queries from our dataset
import jsonfrom typing import Listfrom pydantic import BaseModelclass Question(BaseModel): sql: str question: strclass Questions(BaseModel): questions: List[Question]sample_rows = "\n".join( f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}" for column in columns)synthetic_data_prompt = f"""You are a SQL expert, and you are given a single table named movies with the following columns:Column | Type | Example-------|------|--------{sample_rows}Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as thequestion that the query answers. Keep the questions bounded so that they are not too broad or too narrow."""response = await client.chat.completions.create( model="gpt-4o", temperature=0, messages=[ { "role": "user", "content": synthetic_data_prompt, } ], tools=[ { "type": "function", "function": { "name": "generate_questions", "description": "Generate SQL queries that would be interesting to ask about this table.", "parameters": Questions.model_json_schema(), }, } ], tool_choice={"type": "function", "function": {"name": "generate_questions"}},)assert response.choices[0].message.tool_calls is not Nonegenerated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[ "questions"]print("Generated N questions: ", len(generated_questions))print("First question: ", generated_questions[0])
generated_dataset = []for q in generated_questions: try: result = execute_query(q["sql"]) example = { "input": q["question"], "expected": { "results": result or [], "query": q["sql"], }, "metadata": { "category": "Generated", }, } print(example) generated_dataset.append(example) except duckdb.Error as e: print(f"Query failed: {q['sql']}", e) print("Skipping...")generated_dataset[0]
Awesome, let’s create a dataset with the new synthetic data.
synthetic_dataset = px_client.datasets.create_dataset( name="movies-golden-synthetic", inputs=[{"question": example["input"]} for example in generated_dataset], outputs=[example["expected"] for example in generated_dataset],);
In this tutorial, we built a text-to-SQL system for querying movie data. We started with basic examples and evaluators, then improved performance by adding few-shot examples as well as using an LLM judge for evaluation.Key takeaways:
Start with simple evaluators to catch basic issues
Use few-shot examples to improve accuracy
Generate more training data using LLMs
Track progress with Phoenix’s experiments
You can further improve this system by adding better evaluators or handling edge cases.