Aligning LLM Evals with Human Feedback (TypeScript)
In this tutorial, we’ll run a Mastra agent and build a custom evaluator for it. The goal is to understand the workflow for creating evaluators that align with specific use cases.
Instead of relying only on the pre-built evaluators in Phoenix, which are tested on general benchmark datasets but may miss the nuances of your application, we'll show you how to build your own. We'll run a Mastra agent, capture its traces, and then run evaluations on those traces. Using a small set of human-annotated examples as our ground truth, we'll identify where the evaluator falls short. From there, we'll refine the evaluation prompt and repeat the cycle until the evaluator's outputs align with the human annotations.

This iterative loop (run agent → gather traces → evaluate → refine) ensures your evaluator evolves to match the exact requirements of your application.
Grab the Mastra agent traces from Phoenix and format them into dataset examples. In this example, we’ll extract the user query, the tool calls, and the agent’s final response. Once formatted, we’ll upload this dataset back into Phoenix for evaluation.
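The extraction step can be sketched as a plain mapping function. The span and example shapes below are illustrative assumptions, not the exact Phoenix span schema; adapt the field names to whatever your trace export actually contains.

```typescript
// Simplified shape of an agent trace span pulled from Phoenix
// (field names are illustrative assumptions, not the Phoenix schema).
interface AgentSpan {
  query: string;
  toolCalls: { name: string; args: unknown; output: unknown }[];
  finalResponse: string;
}

// A dataset example: input, output, and metadata for later annotation.
interface DatasetExample {
  input: { query: string };
  output: { response: string };
  metadata: { tool_calls: string };
}

// Flatten each trace into a dataset example, serializing the tool calls
// so the evaluator prompt can interpolate them as text.
function spansToExamples(spans: AgentSpan[]): DatasetExample[] {
  return spans.map((span) => ({
    input: { query: span.query },
    output: { response: span.finalResponse },
    metadata: { tool_calls: JSON.stringify(span.toolCalls) },
  }));
}
```

Once the examples are in this shape, you can upload them to Phoenix as a dataset using the Phoenix client or REST API.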
Next, we need human annotations to serve as ground truth for evaluation. To do this, we'll add an annotation field in the metadata of each dataset example. This way, every example includes a reference label that our evaluator outputs can be compared against.

In this example, we'll evaluate how well the agent's final response aligns with the tool calls and their outputs. We'll use three labels for evaluation: aligned, partially_aligned, and misaligned. You can adapt this setup to other evaluation criteria as needed.
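Attaching the ground-truth label is a small metadata update per example. A minimal sketch, assuming the example shape from the extraction step (the `annotate` helper and `AnnotatedExample` type are ours, not a Phoenix API):

```typescript
// The three labels used as ground truth in this tutorial.
type AlignmentLabel = "aligned" | "partially_aligned" | "misaligned";

interface AnnotatedExample {
  input: { query: string };
  output: { response: string };
  metadata: { annotation: AlignmentLabel; [key: string]: unknown };
}

// Attach a human-provided ground-truth label to an example's metadata,
// preserving any metadata fields that are already present.
function annotate(
  example: {
    input: { query: string };
    output: { response: string };
    metadata?: Record<string, unknown>;
  },
  label: AlignmentLabel
): AnnotatedExample {
  return {
    ...example,
    metadata: { ...example.metadata, annotation: label },
  };
}
```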
Now we'll start with a basic evaluation prompt and improve it iteratively. The workflow looks like this:

1. Run the evaluator.
2. Inspect the outputs and experiment results.
3. Update the evaluation prompt based on what's lacking.
4. Repeat until performance improves.

We'll use Phoenix experiments to identify weaknesses in the evaluator, review explanations, and track performance changes over time. In this tutorial, we'll go through two improvement cycles, but you can extend this process with more iterations to fine-tune the evaluator further.
```typescript
const evalPromptTemplateV1 = `You are evaluating whether the agent's final response matches the tool outputs.

DATA:
- Query: {{query}}
- Tool Outputs & Response: {{data}}

Choose one label:
- "aligned"
- "partially_aligned"
- "misaligned"

Output only the label.`;
```
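The template uses `{{query}}` and `{{data}}` placeholders. One simple way to interpolate them before sending the prompt to the judge model (this regex-based helper is our own sketch, not a Phoenix or Mastra utility):

```typescript
// Fill {{placeholder}} slots in an evaluation prompt template.
// Unknown placeholders are left untouched so mistakes are visible.
function fillTemplate(
  template: string,
  vars: Record<string, string>
): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in vars ? vars[key] : match
  );
}
```

For each dataset example, pass the user query and the serialized tool calls plus final response as `query` and `data`, then send the filled prompt to your judge LLM.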
After observing results in Phoenix, you can make improvements to your evaluation prompt:
```typescript
const evalPromptTemplateV2 = `You are evaluating how well an agent's FINAL RESPONSE aligns with the TOOL OUTPUTS it used.

You will be given:
- The original user query
- The agent's final response
- The tool outputs produced by the agent

QUERY:
{{query}}

TOOL + RESPONSE DATA:
{{data}}

Choose exactly ONE label:

- "aligned" → The final response is fully supported by the tool outputs.
  * Every piece of information in the response can be traced back to the tool calls.
  * There are no additions, fabrications, or contradictions.

- "partially_aligned" → The final response mixes correct tool-based information with extra or inconsistent details.
  * Some information in the response comes from tool outputs, but other parts are missing, fabricated, or inconsistent.
  * The response is only partially grounded in the tool calls.

- "misaligned" → The final response ignores, contradicts, or invents information unrelated to the tool outputs.
  * The tool outputs do not support the response at all, or the response is in direct conflict with them.

Guidelines:
- Focus strictly on whether the content in the final response is supported by the tool outputs.
- Do not reward fluent language or style; only check alignment.
- Provide a short explanation justifying the label.

Your output must contain only one of these labels:
aligned, partially_aligned, or misaligned.`;
```
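To decide whether a prompt revision actually helped, compare the evaluator's labels against the human annotations across the dataset. A minimal exact-match agreement score (the function is our sketch; Phoenix experiments surface similar metrics in the UI):

```typescript
// Fraction of examples where the evaluator's label matches the human
// annotation; track this number across prompt iterations.
function agreement(evalLabels: string[], humanLabels: string[]): number {
  if (evalLabels.length !== humanLabels.length || evalLabels.length === 0) {
    throw new Error("label arrays must be the same non-zero length");
  }
  const matches = evalLabels.filter(
    (label, i) => label === humanLabels[i]
  ).length;
  return matches / evalLabels.length;
}
```

If agreement plateaus, inspect the disagreeing examples in Phoenix, read the evaluator's explanations, and fold what you learn into the next version of the prompt.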