0 means the two structures are identical; higher scores mean more fields drifted from the reference.
Reach for this when you have a golden dataset — examples paired with the exact JSON a correct model run should produce — and you want to know how close the output got, not just whether it was perfect. Typical cases:
- Structured extraction. The model pulls fields out of a document (invoice line items, contact records, form data) and you have hand-labeled JSON for each example. A binary match collapses “one wrong field” and “everything wrong” into the same score; distance tells them apart, which is what you want when tracking regressions across prompt or model changes.
- Tool call arguments. An agent emits a tool call whose `arguments` object should match a known-good payload. Per-field distance pinpoints whether the model is consistently dropping one argument vs. hallucinating a different shape entirely.
- Prompt-change A/B. You’re comparing two prompt versions against the same golden references. Mean distance moves smoothly as quality changes; mean exact-match doesn’t, because most diffs are partial.
`output == reference`. Use distance when partial credit matters.
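To make the A/B point concrete, here is a small sketch comparing mean exact-match with mean distance for two prompt versions. The data, the names `field_distance` and `summarize`, and the rule of counting differing top-level fields are all assumptions made for illustration; the evaluator's actual metric may be defined differently.

```python
from statistics import mean

_MISSING = object()  # sentinel for "key absent on this side"


def field_distance(output: dict, reference: dict) -> int:
    """Count top-level fields that are missing, extra, or different (assumed metric)."""
    keys = output.keys() | reference.keys()
    return sum(
        1 for k in keys
        if output.get(k, _MISSING) != reference.get(k, _MISSING)
    )


def summarize(outputs: list, references: list) -> tuple:
    """Return (exact-match rate, mean distance) over paired examples."""
    exact = mean(1.0 if o == r else 0.0 for o, r in zip(outputs, references))
    dist = mean(field_distance(o, r) for o, r in zip(outputs, references))
    return exact, dist


# Hypothetical golden references and two prompt versions' outputs.
references = [
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Lin", "email": "lin@example.com", "phone": "555-0101"},
]
prompt_a = [
    {"name": "Ada", "email": "ada@example.org", "phone": "555-0199"},  # 2 wrong
    {"name": "Lin", "email": "lin@example.com"},                       # 1 missing
]
prompt_b = [
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0199"},  # 1 wrong
    {"name": "Lin", "email": "lin@example.com", "phone": "555-0109"},  # 1 wrong
]

exact_a, dist_a = summarize(prompt_a, references)
exact_b, dist_b = summarize(prompt_b, references)
print(f"prompt A: exact-match={exact_a:.2f}, mean distance={dist_a:.2f}")
print(f"prompt B: exact-match={exact_b:.2f}, mean distance={dist_b:.2f}")
```

Both versions score 0.00 on exact match, so that metric cannot rank them, while mean distance drops from 1.50 to 1.00 and shows that version B is closer to the references.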
Code
- Python
- TypeScript
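As a minimal Python sketch of how such a distance could be computed, the function below walks both structures recursively and counts leaf fields that differ. The scoring rule (one point per differing leaf, arrays compared position by position, a missing subtree costing its full leaf count) and the names `json_distance`, `leaf_count`, and `evaluate` are assumptions for illustration, not the evaluator's confirmed implementation.

```python
import json
from typing import Any


def leaf_count(value: Any) -> int:
    """Number of scalar leaves in a JSON value; an empty container counts as 1."""
    if isinstance(value, dict):
        return max(1, sum(leaf_count(v) for v in value.values()))
    if isinstance(value, list):
        return max(1, sum(leaf_count(v) for v in value))
    return 1


def _same_scalar(a: Any, b: Any) -> bool:
    # Keep JSON true/false distinct from the numbers 1/0.
    if isinstance(a, bool) or isinstance(b, bool):
        return a is b
    return a == b


def json_distance(output: Any, reference: Any) -> int:
    """Count leaf-level differences between two parsed JSON values."""
    if isinstance(output, dict) and isinstance(reference, dict):
        total = 0
        for key in output.keys() | reference.keys():
            if key not in output:
                total += leaf_count(reference[key])   # field dropped
            elif key not in reference:
                total += leaf_count(output[key])      # field invented
            else:
                total += json_distance(output[key], reference[key])
        return total
    if isinstance(output, list) and isinstance(reference, list):
        total = sum(json_distance(o, r) for o, r in zip(output, reference))
        longer = output if len(output) > len(reference) else reference
        shared = min(len(output), len(reference))
        return total + sum(leaf_count(item) for item in longer[shared:])
    if isinstance(output, (dict, list)) or isinstance(reference, (dict, list)):
        # Shape mismatch (e.g. object vs. scalar): charge the larger side.
        return max(leaf_count(output), leaf_count(reference))
    return 0 if _same_scalar(output, reference) else 1


def evaluate(output: Any, reference: Any) -> int:
    """Accept either parsed values or JSON strings, then score them."""
    if isinstance(output, str):
        output = json.loads(output)
    if isinstance(reference, str):
        reference = json.loads(reference)
    return json_distance(output, reference)


reference = {"vendor": "Acme", "total": 120.5, "items": [{"sku": "A1", "qty": 2}]}
one_wrong = {"vendor": "Acme", "total": 125.0, "items": [{"sku": "A1", "qty": 2}]}
all_wrong = {"vendor": "Apex", "total": 0, "items": []}

print(evaluate(reference, reference))  # 0
print(evaluate(one_wrong, reference))  # 1
print(evaluate(all_wrong, reference))  # 4
```

Under these assumptions, "one wrong field" scores 1 and "everything wrong" scores 4, whereas an exact-match check would report the same failure for both.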
Input mapping
| Parameter | Bind to |
|---|---|
| `output` | The model output: usually `output`, or a nested path like `output.tool_calls[0].arguments` if the JSON lives inside a larger blob. |
| `reference` | The ground-truth JSON from your golden dataset, typically `reference`. |
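To illustrate what a nested binding such as `output.tool_calls[0].arguments` selects, the sketch below resolves a dotted path with bracketed indexes against a sample record. The `resolve_path` helper and the record layout are hypothetical and only mimic the idea; the platform's own path resolution may behave differently.

```python
import re
from typing import Any

# Matches either a plain key (no dots or brackets) or a bracketed list index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(record: Any, path: str) -> Any:
    """Follow a path like 'output.tool_calls[0].arguments' through nested data."""
    current = record
    for key, index in _TOKEN.findall(path):
        current = current[int(index)] if index else current[key]
    return current


# Hypothetical record: the JSON to score sits inside a larger model output.
record = {
    "output": {
        "text": "Creating the invoice now.",
        "tool_calls": [
            {
                "name": "create_invoice",
                "arguments": {"customer_id": "c_42", "amount": 120.5},
            }
        ],
    },
    "reference": {"customer_id": "c_42", "amount": 120.5},
}

bound_output = resolve_path(record, "output.tool_calls[0].arguments")
bound_reference = resolve_path(record, "reference")
print(bound_output == bound_reference)
```

Binding to the nested path means only the tool call's arguments are compared with the reference, not the surrounding text or tool name.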
Output configuration
Continuous score:

| Field | Value |
|---|---|
| Score range | 0 (identical) to unbounded |
| Optimization direction | minimize |
| Threshold | Optional — e.g., 0 to color any non-exact match as a regression. |
The `label` is informational; the score is the primary signal.
Runtime requirements
| Setting | Value |
|---|---|
| Sandbox | Any — works in the in-process WebAssembly (Python) or Deno (TypeScript) backends. |
| Dependencies | None: uses Python's `json` module / the built-in `JSON` object in TypeScript. |
| Internet access | Not required. |
| Environment variables | None. |

