Embed the model output and the reference with an embeddings model, then report their cosine similarity. This is the standard fuzzy-match check for free-text outputs — wording differences shouldn’t count as failures as long as the meaning matches. The example uses OpenAI’s text-embedding-3-small. The same shape works for any HTTP embeddings endpoint; swap the client and model name to switch providers.

Code

import math
import os

from openai import OpenAI

_MODEL = "text-embedding-3-small"
_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def _embed(text):
    # One API request per call; a single input string yields a single embedding.
    response = _client.embeddings.create(model=_MODEL, input=text)
    return response.data[0].embedding


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; return 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate(output, reference):
    if not output or not reference:
        return {
            "label": "missing",
            "score": 0.0,
            "explanation": "Missing output or reference.",
        }

    similarity = _cosine(_embed(str(output)), _embed(str(reference)))
    return {
        "score": similarity,
        "explanation": (
            f"Cosine similarity {similarity:.4f} (model={_MODEL})."
        ),
    }
Sandbox dependencies — paste into the sandbox configuration’s Dependencies field, one package per line:
openai
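
To illustrate the point that the same shape works for any embeddings provider, here is a minimal sketch that separates the scoring logic from the client. The names `make_evaluator`, `cosine`, and `toy_embed` are hypothetical and not part of the original evaluator; `toy_embed` is a stand-in that counts letters so the example runs without network access, and it is not a real embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_evaluator(embed, model_name):
    """Build an evaluate(output, reference) function around any embed callable.

    embed: hypothetical callable mapping a string to a list of floats.
    model_name: only used in the explanation text.
    """

    def evaluate(output, reference):
        if not output or not reference:
            return {
                "label": "missing",
                "score": 0.0,
                "explanation": "Missing output or reference.",
            }
        similarity = cosine(embed(str(output)), embed(str(reference)))
        return {
            "score": similarity,
            "explanation": f"Cosine similarity {similarity:.4f} (model={model_name}).",
        }

    return evaluate


def toy_embed(text):
    # Illustration only: a 26-dimensional letter-count vector, not a real model.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


evaluate_toy = make_evaluator(toy_embed, "toy-letter-counts")
```

Switching providers then means passing a different `embed` callable, for example a small wrapper around `_client.embeddings.create` like the `_embed` helper above, while the scoring code stays unchanged.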

Input mapping

| Parameter | Bind to |
| --- | --- |
| output | The model output to score, usually output. |
| reference | The ground-truth string, usually reference. |

Output configuration

Continuous score in the range -1.0 to 1.0 (cosine similarity). Optimization direction: maximize. In practice, OpenAI’s text-embedding-3 models produce non-negative similarities on natural-language pairs, so a 0.0 to 1.0 range with a low-end threshold (e.g. 0.7 for “close enough”) is also reasonable.
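
If a pass/fail label is wanted on top of the continuous score, the thresholding described above could be sketched as follows. The function name `to_pass_fail` and the labels `pass` and `fail` are hypothetical, and 0.7 is only the example cutoff from the text, not a tuned value.

```python
def to_pass_fail(score, threshold=0.7):
    """Clamp a cosine similarity into 0.0..1.0 and apply a cutoff.

    threshold: assumed cutoff; calibrate it on your own data.
    """
    clamped = max(0.0, min(1.0, score))
    label = "pass" if clamped >= threshold else "fail"
    return {"label": label, "score": clamped}
```

Clamping discards the rare negative similarity instead of rescaling the whole range, which keeps scores comparable with the unclamped evaluator for every non-negative value.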

Runtime requirements

| Setting | Value |
| --- | --- |
| Sandbox | A hosted backend that matches your language. Python: E2B, Daytona — Python, Vercel Sandbox — Python, or Modal. TypeScript: Daytona — TypeScript or Vercel Sandbox — TypeScript (the local Deno sandbox is started with --no-npm and cannot install the openai package). |
| Dependencies | Python: openai. TypeScript: openai (npm). Add it under Dependencies when creating the sandbox configuration. |
| Internet access | Required. Toggle Allow Internet Access on for the configuration; the sandbox must reach api.openai.com. |
| Environment variables | OPENAI_API_KEY, preferably set as a secret reference to a key in Settings → Secrets, not a literal value. |

Each evaluate(...) call makes two embedding requests (one for output, one for reference). When running this across a large dataset:
  • Raise the sandbox configuration’s Timeout if the default is too tight for a cold-start install plus two API calls.
  • Watch the upstream provider’s rate limits and per-token cost — at production volume this adds up fast.
  • If reference is fixed across many examples (e.g. a shared gold answer), pre-compute its embedding once and store it on the example. The evaluator then needs only one API call per row, or none at all if you also pre-embed the output.
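
The last point, scoring against a pre-computed reference embedding, might look like the sketch below. It assumes the stored vector is passed in as `reference_embedding` (a hypothetical parameter name) and that it was produced by the same model used for the output; `embed` is any callable that maps text to a vector, such as the `_embed` helper from the code above.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate_precomputed(output, reference_embedding, embed):
    """Score output against an already-embedded reference.

    Only the output is embedded here, so each row costs one request
    instead of two.
    """
    if not output or not reference_embedding:
        return {
            "label": "missing",
            "score": 0.0,
            "explanation": "Missing output or reference embedding.",
        }
    similarity = cosine(embed(str(output)), reference_embedding)
    return {
        "score": similarity,
        "explanation": f"Cosine similarity {similarity:.4f} (reference pre-embedded).",
    }
```

Vectors from different embedding models are not comparable, so the stored reference embedding has to be regenerated whenever the model name changes.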