All eval templates are tested against golden data shipped as part of the LLM eval library's benchmark datasets, and target precision of 70-90% and F1 of 70-85%.
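As a rough illustration of how those targets are measured, the sketch below scores a binary eval template's outputs against golden labels and reports precision and F1. The label values and example records are hypothetical, not taken from the library.

```python
# Minimal sketch: scoring an eval template's binary labels against golden data.
# Label names ("factual" / "hallucinated") and the example records are illustrative.

def precision_and_f1(predicted, golden, positive="factual"):
    """Compute precision and F1 for the positive class."""
    tp = sum(p == positive and g == positive for p, g in zip(predicted, golden))
    fp = sum(p == positive and g != positive for p, g in zip(predicted, golden))
    fn = sum(p != positive and g == positive for p, g in zip(predicted, golden))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, f1

golden = ["factual", "hallucinated", "factual", "factual"]
predicted = ["factual", "factual", "factual", "hallucinated"]
precision, f1 = precision_and_f1(predicted, golden)
print(f"precision={precision:.2f}, f1={f1:.2f}")  # precision=0.67, f1=0.67
```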

1. Faithfulness Eval

Evaluates whether a response is faithfully grounded in the supplied context (a minimal usage sketch appears after this list).

Tested on: HaluEval QA Dataset, HaluEval RAG Dataset

2. Code Metrics

3. Q&A Eval

Question answering over private data.

Tested on: WikiQA

4. Retrieval Eval

Relevance of individually retrieved documents in a RAG pipeline.

Tested on: MS MARCO, WikiQA

5. Summarization Eval

Quality of generated summaries.

Tested on: Gigaword, CNNDM, XSum

6. Code Generation Eval

Correctness and readability of generated code.

Tested on: WikiSQL, HumanEval, CodeXGLUE

7. Toxicity Eval

9. Reference Link

10. User Frustration

12. Agent Function Calling
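For illustration, here is a minimal sketch of how an eval such as the Faithfulness Eval (item 1) can be run as an LLM-as-judge binary classifier over (question, context, answer) records. The prompt wording, the `llm_complete` callable, and the output labels are assumptions for this sketch, not the library's actual API or template.

```python
# Minimal sketch of an LLM-as-judge faithfulness check. It assumes a caller-supplied
# llm_complete(prompt) -> str function; the prompt text and labels are illustrative.

FAITHFULNESS_PROMPT = """You are checking whether an answer is faithful to the context.
Context: {context}
Question: {question}
Answer: {answer}
Reply with exactly one word: "factual" if the answer is fully supported by the context,
otherwise "hallucinated"."""

def faithfulness_label(llm_complete, question: str, context: str, answer: str) -> str:
    """Return 'factual' or 'hallucinated' for one (question, context, answer) record."""
    prompt = FAITHFULNESS_PROMPT.format(context=context, question=question, answer=answer)
    reply = llm_complete(prompt).strip().lower()
    return "factual" if reply.startswith("factual") else "hallucinated"

# Labels produced this way can be scored against HaluEval-style golden labels
# using the precision/F1 helper shown earlier.
```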