Examples
Examples for common scenarios
RAG
Tests a Retrieval-Augmented Generation (RAG) application built with LlamaIndex, scored on metrics from RAGAS.
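In the example itself the scores come from RAGAS. As a library-free illustration of the kind of metric involved, the sketch below computes token overlap between the answer and the retrieved contexts, a rough proxy for faithfulness; the function name and the bag-of-words approach are assumptions for illustration, not the RAGAS implementation.

```python
def context_overlap(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that appear in the retrieved contexts.

    A crude faithfulness proxy: 1.0 means every word in the answer is
    grounded somewhere in the context; 0.0 means none are.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(" ".join(contexts).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


print(context_overlap(
    "paris is the capital",
    ["paris is the capital of france"],
))
```

Real RAG scorers typically use an LLM or embeddings rather than raw token overlap, but the shape is the same: a function from (answer, contexts) to a score in [0, 1].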
HumanEval
Uses a custom Python scoring function to run the HumanEval benchmark, a popular dataset for code-generation tasks.
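A custom scoring function for a HumanEval-style task can be sketched as follows: execute the generated completion, then run the dataset's unit tests against it and return a pass/fail score. The function name `score_completion` and the harness shape are illustrative assumptions, not the benchmark's actual harness, and `exec` on untrusted model output should be sandboxed in practice.

```python
def score_completion(completion: str, test_code: str, entry_point: str) -> float:
    """Return 1.0 if the generated function passes its unit tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # define `check(candidate)`, HumanEval-style
        namespace["check"](namespace[entry_point])
        return 1.0
    except Exception:
        return 0.0


completion = "def add(a, b):\n    return a + b\n"
tests = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(score_completion(completion, tests, "add"))
```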
Spider
Runs the Spider dataset to demonstrate text-to-SQL evaluation and relevant scorer functions.
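One common text-to-SQL scorer is normalized exact match between the generated query and the gold query. The sketch below is an assumption for illustration; full Spider evaluation also measures execution accuracy by running both queries against the database.

```python
def exact_match(pred_sql: str, gold_sql: str) -> float:
    """1.0 if the two queries are identical after whitespace/case normalization."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().replace(";", " ").split())

    return 1.0 if normalize(pred_sql) == normalize(gold_sql) else 0.0


print(exact_match("SELECT name FROM users;", "select name\nfrom users"))  # 1.0
```

Exact match is strict: semantically equivalent queries with different column orderings score 0.0, which is why execution-based scorers are usually used alongside it.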
Chat bot
Uses an LLM to grade the output responses and ensure that they do not contain the phrase “as an AI language model”.
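The example grades responses with an LLM; a deterministic substring check, shown below as a simplified stand-in, captures the core idea of penalizing the canned disclaimer. The function name is an illustrative assumption.

```python
def no_ai_disclaimer(output: str) -> float:
    """Score 1.0 when the response avoids the canned disclaimer, else 0.0."""
    return 0.0 if "as an ai language model" in output.lower() else 1.0


print(no_ai_disclaimer("Sure, here's a recipe for pancakes."))      # 1.0
print(no_ai_disclaimer("As an AI language model, I cannot cook."))  # 0.0
```

An LLM grader generalizes this to paraphrases ("I'm just a chatbot, so...") that a fixed substring check would miss.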
Basic
Uses an entity-extraction use case to check for valid JSON outputs.
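A valid-JSON check for an entity-extraction task can be sketched as below: parse the output and verify it is a JSON object containing the expected keys. The function name and the required-keys parameter are illustrative assumptions.

```python
import json


def valid_json_score(output: str, required_keys: set[str]) -> float:
    """1.0 if the output parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if required_keys <= data.keys() else 0.0


print(valid_json_score('{"name": "Ada", "org": "Analytical Engines"}',
                       {"name", "org"}))  # 1.0
print(valid_json_score("name: Ada", {"name"}))  # 0.0
```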