Examples
Examples for common scenarios
RAG
Tests a Retrieval-Augmented Generation (RAG) application built with LlamaIndex, scored on metrics from RAGAS.
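In the example itself the scores come from RAGAS. As a library-free illustration of the kind of metric involved, the sketch below computes token overlap between the answer and the retrieved contexts, a rough proxy for faithfulness; the function name and the bag-of-words approach are assumptions for illustration, not the RAGAS implementation.

```python
def context_overlap(answer: str, contexts: list[str]) -> float:
    """Fraction of answer tokens that appear in the retrieved contexts.

    A crude faithfulness proxy: 1.0 means every word in the answer is
    grounded somewhere in the context; 0.0 means none are.
    """
    answer_tokens = set(answer.lower().split())
    context_tokens = set(" ".join(contexts).lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


print(context_overlap(
    "paris is the capital",
    ["paris is the capital of france"],
))
```

Real RAG scorers typically use an LLM or embeddings rather than raw token overlap, but the shape is the same: a function from (answer, contexts) to a score in [0, 1].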
HumanEval
Uses a custom Python scoring function to run the HumanEval benchmark, a popular dataset for code-generation tasks.
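A custom scoring function for a HumanEval-style task can be sketched as follows: execute the generated completion, then run the dataset's unit tests against it and return a pass/fail score. The function name `score_completion` and the harness shape are illustrative assumptions, not the benchmark's actual harness, and `exec` on untrusted model output should be sandboxed in practice.

```python
def score_completion(completion: str, test_code: str, entry_point: str) -> float:
    """Return 1.0 if the generated function passes its unit tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # define the candidate function
        exec(test_code, namespace)    # define `check(candidate)`, HumanEval-style
        namespace["check"](namespace[entry_point])
        return 1.0
    except Exception:
        return 0.0


completion = "def add(a, b):\n    return a + b\n"
tests = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(score_completion(completion, tests, "add"))
```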
Spider
Runs the Spider dataset to demonstrate text-to-SQL evaluation and relevant scorer functions.
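One common text-to-SQL scorer is normalized exact match between the generated query and the gold query. The sketch below is an assumption for illustration; full Spider evaluation also measures execution accuracy by running both queries against the database.

```python
def exact_match(pred_sql: str, gold_sql: str) -> float:
    """1.0 if the two queries are identical after whitespace/case normalization."""
    def normalize(sql: str) -> str:
        return " ".join(sql.lower().replace(";", " ").split())

    return 1.0 if normalize(pred_sql) == normalize(gold_sql) else 0.0


print(exact_match("SELECT name FROM users;", "select name\nfrom users"))  # 1.0
```

Exact match is strict: semantically equivalent queries with different column orderings score 0.0, which is why execution-based scorers are usually used alongside it.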
Chat bot
Uses an LLM to grade the output responses and ensure that they do not contain the phrase “as an AI language model”.
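The example grades responses with an LLM; a deterministic substring check, shown below as a simplified stand-in, captures the core idea of penalizing the canned disclaimer. The function name is an illustrative assumption.

```python
def no_ai_disclaimer(output: str) -> float:
    """Score 1.0 when the response avoids the canned disclaimer, else 0.0."""
    return 0.0 if "as an ai language model" in output.lower() else 1.0


print(no_ai_disclaimer("Sure, here's a recipe for pancakes."))      # 1.0
print(no_ai_disclaimer("As an AI language model, I cannot cook."))  # 0.0
```

An LLM grader generalizes this to paraphrases ("I'm just a chatbot, so...") that a fixed substring check would miss.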
Basic
Uses an entity-extraction use case to check for valid JSON outputs.
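A valid-JSON check for an entity-extraction task can be sketched as below: parse the output and verify it is a JSON object containing the expected keys. The function name and the required-keys parameter are illustrative assumptions.

```python
import json


def valid_json_score(output: str, required_keys: set[str]) -> float:
    """1.0 if the output parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if required_keys <= data.keys() else 0.0


print(valid_json_score('{"name": "Ada", "org": "Analytical Engines"}',
                       {"name", "org"}))  # 1.0
print(valid_json_score("name: Ada", {"name"}))  # 0.0
```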