"Score every response with evidence. Catch regressions in CI. Detect hallucinations claim by claim. Know which prompt phrasings shift scores before you ship."
$ docker-compose run cli regression run \
    eval_results.json --id prod-v1
# Checking 4 dimensions vs baseline
Regression Result — FAILED
baseline: prod-v1
checks: 4 dimensions
regressions:
  faithfulness: dropped -1.20 (threshold 0.5)
improvements:
  coherence: +0.40
# Exit code 1 — blocks merge in CI
Teams switch between models, upgrade prompts, and change fine-tuning datasets, then decide which version is better by reading a few outputs and going with gut feel. There is no audit trail and no reproducible score, so there is no way to gate a model upgrade in CI.
LLM Eval Suite scores every response across task-specific dimensions with evidence — a quote from the response that drove each score. Run it in three modes: score independently, compare head-to-head, or rank multiple responses. Plug JUnit XML output into GitHub Actions and model upgrades either pass or fail like any other test.
Task type determines the dimension set. Every score is grounded in a quote from the response being evaluated.
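One way to read "grounded in a quote" is that the evidence must appear verbatim in the response being scored. The sketch below illustrates that check under stated assumptions; it is not the suite's actual implementation, and `DimensionScore` and `is_grounded` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """Hypothetical record for one scored dimension (not the suite's real schema)."""
    dimension: str
    score: float      # assumed 0-10 scale, as in the sample output
    reasoning: str
    quote: str        # evidence that should come from the response itself


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line wrapping does not break the match.
    return " ".join(text.split()).lower()


def is_grounded(score: DimensionScore, response_text: str) -> bool:
    """Return True when the evidence quote occurs verbatim in the response."""
    quote = _normalize(score.quote)
    return bool(quote) and quote in _normalize(response_text)
```

A score whose quote fails this check is worth flagging, since the judge may have paraphrased or invented its evidence instead of citing the response.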
Eight capabilities, each targeting a different failure mode in LLM evaluation — from gut-feel scoring to prompt fragility to judge bias.
git clone https://github.com/swapnanil/llm-eval-suite
cd llm-eval-suite
cp .env.example .env # add your ANTHROPIC_API_KEY
docker-compose up api

docker-compose run cli eval \
    --file examples/eval_qa.json \
    --mode compare --format markdown

docker-compose run cli hallucination \
    --response output.txt --source source.txt --format markdown

docker-compose run cli regression run results.json \
    --id prod-baseline --format markdown
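The regression command compares each dimension against a stored baseline and fails when a score drops beyond a threshold. A minimal sketch of that comparison follows. It assumes scores are plain `{dimension: score}` mappings and uses the 0.5 threshold from the sample output; `find_regressions` and `exit_code` are hypothetical helpers, not the suite's API.

```python
def find_regressions(baseline, current, threshold=0.5):
    """Split per-dimension deltas into regressions and improvements.

    A dimension regresses when its score falls by more than `threshold`.
    """
    regressions, improvements = {}, {}
    for dimension, base_score in baseline.items():
        if dimension not in current:
            continue  # assumption: dimensions missing from the new run are skipped
        delta = round(current[dimension] - base_score, 2)
        if delta < -threshold:
            regressions[dimension] = delta
        elif delta > 0:
            improvements[dimension] = delta
    return regressions, improvements


def exit_code(regressions) -> int:
    """Non-zero when anything regressed, so a CI step fails on it."""
    return 1 if regressions else 0
```

With illustrative scores such as a baseline of `{"faithfulness": 8.0, "coherence": 7.0}` and a new run of `{"faithfulness": 6.8, "coherence": 7.4}`, faithfulness lands in the regressions and coherence in the improvements. A CLI wrapper would then pass `exit_code(regressions)` to `sys.exit`.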
- name: Run LLM eval
  run: |
    docker-compose run cli eval \
      --file evals/suite.json \
      --mode rank --format junit \
      --output results.xml
- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml
- name: Regression check
  run: |
    # exits 1 if any dimension drops beyond threshold
    docker-compose run cli regression run \
      results.json --id prod-baseline
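For readers wondering how scores can surface as test results, the sketch below renders a set of dimension scores as JUnit XML, with one test case per dimension. How the suite itself maps scores to test cases is not documented here, so the `min_score` pass mark and the `scores_to_junit` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def scores_to_junit(scores, min_score=5.0, suite_name="llm-eval"):
    """Render {dimension: score} as a JUnit XML string.

    `min_score` is a hypothetical pass mark; dimensions below it become failures.
    """
    failed = [d for d, s in scores.items() if s < min_score]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(scores)),
        failures=str(len(failed)),
    )
    for dimension, score in scores.items():
        case = ET.SubElement(suite, "testcase", classname=suite_name, name=dimension)
        if score < min_score:
            # A <failure> child is what JUnit report readers treat as a failed test.
            failure = ET.SubElement(
                case,
                "failure",
                message=f"{dimension} scored {score:.1f}, below {min_score:.1f}",
            )
            failure.text = f"score={score:.1f}"
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to `results.xml` gives a report action a file it can parse, with each failing dimension shown as a failed test.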
{
  "task_type": "qa",
  "eval_mode": "compare",
  "source": "Refunds within 14 days if unused",
  "responses": [
    { "label": "Response A", "text": "Refund in 14 days if unused." },
    { "label": "Response B", "text": "30-day return, no questions asked." }
  ]
}
winner: Response A
margin: clear

Response B — Faithfulness
score: 1.0 / 10
reasoning: "States '30-day return' — source specifies 14 days. Clear hallucination."
quote: "30-day return, no questions asked."
Each tool is a standalone CLI + REST API solving a real enterprise problem with Claude.