LLM · Evaluation · Regression · CI/CD

LLM Output
Eval

"Score every response with evidence. Catch regressions in CI. Detect hallucinations claim by claim. Know which prompt phrasings shift scores before you ship."

$ docker-compose run cli regression run \
  eval_results.json --id prod-v1

# Checking 4 dimensions vs baseline

Regression Result — FAILED
baseline:  prod-v1
checks:   4 dimensions
regressions:
  faithfulness: dropped -1.20
  (threshold 0.5)
improvements:
  coherence: +0.40

# Exit code 1 — blocks merge in CI

Teams switch between models, upgrade prompts, and change fine-tuning datasets — then decide which version is better by reading a few outputs and going with gut feel. There is no audit trail. No reproducible score. No way to gate a model upgrade in CI without it.

LLM Eval Suite scores every response across task-specific dimensions with evidence — a quote from the response that drove each score. Run it in three modes: score independently, compare head-to-head, or rank multiple responses. Plug JUnit XML output into GitHub Actions and model upgrades either pass or fail like any other test.


Evidence-backed.
CI-ready.

Task type determines the dimension set. Every score is grounded in a quote from the response being evaluated.

01
Input
Prompt, one or more LLM responses, task type, and optional source document for faithfulness scoring
02
Select
Task type selects the calibrated dimension set — QA, summarisation, instruction following, or code generation
03
Judge
Claude judges each dimension with a score, a quote from the response, and a specific reason for the rating
04
Output
Scores, winner, ranking, recommendation — as Markdown, JSON, or JUnit XML for CI/CD integration

Evidence.
Regression.
Consensus.

Eight capabilities, each targeting a different failure mode in LLM evaluation — from gut-feel scoring to prompt fragility to judge bias.

Multi-Dimensional Scoring
Ten task presets — QA, summarisation, RAG, code generation, creative writing, and more. Each dimension score includes a verbatim quote from the response. Not "this is good" but specific textual evidence.
Regression Testing
Save any eval report as a named baseline. Run future evals against it — per-dimension deltas are compared against configurable thresholds. Exit code 1 in CI when scores drop below your floor.
Hallucination Detection
Claim-level analysis against a source document. Each claim is either supported or unsupported — binary, not "mostly faithful." Risk levels: none, low, moderate, high, critical, with a safe_to_use flag for downstream gating.
Prompt Sensitivity Analysis
Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile and which are stable. Know which phrasings shift your scores before you deploy.
Panel Evaluation
Run N independent judge passes on the same eval. Mean and variance per dimension expose where judges agree and where they disagree. High-variance dimensions are flagged for human review automatically.
RAGAS-Compatible RAG Preset
The rag task type maps faithfulness, answer relevancy, context precision, and context recall — the four RAGAS metrics — as first-class evaluation dimensions with equal weighting.

Running in
three minutes.

Setup
git clone https://github.com/swapnanil/llm-eval-suite
cd llm-eval-suite
cp .env.example .env   # add your ANTHROPIC_API_KEY
docker-compose up api
CLI — compare two responses
docker-compose run cli eval \
  --file examples/eval_qa.json \
  --mode compare --format markdown
CLI — hallucination detection
docker-compose run cli hallucination \
  --response output.txt --source source.txt --format markdown
CLI — regression check vs saved baseline
docker-compose run cli regression run results.json \
  --id prod-baseline --format markdown
GitHub Actions — gate model upgrades in CI
- name: Run LLM eval
  run: docker-compose run cli eval \
    --file evals/suite.json \
    --mode rank --format junit \
    --output results.xml
- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml
- name: Regression check
  run: docker-compose run cli regression run \
    results.json --id prod-baseline
  # exits 1 if any dimension drops beyond threshold

Two responses in.
Clear winner out.

Input — compare eval JSON
{
  "task_type": "qa",
  "eval_mode": "compare",
  "source":
    "Refunds within 14
     days if unused",
  "responses": [{
    "label": "Response A",
    "text":
      "Refund in 14 days
       if unused."
  }, {
    "label": "Response B",
    "text":
      "30-day return,
       no questions asked."
  }]
}
Output — evaluation result
winner:  Response A
margin:  clear

Response B — Faithfulness
score:     1.0 / 10
reasoning:
  "States '30-day policy'
   — source specifies 14
   days. Clear hallucin-
   ation."
quote:
  "30-day return policy,
   no questions asked"

Five more tools.
Same standard.

Each tool is a standalone CLI + REST API solving a real enterprise problem with Claude.