LLM · Evaluation · Regression · CI/CD

LLM Output
Eval

"Score every response with evidence. Catch regressions in CI. Detect hallucinations claim by claim. Know which prompt phrasings shift scores before you ship."


$ docker-compose run cli regression run \
  eval_results.json --id prod-v1

# Checking 4 dimensions vs baseline

Regression Result — FAILED
baseline:  prod-v1
checks:   4 dimensions
regressions:
  faithfulness: dropped -1.20
  (threshold 0.5)
improvements:
  coherence: +0.40

# Exit code 1 — blocks merge in CI

Problem

Teams switch models, revise prompts, and change fine-tuning datasets, then decide which version is better by reading a few outputs and going with gut feel. There is no audit trail. No reproducible score. And without a reproducible score, there is no way to gate a model upgrade in CI.

LLM Eval Suite scores every response across task-specific dimensions with evidence — a quote from the response that drove each score. Run it in three modes: score independently, compare head-to-head, or rank multiple responses. Plug JUnit XML output into GitHub Actions and model upgrades either pass or fail like any other test.

Architecture

Evidence-backed.
CI-ready.

Task type determines the dimension set. Every score is grounded in a quote from the response being evaluated.

Input

Prompt, one or more LLM responses, task type, and optional source document for faithfulness scoring

Select

Task type selects the calibrated dimension set — QA, summarisation, instruction following, or code generation

Judge

The model judges each dimension with a score, a quote from the response, and a specific reason for the rating

Output

Scores, winner, ranking, recommendation — as Markdown, JSON, or JUnit XML for CI/CD integration
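The Judge step can be pictured as a small record per dimension plus a grounding check. This is a minimal sketch only: the `DimensionScore` class, its field names, and the `is_grounded` helper are hypothetical and are not taken from the suite's actual schema.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """One judged dimension (hypothetical shape, not the suite's schema)."""
    dimension: str
    score: float   # assumed 0-10 scale, matching the example output
    quote: str     # evidence copied from the response
    reason: str    # why the judge gave this score


def is_grounded(result: DimensionScore, response_text: str) -> bool:
    """Return True when the evidence quote appears verbatim in the response."""
    return bool(result.quote) and result.quote in response_text


response = "30-day return, no questions asked."
judged = DimensionScore(
    dimension="faithfulness",
    score=1.0,
    quote="30-day return",
    reason="Source specifies 14 days.",
)
print(is_grounded(judged, response))
```

A check like this lets a pipeline reject any score whose quote cannot be found in the response it claims to describe.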

Features

Evidence.
Regression.
Consensus.

Eight capabilities, each targeting a different failure mode in LLM evaluation — from gut-feel scoring to prompt fragility to judge bias.

Multi-Dimensional Scoring

Ten task presets — QA, summarisation, RAG, code generation, creative writing, and more. Each dimension score includes a verbatim quote from the response. Not "this is good" but specific textual evidence.

Regression Testing

Save any eval report as a named baseline. Run future evals against it, and each per-dimension delta is compared against a configurable threshold. The CLI exits with code 1 in CI when any dimension drops by more than its threshold.
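The comparison itself amounts to a per-dimension delta check. The sketch below is an illustration under stated assumptions, not the suite's implementation: the function name, the single shared threshold, and the baseline scores are all invented, chosen so the deltas match the sample terminal output (-1.20 and +0.40 against a threshold of 0.5).

```python
def check_regression(baseline, current, threshold=0.5):
    """Split per-dimension deltas into regressions and improvements.

    A dimension regresses when its score drops by more than `threshold`.
    """
    regressions, improvements = {}, {}
    for dim, base_score in baseline.items():
        delta = round(current[dim] - base_score, 2)
        if delta < -threshold:
            regressions[dim] = delta
        elif delta > 0:
            improvements[dim] = delta
    return regressions, improvements


# Illustrative scores only (assumed values, not real eval results).
baseline = {"faithfulness": 8.0, "coherence": 7.0}
current = {"faithfulness": 6.8, "coherence": 7.4}

regressions, improvements = check_regression(baseline, current)
exit_code = 1 if regressions else 0  # a non-zero exit is what blocks the merge
print(regressions, improvements, exit_code)
```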

Hallucination Detection

Claim-level analysis against a source document. Each claim is either supported or unsupported — binary, not "mostly faithful." Risk levels: none, low, moderate, high, critical, with a safe_to_use flag for downstream gating.
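One way to turn binary claim verdicts into a risk level is sketched below. The source names the five levels and the safe_to_use flag but not how they are derived, so the cutoffs here (10%, 25%, 50% of claims unsupported) and the rule that only "none" and "low" count as safe are assumptions made for illustration.

```python
def risk_level(supported_flags):
    """Map per-claim verdicts (True = supported) to a risk summary.

    The ratio cutoffs are hypothetical, not the suite's documented values.
    """
    total = len(supported_flags)
    unsupported = supported_flags.count(False)
    ratio = unsupported / total if total else 0.0

    if unsupported == 0:
        level = "none"
    elif ratio <= 0.10:
        level = "low"
    elif ratio <= 0.25:
        level = "moderate"
    elif ratio <= 0.50:
        level = "high"
    else:
        level = "critical"

    return {
        "risk": level,
        "unsupported": unsupported,
        "safe_to_use": level in ("none", "low"),  # assumed gating rule
    }


print(risk_level([True, True, True, False]))
```

Downstream code can then gate on the single boolean instead of interpreting the level itself.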

Prompt Sensitivity Analysis

Test 2–5 prompt variants against a fixed response. Per-dimension variance tells you which dimensions are fragile and which are stable. Know which phrasings shift your scores before you deploy.
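Per-dimension variance across prompt variants can be computed with the standard library. This is a rough sketch: the function name, the input shape (one score dictionary per variant), and the scores are assumed for illustration.

```python
from statistics import pvariance


def dimension_variance(scores_by_variant):
    """Population variance of each dimension across prompt variants."""
    dimensions = scores_by_variant[0].keys()
    return {
        dim: pvariance([scores[dim] for scores in scores_by_variant])
        for dim in dimensions
    }


# Three prompt variants scored against the same fixed response (made-up scores).
variants = [
    {"faithfulness": 8.0, "coherence": 7.0},
    {"faithfulness": 8.0, "coherence": 5.0},
    {"faithfulness": 8.0, "coherence": 9.0},
]

spread = dimension_variance(variants)
most_fragile = max(spread, key=spread.get)
print(spread, most_fragile)
```

Here faithfulness does not move at all while coherence swings with the phrasing, so coherence is the dimension to treat as fragile.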

Panel Evaluation

Run N independent judge passes on the same eval. Mean and variance per dimension expose where judges agree and where they disagree. High-variance dimensions are flagged for human review automatically.
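Aggregating a panel could look like the following sketch. The `summarize_panel` name, the `needs_review` field, and the variance limit of 1.0 are assumptions, since the source does not state what counts as high variance.

```python
from statistics import mean, variance


def summarize_panel(passes, variance_limit=1.0):
    """Mean and sample variance per dimension across N judge passes (N >= 2)."""
    summary = {}
    for dim in passes[0]:
        scores = [judge_pass[dim] for judge_pass in passes]
        var = variance(scores)
        summary[dim] = {
            "mean": mean(scores),
            "variance": var,
            "needs_review": var > variance_limit,  # assumed flagging rule
        }
    return summary


# Three independent judge passes on the same eval (made-up scores).
passes = [
    {"faithfulness": 9.0, "coherence": 4.0},
    {"faithfulness": 9.0, "coherence": 8.0},
    {"faithfulness": 9.0, "coherence": 6.0},
]

panel = summarize_panel(passes)
print(panel)
```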

RAGAS-Compatible RAG Preset

The rag task type exposes the four RAGAS metrics (faithfulness, answer relevancy, context precision, and context recall) as first-class evaluation dimensions with equal weighting.
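Equal weighting means each of the four dimensions contributes a quarter of any combined score. The sketch below assumes the suite reports such an aggregate and uses snake_case dimension keys; both are assumptions made for illustration, as are the scores.

```python
RAG_DIMENSIONS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
)


def rag_score(scores):
    """Equal-weight average of the four RAG dimensions."""
    weight = 1 / len(RAG_DIMENSIONS)
    return sum(weight * scores[dim] for dim in RAG_DIMENSIONS)


example = {
    "faithfulness": 8.0,
    "answer_relevancy": 6.0,
    "context_precision": 7.0,
    "context_recall": 9.0,
}
print(rag_score(example))
```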

Quickstart

Running in
three minutes.

Setup

git clone https://github.com/swapnanil/llm-eval-suite
cd llm-eval-suite
cp .env.example .env   # add your ANTHROPIC_API_KEY
docker-compose up api

CLI — compare two responses

docker-compose run cli eval \
  --file examples/eval_qa.json \
  --mode compare --format markdown

CLI — hallucination detection

docker-compose run cli hallucination \
  --response output.txt --source source.txt --format markdown

CLI — regression check vs saved baseline

docker-compose run cli regression run results.json \
  --id prod-baseline --format markdown

GitHub Actions — gate model upgrades in CI

- name: Run LLM eval
  # a block scalar (|) keeps the backslash line continuations intact
  run: |
    docker-compose run cli eval \
      --file evals/suite.json \
      --mode rank --format junit \
      --output results.xml
- uses: mikepenz/action-junit-report@v3
  with:
    report_paths: results.xml
- name: Regression check
  # exits 1 if any dimension drops beyond threshold
  # results.json must be a JSON eval report; the step above writes only results.xml
  run: |
    docker-compose run cli regression run \
      results.json --id prod-baseline

Example

Two responses in.
Clear winner out.

Input — compare eval JSON

{
  "task_type": "qa",
  "eval_mode": "compare",
  "source": "Refunds within 14 days if unused",
  "responses": [
    {
      "label": "Response A",
      "text": "Refund in 14 days if unused."
    },
    {
      "label": "Response B",
      "text": "30-day return, no questions asked."
    }
  ]
}
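Eval files like this one can also be generated from code instead of written by hand. The sketch below builds the same payload; the `build_compare_eval` helper is hypothetical, and only the field names shown in the example above are used.

```python
import json


def build_compare_eval(source, responses, task_type="qa"):
    """Build a compare-mode eval payload from (label, text) pairs."""
    if len(responses) < 2:
        raise ValueError("compare mode needs at least two responses")
    return {
        "task_type": task_type,
        "eval_mode": "compare",
        "source": source,
        "responses": [{"label": label, "text": text} for label, text in responses],
    }


payload = build_compare_eval(
    "Refunds within 14 days if unused",
    [
        ("Response A", "Refund in 14 days if unused."),
        ("Response B", "30-day return, no questions asked."),
    ],
)
eval_json = json.dumps(payload, indent=2)
print(eval_json)
```

Serialising with `json.dumps` guarantees valid JSON, including correctly escaped quotes inside response text.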

Output — evaluation result

winner:  Response A
margin:  clear

Response B — Faithfulness
score:     1.0 / 10
reasoning:
  "States '30-day return', but the source specifies 14 days. Clear hallucination."
quote:
  "30-day return, no questions asked."
