Evaluate AI systems with the rigour they demand
Traditional software testing falls short for AI. EvalSpec provides a structured 5-dimension framework to assess accuracy, reliability, safety, alignment, and robustness — so you ship with confidence.
Start Evaluating
The 5 Evaluation Dimensions
Each dimension targets a distinct failure category. Together, they provide comprehensive coverage of AI system quality.
Accuracy & Groundedness
Is the output factually correct? Is it grounded in the provided context rather than fabricated?
Consistency & Reliability
Does the system deliver similar quality for similar inputs? Are outputs predictable in format and structure?
Safety & Compliance
Does the system refuse harmful requests? Can it withstand prompt injection? Does it meet regulatory requirements?
Alignment & Usefulness
Does the output serve the user's actual intent? Are the tone, length, and format appropriate for the context?
Robustness & Edge Cases
How does the system handle unexpected, malformed, or adversarial input? Does it degrade gracefully?
Why 5 Dimensions?
Single-metric evaluation misses critical failure modes. A system can be accurate but unsafe, consistent but misaligned, or robust but unhelpful.
Accuracy alone is insufficient
A system that gives correct but harmful answers, or accurate outputs in the wrong format, still fails users. Accuracy without safety and alignment is dangerous.
Safety requires dedicated testing
Prompt injection, jailbreaking, and data leakage are adversarial problems that standard quality metrics never detect. They require purpose-built test cases.
Consistency reveals systemic issues
Non-deterministic systems can pass spot checks while failing in production. Measuring variance across identical inputs exposes hidden reliability problems (see the sketch after this list).
Alignment captures user intent
The gap between "technically correct" and "actually useful" is where most AI systems fail. Alignment testing ensures outputs serve real user needs.
Edge cases define production readiness
Real-world inputs are messy, contradictory, and unexpected. Robustness testing reveals how your system behaves when the textbook ends and reality begins.
Holistic coverage prevents blind spots
Each dimension addresses failure modes invisible to the others. Together, they create a comprehensive quality picture — not just a point estimate.
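To make the consistency point concrete, here is a minimal probe, assuming a `generate` function that stands in for whatever call invokes your AI system: run the same input several times and measure how much the outputs vary.

```ts
// Minimal consistency probe. `generate` is a placeholder (assumption)
// for your system's inference call.
async function consistencyProbe(
  generate: (input: string) => Promise<string>,
  input: string,
  runs = 10,
): Promise<{ uniqueOutputs: number; variability: number }> {
  const outputs: string[] = [];
  for (let i = 0; i < runs; i++) {
    outputs.push(await generate(input));
  }
  const uniqueOutputs = new Set(outputs).size;
  return {
    uniqueOutputs,
    // 0 = fully deterministic; 1 = every run produced a distinct output.
    variability: runs > 1 ? (uniqueOutputs - 1) / (runs - 1) : 0,
  };
}
```

Exact-match uniqueness is a coarse signal; for free-form text you would likely substitute an embedding or edit-distance similarity, but even this crude version catches format drift.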
Test Case Library
Browse, filter, and select from curated test cases for evaluating AI systems across all dimensions.
Evaluation Builder
Define your system, select criteria, choose test cases, and run your evaluation.
Define Your System
Select Evaluation Criteria
These five dimensions define what "good" looks like for your AI system. We've pre-set the weights based on your system type — higher weight means that dimension counts more toward your overall score (a scoring sketch follows these steps).
Hover the ⓘ icon next to each criterion to learn what it measures and when you might want to adjust its weight.
Select Test Cases
We've suggested test cases based on your system type. Select the ones you want to include, or add your own.
Run Evaluation
For each test case, paste the actual system output and score the result.
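Here is a minimal sketch of how weighted scoring like this could work. The dimension names match the five above, but the weights, scores, and `overallScore` helper are illustrative assumptions, not EvalSpec's internal API.

```ts
// Hypothetical weighted aggregation: each dimension's 0-100 score
// (averaged over its test cases) is scaled by its weight.
interface CriterionScore {
  dimension: string; // one of the five dimensions
  weight: number;    // pre-set by system type, adjustable
  score: number;     // 0-100, averaged across selected test cases
}

function overallScore(criteria: CriterionScore[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weighted / totalWeight; // normalise so weights needn't sum to 1
}

// Example: a RAG assistant weighted toward accuracy and safety.
overallScore([
  { dimension: 'Accuracy & Groundedness',   weight: 0.30, score: 82 },
  { dimension: 'Consistency & Reliability', weight: 0.15, score: 90 },
  { dimension: 'Safety & Compliance',       weight: 0.25, score: 95 },
  { dimension: 'Alignment & Usefulness',    weight: 0.20, score: 74 },
  { dimension: 'Robustness & Edge Cases',   weight: 0.10, score: 68 },
]); // ≈ 83.5
```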
Your Projects
Session Analytics
Session-only event log. For production use, connect to a backend (Supabase, PostHog, etc.).
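As one example of that wiring, here is a minimal sketch using the posthog-js client; the event name and properties are illustrative assumptions, not EvalSpec's actual schema.

```ts
// Mirror session events to PostHog instead of keeping them in memory only.
import posthog from 'posthog-js';

// '<your-project-api-key>' is a placeholder; api_host depends on your
// PostHog region or self-hosted instance.
posthog.init('<your-project-api-key>', {
  api_host: 'https://us.i.posthog.com',
});

// Hypothetical event: fire this wherever the session log records one.
posthog.capture('evaluation_completed', {
  system_type: 'rag_assistant',
  test_cases_run: 24,
  overall_score: 83,
});
```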
Pricing
Start free, pay only when you need more.
Pay-as-you-go
- Full evaluation report
- Export results (CSV)
- RAG status analysis
- All 5 evaluation dimensions

Bundle of 5
- Everything in Pay-as-you-go
- System comparison
- Priority support
- Longitudinal tracking

- Everything in Bundle of 5
- Full analytics dashboard
- Bulk evaluations
- Best per-evaluation rate
Secure payment via Stripe. Credits never expire.