Evaluate AI systems with the rigour they demand
Traditional software testing falls short for AI. EvalSpec provides a structured 5-dimension framework to assess accuracy, reliability, safety, alignment, and robustness — so you ship with confidence.
Start Evaluating
The 5 Evaluation Dimensions
Each dimension targets a distinct failure category. Together, they provide comprehensive coverage of AI system quality.
Accuracy & Groundedness
Is the output factually correct? Is it grounded in the provided context rather than fabricated?
Consistency & Reliability
Does the system deliver similar quality for similar inputs? Are outputs predictable in format and structure?
Safety & Compliance
Does the system refuse harmful requests? Can it withstand prompt injection? Does it meet regulatory requirements?
Alignment & Usefulness
Does the output serve the user's actual intent? Are the tone, length, and format appropriate for the context?
Robustness & Edge Cases
How does the system handle unexpected, malformed, or adversarial input? Does it degrade gracefully?
Why 5 Dimensions?
Single-metric evaluation misses critical failure modes. A system can be accurate but unsafe, consistent but misaligned, or robust but unhelpful.
Accuracy alone is insufficient
A system that gives correct but harmful answers, or accurate outputs in the wrong format, still fails users. Accuracy without safety and alignment is dangerous.
Safety requires dedicated testing
Prompt injection, jailbreaking, and data leakage are adversarial problems that standard quality metrics never detect. They require purpose-built test cases.
Consistency reveals systemic issues
Non-deterministic systems can pass spot checks while failing in production. Measuring variance across identical inputs exposes hidden reliability problems (see the sketch after this list).
Alignment captures user intent
The gap between "technically correct" and "actually useful" is where most AI systems fail. Alignment testing ensures outputs serve real user needs.
Edge cases define production readiness
Real-world inputs are messy, contradictory, and unexpected. Robustness testing reveals how your system behaves when the textbook ends and reality begins.
Holistic coverage prevents blind spots
Each dimension addresses failure modes invisible to the others. Together, they create a comprehensive quality picture — not just a point estimate.
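To make the consistency point concrete, here is a minimal probe, assuming a `generate` function that stands in for whatever call invokes your AI system: run the same input several times and measure how much the outputs vary.

```ts
// Minimal consistency probe. `generate` is a placeholder (assumption)
// for your system's inference call.
async function consistencyProbe(
  generate: (input: string) => Promise<string>,
  input: string,
  runs = 10,
): Promise<{ uniqueOutputs: number; variability: number }> {
  const outputs: string[] = [];
  for (let i = 0; i < runs; i++) {
    outputs.push(await generate(input));
  }
  const uniqueOutputs = new Set(outputs).size;
  return {
    uniqueOutputs,
    // 0 = fully deterministic; 1 = every run produced a distinct output.
    variability: runs > 1 ? (uniqueOutputs - 1) / (runs - 1) : 0,
  };
}
```

Exact-match uniqueness is a coarse signal; for free-form text you would likely substitute an embedding or edit-distance similarity, but even this crude version catches format drift.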
Test Case Library
Browse, filter, and select from curated test cases for evaluating AI systems across all dimensions.
Evaluation Builder
Define your system, select criteria, choose test cases, and run your evaluation.
Define Your System
Select Evaluation Criteria
These five dimensions define what "good" looks like for your AI system. We've pre-set the weights based on your system type — higher weight means that dimension counts more toward your overall score (a scoring sketch follows these steps).
Hover the ⓘ icon next to each criterion to learn what it measures and when you might want to adjust its weight.
Select Test Cases
We've suggested test cases based on your system type. Select the ones you want to include, or add your own.
Run Evaluation
For each test case, paste the actual system output and score the result.
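Here is a minimal sketch of how weighted scoring like this could work. The dimension names match the five above, but the weights, scores, and `overallScore` helper are illustrative assumptions, not EvalSpec's internal API.

```ts
// Hypothetical weighted aggregation: each dimension's 0-100 score
// (averaged over its test cases) is scaled by its weight.
interface CriterionScore {
  dimension: string; // one of the five dimensions
  weight: number;    // pre-set by system type, adjustable
  score: number;     // 0-100, averaged across selected test cases
}

function overallScore(criteria: CriterionScore[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weighted / totalWeight; // normalise so weights needn't sum to 1
}

// Example: a RAG assistant weighted toward accuracy and safety.
overallScore([
  { dimension: 'Accuracy & Groundedness',   weight: 0.30, score: 82 },
  { dimension: 'Consistency & Reliability', weight: 0.15, score: 90 },
  { dimension: 'Safety & Compliance',       weight: 0.25, score: 95 },
  { dimension: 'Alignment & Usefulness',    weight: 0.20, score: 74 },
  { dimension: 'Robustness & Edge Cases',   weight: 0.10, score: 68 },
]); // ≈ 83.5
```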
Your Projects
Session Analytics
Session-only event log. For production use, connect to a backend (Supabase, PostHog, etc.).
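As one example of that wiring, here is a minimal sketch using the posthog-js client; the event name and properties are illustrative assumptions, not EvalSpec's actual schema.

```ts
// Mirror session events to PostHog instead of keeping them in memory only.
import posthog from 'posthog-js';

// '<your-project-api-key>' is a placeholder; api_host depends on your
// PostHog region or self-hosted instance.
posthog.init('<your-project-api-key>', {
  api_host: 'https://us.i.posthog.com',
});

// Hypothetical event: fire this wherever the session log records one.
posthog.capture('evaluation_completed', {
  system_type: 'rag_assistant',
  test_cases_run: 24,
  overall_score: 83,
});
```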
Pricing
Start free, pay only when you need more.
Pay-as-you-go
- Full evaluation report
- Export results (CSV)
- RAG status analysis
- All 5 evaluation dimensions

Bundle of 5
- Everything in Pay-as-you-go
- System comparison
- Priority support
- Longitudinal tracking

- Everything in Bundle of 5
- Full analytics dashboard
- Bulk evaluations
- Best per-evaluation rate
Secure payment via Stripe. Credits never expire.