Evaluation Methods
Spec27 supports three ways to determine whether an output passes: strict equality, permitted values, and judge-based scoring.
Strict equality
Use strict equality when the output must match the expected answer exactly.
Best for:
- deterministic responses
- exact answer checks
- baseline validation
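As a rough illustration of what an exact-match check amounts to, here is a minimal Python sketch. The function name `strict_equal` is hypothetical and is not part of Spec27's interface; it only shows the comparison logic, assuming outputs are plain strings.

```python
def strict_equal(output: str, expected: str) -> bool:
    """Pass only when the output is character-for-character identical.

    Hypothetical helper for illustration; not a Spec27 API.
    """
    return output == expected


# An identical string passes.
print(strict_equal("42", "42"))
# Any difference fails, including trailing whitespace.
print(strict_equal("42 ", "42"))
```

Because even whitespace or casing differences cause a failure, this kind of check suits outputs that are fully deterministic.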
Permitted values
Use permitted values when multiple outputs are acceptable, but the set is still constrained.
Best for:
- fixed labels
- multiple approved variants
- simple classification-style outputs
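A permitted-values check can be thought of as set membership. The sketch below assumes string outputs; the name `in_permitted_values` and the example labels are hypothetical, chosen only to illustrate the idea.

```python
from typing import FrozenSet


def in_permitted_values(output: str, permitted: FrozenSet[str]) -> bool:
    """Pass when the output is one of the approved values.

    Hypothetical helper for illustration; not a Spec27 API.
    """
    return output in permitted


# Example label set (an assumption for this sketch).
LABELS = frozenset({"positive", "negative", "neutral"})

# A listed label passes; anything outside the set fails.
print(in_permitted_values("neutral", LABELS))
print(in_permitted_values("mixed", LABELS))
```

Strict equality is the special case where the set contains exactly one value.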
Judge-based scoring
Use judge-based scoring when correctness depends on interpretation rather than exact matching.
Best for:
- rubric-based reviews
- nuanced policy checks
- outputs where explanation and scoring matter
Judge-based runs can include a structured score, an explanation, and vote details.
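To make the shape of such a result concrete, the following Python sketch models a score, an explanation, and per-judge votes. All class and field names are hypothetical, and combining votes by the fraction of passing judges is an assumption made for illustration; the source does not describe how Spec27 structures or aggregates judge output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JudgeVote:
    """One judge's verdict (hypothetical structure)."""
    judge: str
    passed: bool
    explanation: str


@dataclass
class JudgeResult:
    """Aggregated result: score, explanation, and vote details."""
    score: float
    explanation: str
    votes: List[JudgeVote] = field(default_factory=list)


def aggregate(votes: List[JudgeVote]) -> JudgeResult:
    """Score is the fraction of judges that passed the output (an assumption)."""
    if not votes:
        raise ValueError("at least one vote is required")
    passing = sum(1 for v in votes if v.passed)
    return JudgeResult(
        score=passing / len(votes),
        explanation=f"{passing} of {len(votes)} judges passed the output",
        votes=list(votes),
    )


result = aggregate([
    JudgeVote("judge-a", True, "Follows the rubric."),
    JudgeVote("judge-b", True, "Accurate and complete."),
    JudgeVote("judge-c", False, "Misses one policy condition."),
])
print(result.score, result.explanation)
```

Keeping the individual votes alongside the aggregate makes it possible to review why a borderline output passed or failed.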
How to choose
- Start with strict equality when you can.
- Use permitted values when exact matching is too rigid.
- Use judge-based scoring when human-like judgment is required.
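The guidance above can be read as a simple decision order. The sketch below encodes it in Python; the function name, parameters, and returned labels are hypothetical and serve only to show the order in which the options are considered.

```python
def choose_method(has_single_exact_answer: bool, has_fixed_answer_set: bool) -> str:
    """Pick the simplest evaluation method that fits (illustrative only)."""
    if has_single_exact_answer:
        return "strict equality"
    if has_fixed_answer_set:
        return "permitted values"
    # Neither exact nor constrained: correctness needs interpretation.
    return "judge-based scoring"


print(choose_method(True, False))
print(choose_method(False, True))
print(choose_method(False, False))
```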