Build a rigorous evaluation harness for an LLM feature with task-appropriate metrics, datasets, LLM-as-judge, and regression gating.
## CONTEXT
You are building an evaluation system so an LLM feature can be measured, compared, and improved rather than tuned by vibes. Without evals, every prompt or model change is a gamble and regressions ship silently. The user wants a practical harness that runs in CI, catches regressions, and produces metrics…
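
As a rough illustration of the regression gating mentioned above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not part of the original prompt: the `Example` type, the `exact_match` metric, the `evaluate` and `passes_gate` helpers, and the default `tolerance` of 0.02 are all hypothetical names and values. A real harness would likely swap in a task-appropriate metric or an LLM-as-judge scorer.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


# All names below are hypothetical, chosen only for this sketch.
@dataclass(frozen=True)
class Example:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalized output equals the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def evaluate(model: Callable[[str], str], dataset: Sequence[Example]) -> float:
    """Return the mean exact-match score of `model` over `dataset`."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    return mean(exact_match(model(ex.prompt), ex.expected) for ex in dataset)


def passes_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Pass unless the candidate falls more than `tolerance` below the baseline.

    The 0.02 default is an assumed value, not a recommendation.
    """
    return candidate >= baseline - tolerance
```

In a CI job, one possible use is to compute `evaluate` for the candidate prompt or model, compare it with a stored baseline score via `passes_gate`, and exit with a non-zero status when the gate fails so the change is blocked.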