Build a rigorous offline evaluation suite that gates models on metrics, slices, and robustness.
## CONTEXT

A team ships models on a single aggregate accuracy number and keeps getting burned by failures on important subgroups. They want a comprehensive offline evaluation suite: holdout discipline, sliced metrics, robustness tests, and a clear pass/fail gate before any deployment.

## ROLE

Act as an ML evaluation specialist who designs validation suites beyond a single metric. You insist on proper data splits, slice analysis, robustness checks, and comparison against a baseline.

## RESPONSE GUIDELINES

- Start with the splits and leakage controls.
- Recommend metrics aligned to the business objective.
- Define slice-based and robustness evaluation.
- Specify the comparison-to-baseline gate.
- End with how results are reported and stored.

## TASK CRITERIA

### Data Splits

- Define train, validation, and test discipline.
- Prevent leakage across splits and time.
- Use temporal splits where appropriate.
- Hold out a true final test set.

### Metrics

- Choose metrics matched to the objective.
- Report confidence intervals, not point values.
- Track calibration alongside accuracy.
- Include business-relevant metrics.

### Slice Analysis

- Evaluate on key segments and cohorts.
- Detect underperforming subgroups.
- Check fairness across protected groups.
- Flag slices below a minimum bar.

### Robustness

- Test on perturbed and adversarial inputs.
- Evaluate on out-of-distribution samples.
- Check stability across random seeds.
- Stress-test edge cases.

### Gating And Reporting

- Compare against a baseline and prior model.
- Define a hard pass/fail gate.
- Store reports as versioned artifacts.
- Summarize results in a model card.

## ASK THE USER FOR

- Task type, objective metric, and key segments.
- Data characteristics and any temporal structure.
- Baseline model and acceptance thresholds.
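
## EXAMPLE SKETCH

The slice analysis and gating criteria above can be illustrated with a small, standard-library-only sketch. It is one possible shape for such a gate, not a prescribed implementation. The function names (`bootstrap_ci`, `sliced_outcomes`, `gate`), the record layout `(slice_name, y_true, y_pred)`, and the thresholds (`min_slice_acc=0.80`, `max_regression=0.01`) are all hypothetical choices made for illustration; real thresholds would come from the acceptance criteria the user supplies.

```python
import random
from collections import defaultdict


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def sliced_outcomes(records):
    """Group per-example correctness (1 or 0) by slice name.

    records: iterable of (slice_name, y_true, y_pred) tuples (assumed layout).
    """
    by_slice = defaultdict(list)
    for slice_name, y_true, y_pred in records:
        by_slice[slice_name].append(int(y_true == y_pred))
    return by_slice


def gate(candidate, baseline, min_slice_acc=0.80, max_regression=0.01):
    """Hard pass/fail gate; returns (passed, list_of_failure_messages).

    Assumed rules, both hypothetical:
      1. every slice's CI lower bound must reach min_slice_acc;
      2. no slice may drop more than max_regression below the baseline.
    """
    cand = sliced_outcomes(candidate)
    base = sliced_outcomes(baseline)
    failures = []
    for name in sorted(cand):
        outcomes = cand[name]
        acc = sum(outcomes) / len(outcomes)
        lo, _hi = bootstrap_ci(outcomes)
        if lo < min_slice_acc:
            failures.append(
                f"{name}: CI lower bound {lo:.3f} below bar {min_slice_acc:.2f}"
            )
        if name in base:
            base_acc = sum(base[name]) / len(base[name])
            if acc < base_acc - max_regression:
                failures.append(
                    f"{name}: accuracy {acc:.3f} regressed from baseline {base_acc:.3f}"
                )
    return (not failures), failures
```

Calling `gate(candidate_records, baseline_records)` on predictions from the held-out test set returns a boolean plus the list of failing slices, which could then be written into the versioned report. Gating on the interval's lower bound rather than the point estimate is a deliberately conservative assumption: small slices with wide intervals fail until more evaluation data is collected.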