Build automated data quality checks that gate training and serving against bad input data.
## CONTEXT

A model trained on corrupted upstream data and shipped garbage predictions before anyone noticed. The team wants automated data validation that runs before training and at serving time, catching schema breaks, distribution shifts, and quality issues with clear pass/fail gates.

## ROLE

Act as a data quality engineer for ML, fluent in Great Expectations, Pandera, and TFDV-style validation. You treat data validation as a first-class gate, not an afterthought.

## RESPONSE GUIDELINES

- Start with the categories of data checks you will enforce.
- Recommend a validation tool and where checks run.
- Define how checks gate training and serving.
- Address schema evolution and expected drift.
- End with alerting and quarantine for failures.

## TASK CRITERIA

### Check Categories

- Validate schema, types, and nullability.
- Check value ranges and categorical domains.
- Verify uniqueness and referential constraints.
- Assess distribution and statistical properties.

### Check Placement

- Run checks at ingestion before storage.
- Gate training data before a run starts.
- Validate serving inputs in real time.
- Reuse check definitions across stages.

### Gating

- Define hard-fail versus warn-only checks.
- Block training on critical check failure.
- Reject or flag bad serving requests.
- Set thresholds from historical baselines.

### Schema Evolution

- Distinguish breaking from compatible changes.
- Version expectation suites with data.
- Allow controlled schema migrations.
- Detect silent upstream schema drift.

### Failure Handling

- Quarantine failing data with context.
- Alert owners with actionable detail.
- Log validation results for trends.
- Provide a remediation runbook.

## ASK THE USER FOR

- Data sources, formats, and volume.
- Existing validation tooling.
- Tolerance for blocking versus warning.
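
To illustrate the hard-fail versus warn-only distinction in the gating criteria, here is a minimal sketch in plain Python. It does not use Great Expectations, Pandera, or TFDV. The names `Check`, `Severity`, `run_checks`, and `gate`, the example columns `age` and `country`, and the 5% threshold are all hypothetical, chosen only for illustration. In practice the thresholds would come from historical baselines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Iterable, List, Mapping

Row = Mapping[str, Any]


class Severity(Enum):
    HARD_FAIL = "hard_fail"  # a failure blocks training or rejects the request
    WARN = "warn"            # a failure is logged and alerted but does not block


@dataclass(frozen=True)
class Check:
    name: str
    severity: Severity
    predicate: Callable[[Row], bool]  # returns True when a row passes
    max_failure_rate: float = 0.0     # hypothetical threshold, not a real baseline


@dataclass(frozen=True)
class CheckResult:
    name: str
    severity: Severity
    failure_rate: float
    passed: bool


def run_checks(rows: Iterable[Row], checks: Iterable[Check]) -> List[CheckResult]:
    """Evaluate every check against the batch and record its failure rate."""
    batch = list(rows)
    results = []
    for check in checks:
        failures = sum(1 for row in batch if not check.predicate(row))
        # An empty batch is treated as failing every check (an assumption).
        rate = failures / len(batch) if batch else 1.0
        results.append(
            CheckResult(check.name, check.severity, rate, rate <= check.max_failure_rate)
        )
    return results


def gate(results: Iterable[CheckResult]) -> bool:
    """Return True only if every hard-fail check passed; warn-only checks never block."""
    return all(r.passed for r in results if r.severity is Severity.HARD_FAIL)


# The same definitions can be reused at ingestion, before training, and at serving.
CHECKS = [
    Check("age_not_null", Severity.HARD_FAIL, lambda r: r.get("age") is not None),
    Check(
        "age_in_range",
        Severity.HARD_FAIL,
        lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
    ),
    Check(
        "country_in_domain",
        Severity.WARN,
        lambda r: r.get("country") in {"US", "DE", "JP"},
        max_failure_rate=0.05,
    ),
]

if __name__ == "__main__":
    batch = [{"age": 30, "country": "US"}, {"age": None, "country": "FR"}]
    results = run_checks(batch, CHECKS)
    for r in results:
        print(f"{r.name}: severity={r.severity.value} failure_rate={r.failure_rate:.2f} passed={r.passed}")
    print("gate open:", gate(results))
```

Keeping severity on the check definition, separate from the gate, lets one suite serve both a blocking training gate and a flag-only serving path.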