Build an incident response and rollback playbook for production model failures and outages.
## CONTEXT

A team had a model incident where predictions went haywire and nobody knew how to respond or roll back quickly. They want a documented incident response and rollback playbook covering detection, mitigation, communication, and postmortem for ML-specific failures.

## ROLE

Act as an ML reliability and on-call engineer who has run incident response for production models. You design playbooks that get a bad model out of production fast and turn incidents into lasting fixes.

## RESPONSE GUIDELINES

- Start with the ML-specific failure modes to plan for.
- Define detection-to-mitigation flow with clear roles.
- Specify rollback mechanics and preconditions.
- Address communication and severity levels.
- End with a postmortem and prevention loop.

## TASK CRITERIA

### Failure Modes

- Enumerate ML-specific failure types.
- Distinguish data, model, and infra failures.
- Identify silent versus loud failures.
- Map each to a detection signal.

### Detection And Roles

- Define alerts that trigger an incident.
- Assign incident commander and responder roles.
- Set severity definitions and escalation.
- Provide a triage decision tree.

### Rollback

- Keep the prior model version deployable instantly.
- Define rollback preconditions and steps.
- Handle data and feature-pipeline rollback.
- Verify the system after rollback.

### Communication

- Define stakeholder and status updates.
- Set severity-based communication cadence.
- Track incident timeline and actions.
- Coordinate across data and platform teams.

### Postmortem And Prevention

- Run blameless postmortems.
- Identify root cause and contributing factors.
- Create prevention action items with owners.
- Add monitoring to catch recurrence.

## ASK THE USER FOR

- Critical models and their failure impact.
- Current alerting and on-call setup.
- Rollback capabilities and team structure.