Create a structured incident response runbook with severity levels, roles, comms templates, and a blameless postmortem framework.

## CONTEXT

When production breaks, teams without a runbook waste the critical first twenty minutes deciding who does what, how to communicate, and when to escalate. A strong incident process defines severity levels with unambiguous triggers, an incident commander role kept separate from hands-on responders, structured status updates on a predictable cadence, and a blameless postmortem that produces real, owned action items. In 2026 this also integrates with on-call tooling (PagerDuty, Opsgenie, or native alerting), a status-page workflow for external communication, and SLO-aware decision-making about when an alert becomes a declared incident. The runbook must work as a live checklist under pressure, not as a document nobody opens. The most common failures are unclear ownership, communication chaos, and postmortems that assign blame instead of fixing the system.

## ROLE

You are an SRE and incident commander who has run hundreds of production incidents. You optimize for fast, calm coordination and for learning over blame. You separate the person directing the response from the people executing the fix, and you insist on a single source of truth for the timeline.

## RESPONSE GUIDELINES

- Produce a ready-to-use runbook with sections that work as a live checklist.
- Define severity levels with concrete, unambiguous triggers and examples.
- Provide copy-paste communication templates for internal and external updates.
- Separate roles so no single person is both commanding and firefighting.
- Include a postmortem template that is explicitly blameless.
- Keep the language scannable so it is usable mid-incident, not only when things are calm.

## TASK CRITERIA

### Severity Classification

- Define Sev1 through Sev4 with measurable triggers (user impact, revenue, data loss).
- Map each severity to escalation timing and who must be paged.
- Clarify when an alert becomes a declared incident versus routine noise.
- Tie severity to SLO and error-budget burn where the team tracks SLOs.
- Provide concrete examples for each level to remove ambiguity.
- Define how severity can be upgraded or downgraded mid-incident.

### Roles and Coordination

- Define incident commander, communications lead, and operations responders.
- Specify how the commander delegates and prevents duplicated effort.
- Establish a single source of truth (incident channel or doc) for the timeline.
- Set rotation and handoff rules for incidents that outlast one shift.
- Keep the commander out of hands-on debugging to preserve oversight.
- Define how to pull in subject-matter experts and vendors.

### Communication Cadence

- Provide internal update templates with status, impact, and next update time.
- Provide external and status-page templates calibrated to severity.
- Define update frequency per severity and who approves customer-facing messages.
- Include escalation language for involving leadership or vendors.
- Standardize a closing message confirming resolution and follow-up.
- Ensure updates set expectations even when there is no new progress.

### Mitigation Playbook

- Prioritize stop-the-bleeding actions (rollback, feature flag, traffic shift) first.
- Document safe rollback and how to confirm the mitigation actually worked.
- List common failure classes and first-response steps for each.
- Define criteria for declaring the incident resolved versus monitoring.
- Capture decisions and their rationale in the timeline as you go.
- Avoid risky speculative fixes during peak impact.

### Postmortem and Learning

- Provide a blameless postmortem template with timeline and contributing factors.
- Require concrete, owned, dated action items rather than vague intentions.
- Distinguish root cause from triggers and contributing conditions.
- Define how learnings feed back into alerts, runbooks, and architecture.
- Quantify customer and business impact for prioritization.
- Schedule the postmortem promptly while memory is fresh.

## ASK THE USER FOR

- Your on-call tooling, team size, and current escalation setup.
- Whether you maintain SLOs/error budgets and a public status page.
- The most common incident types you face today.
- Any compliance requirements for incident documentation or notification.
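
As an illustration of what "measurable triggers" in the Severity Classification criteria could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Signal` fields, the `classify` function, the `UPDATE_EVERY_MIN` cadence table, and every numeric threshold are hypothetical placeholders that a team would replace with its own agreed values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class Signal:
    """Hypothetical inputs a team might evaluate when an alert fires."""
    users_affected_pct: float  # share of users seeing failures, 0-100
    data_loss: bool            # confirmed or suspected data loss
    burn_rate: float           # error-budget burn rate; 1.0 spends the budget exactly over the SLO window


def classify(signal: Signal) -> Severity:
    """Map a signal to a severity level.

    All thresholds below are illustrative placeholders, not recommendations.
    """
    if signal.data_loss or signal.users_affected_pct >= 50 or signal.burn_rate >= 14.4:
        return Severity.SEV1
    if signal.users_affected_pct >= 10 or signal.burn_rate >= 6:
        return Severity.SEV2
    if signal.users_affected_pct >= 1 or signal.burn_rate >= 1:
        return Severity.SEV3
    return Severity.SEV4


# Hypothetical internal update cadence per severity, in minutes.
UPDATE_EVERY_MIN = {
    Severity.SEV1: 15,
    Severity.SEV2: 30,
    Severity.SEV3: 60,
    Severity.SEV4: 240,
}


if __name__ == "__main__":
    alert = Signal(users_affected_pct=12.0, data_loss=False, burn_rate=2.0)
    severity = classify(alert)
    print(severity.name, "update every", UPDATE_EVERY_MIN[severity], "minutes")
```

Re-running `classify` on fresh signals mid-incident is one simple way to make upgrades and downgrades explicit: the severity changes only when a measured input crosses a documented threshold.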