Incident response playbook for small and medium teams
In brief: When the production server goes down at 23:30 on a Friday, it's not the time to invent a process from scratch. An incident response playbook is a document that defines roles, communication and procedures before you need them. Here's a minimal playbook suitable for a team of 5-20 people.
Roles during an incident
Clear roles eliminate chaos. Even a small team needs at least:
- Incident Commander (IC) - manages the incident, makes decisions, escalates. Doesn't write code.
- Technical lead - the person who knows the problem area. Writes the fix.
- Communications lead - updates the status page, customers and management. In a small team this can be the same person as the IC.
- Scribe - records the timeline: who did what, and when. Critical for the post-mortem.
In a team of up to 5 people, typically only the IC and Technical lead roles are split between two people; the IC takes on the remaining roles.
Severity levels
Define 3-4 severity levels in advance:
- SEV1 (Critical) - main service completely unavailable. Respond immediately, page even at night.
- SEV2 (Major) - significant degradation (some features or some users affected). Response within 30 min during work hours, within 1 h outside them.
- SEV3 (Minor) - small impact, workaround exists. Response by next work day.
- SEV4 (Cosmetic) - no user-facing impact, goes into normal queue.
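The levels above can be written down as data so that alerting scripts and humans read the same definition. The following is a minimal sketch, not a definitive implementation: the names `Severity`, `RESPONSE_WINDOWS` and `response_deadline` are hypothetical, and treating "respond immediately" as a zero-length window is an assumption.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Cosmetic"


# (during work hours, outside work hours) time allowed before first response.
# SEV3 and SEV4 have no clock-based deadline in this sketch: they are handled
# on the next work day or through the normal queue.
RESPONSE_WINDOWS = {
    Severity.SEV1: (timedelta(0), timedelta(0)),  # assumption: "immediately" = 0
    Severity.SEV2: (timedelta(minutes=30), timedelta(hours=1)),
}


def response_deadline(severity: Severity, detected_at: datetime,
                      in_work_hours: bool) -> Optional[datetime]:
    """Return the latest acceptable first-response time, or None if no clock applies."""
    window = RESPONSE_WINDOWS.get(severity)
    if window is None:
        return None
    during, outside = window
    return detected_at + (during if in_work_hours else outside)


detected = datetime(2024, 1, 5, 23, 30)  # a Friday, 23:30, outside work hours
print(response_deadline(Severity.SEV2, detected, in_work_hours=False))
# prints 2024-01-06 00:30:00
```

Keeping the windows in one table means a change to the policy is a one-line edit that everyone can review.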
SEV1 procedure: the first 15 minutes
- 00:00 - Detection. An alert comes from monitoring or from a customer. Someone on the on-call rotation confirms it's a real problem.
- 00:02 - IC activation. A dedicated communication channel opens (Slack #incident-N or Discord). The Technical lead is called.
- 00:05 - First status update. On the public status page: "Investigating reports of [problem]." Email the internal people who need to know.
- 00:10 - Initial diagnostics. Check recent deploys, monitoring graphs, error logs. What changed in the last 30-60 min?
- 00:15 - Decision: rollback, hotfix, or workaround? If not immediately clear, IC escalates or calls more engineers.
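The timeline above can double as a checklist the IC or Scribe consults during the incident. This is a rough sketch under stated assumptions: `SEV1_CHECKLIST` and `steps_due` are hypothetical names, and the step wording is shortened from the list above.

```python
from typing import List, Set

# (minutes after detection, step), taken from the timeline above.
SEV1_CHECKLIST = [
    (0, "Detection confirmed by on-call"),
    (2, "IC activated, incident channel opened, technical lead called"),
    (5, "First status page update posted"),
    (10, "Initial diagnostics: recent deploys, graphs, error logs"),
    (15, "Decision: rollback, hotfix, or workaround"),
]


def steps_due(elapsed_minutes: int, completed: Set[str]) -> List[str]:
    """Return steps whose target minute has passed and that are not done yet."""
    return [
        step
        for due_at, step in SEV1_CHECKLIST
        if due_at <= elapsed_minutes and step not in completed
    ]


# Six minutes in, only detection has been confirmed so far.
done = {"Detection confirmed by on-call"}
for step in steps_due(6, done):
    print("DUE:", step)
```

In this example the IC activation and the first status update are both reported as due, which is the cue to act on them before continuing diagnostics.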
Communication: the "5 + 30" rule
Update status page at least every 30 minutes during SEV1, even if you have no new info. "Still investigating, ETA still unknown" is a better update than silence - customers at least know someone is working on the problem.
The first update must go out within 5 minutes of detection, regardless of whether you know the cause.
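As a small worked example of the "5 + 30" rule, the sketch below computes when the next status update is due. The function name `next_update_due` and its parameters are hypothetical; only the 5-minute and 30-minute values come from the rule itself.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=5)   # the "5"
UPDATE_INTERVAL = timedelta(minutes=30)      # the "30"


def next_update_due(detected_at: datetime,
                    last_update_at: Optional[datetime] = None) -> datetime:
    """Return the deadline for the next status page update during a SEV1."""
    if last_update_at is None:
        # Nothing posted yet: the first update is due 5 minutes after detection.
        return detected_at + FIRST_UPDATE_WITHIN
    # Otherwise the clock restarts from the most recent update.
    return last_update_at + UPDATE_INTERVAL


detected = datetime(2024, 1, 5, 23, 30)
print(next_update_due(detected))                                  # 2024-01-05 23:35:00
print(next_update_due(detected, datetime(2024, 1, 5, 23, 34)))    # 2024-01-06 00:04:00
```

Counting the 30 minutes from the last update actually posted, rather than from detection, is a design choice in this sketch: it means an early update never lengthens the gap customers see.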
Rollback as default
For a SEV1 that started shortly after a deploy, rollback is the first choice, not a hotfix. A hotfix written under pressure at night is a source of more bugs. Rollback restores a known good state and buys time for calm diagnostics in the morning.
For rollback you need:
- Deployment versioning (Docker tags, Git tags, deployment artifact)
- Backwards-compatible database migrations (if the new version drops a column, the old version can no longer run against the schema)
- Documented rollback procedure (commands, how long it takes, who can run it)
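The rollback-first rule can be sketched as a small decision helper. Everything here is illustrative: `first_choice` and `RECENT_DEPLOY_WINDOW` are hypothetical names, the 60-minute window is an assumption standing in for "shortly after a deploy", and `rollback_safe` stands for the prerequisites listed above (versioned deploys, backwards-compatible migrations, a documented procedure).

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "shortly after a deploy" means within the last 60 minutes.
RECENT_DEPLOY_WINDOW = timedelta(minutes=60)


def first_choice(incident_start: datetime,
                 last_deploy_at: Optional[datetime],
                 rollback_safe: bool) -> str:
    """Suggest the first action to consider for a SEV1."""
    if last_deploy_at is None:
        return "investigate"
    since_deploy = incident_start - last_deploy_at
    # A deploy after the incident began, or long before it, is not the prime suspect.
    if not (timedelta(0) <= since_deploy <= RECENT_DEPLOY_WINDOW):
        return "investigate"
    # Rollback only helps if the old version can still run (e.g. schema intact).
    return "rollback" if rollback_safe else "hotfix-or-workaround"


incident = datetime(2024, 1, 5, 23, 30)
print(first_choice(incident, datetime(2024, 1, 5, 23, 0), rollback_safe=True))
# prints rollback
```

The helper only suggests a starting point; the IC still makes the call.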
Post-mortem within 48 hours
After every SEV1/SEV2 incident, write a document with this structure:
- Summary - 2-3 sentences on what happened, how long it lasted, and who it affected
- Timeline - exact times of detection, key actions, resolution
- Root cause - why it happened (technical reason + procedural reason)
- Impact - number of affected customers, business impact
- What went well - what worked (give the team credit)
- What went wrong - where we lost time
- Action items - specific tasks with owner and deadline (not "test better" - rather "add integration test for payment flow within 14 days, owner: X")
Blameless post-mortem rule: You're not looking for a culprit, you're looking for a systemic weakness. "John deployed a bad version" is the wrong conclusion - the right one is "the deploy process didn't include a canary phase that would catch the bug before full rollout."
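The requirement that every action item has an owner and a deadline is easy to check mechanically before a post-mortem is marked as done. A minimal sketch, assuming a hypothetical `ActionItem` record and `missing_fields` helper (neither comes from any particular tool):

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    task: str
    owner: Optional[str] = None
    deadline: Optional[date] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of the required fields that this action item lacks."""
    missing = []
    if not item.owner:
        missing.append("owner")
    if item.deadline is None:
        missing.append("deadline")
    return missing


vague = ActionItem("Test better")
specific = ActionItem(
    "Add integration test for payment flow",
    owner="X",
    deadline=date(2024, 1, 19),  # illustrative date, 14 days after the incident
)
print(missing_fields(vague))     # ['owner', 'deadline']
print(missing_fields(specific))  # []
```

A check like this catches missing owners and dates, but not vague wording: whether "test better" is specific enough still needs a human reviewer.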
On-call rotation
For a team with a 24/7 product:
- Weekly rotations - shorter shifts lose continuity of context, longer ones burn people out
- Always primary + secondary on-call. Secondary takes over when primary doesn't respond within 5 min.
- Compensation (extra pay or time off) for night and weekend paging
- Monthly "on-call review" - who was woken up most often, and what can be automated
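The weekly primary + secondary rotation and the 5-minute handover can be sketched as follows. The names `on_call_pair` and `current_responder` are hypothetical, and making next week's primary this week's secondary is an assumption, one of several reasonable ways to pair people.

```python
from typing import List, Tuple

ESCALATE_AFTER_MINUTES = 5  # secondary takes over after this long without a response


def on_call_pair(team: List[str], week_index: int) -> Tuple[str, str]:
    """Return (primary, secondary) for a given week of a round-robin rotation."""
    if len(team) < 2:
        raise ValueError("primary + secondary on-call needs at least two people")
    primary = team[week_index % len(team)]
    # Assumption: the secondary is whoever is primary next week.
    secondary = team[(week_index + 1) % len(team)]
    return primary, secondary


def current_responder(primary: str, secondary: str,
                      minutes_since_page: int, acknowledged: bool) -> str:
    """Return who should be handling the page right now."""
    if not acknowledged and minutes_since_page >= ESCALATE_AFTER_MINUTES:
        return secondary
    return primary


team = ["Ana", "Ben", "Eva"]  # placeholder names
primary, secondary = on_call_pair(team, week_index=2)
print(primary, secondary)                                           # Eva Ana
print(current_responder(primary, secondary, 6, acknowledged=False))  # Ana
```

In practice a paging tool handles this escalation; the sketch only makes the rule explicit so it can be reviewed alongside the rest of the playbook.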
Tools
- Monitoring (ePulz.io, Datadog, New Relic) - detection + alerting
- Paging (PagerDuty, Opsgenie, Better Stack) - escalation, rotation, SMS/voice
- Status page (own, ePulz.io, StatusPage.io) - external communication
- Runbook hosting (Notion, GitHub Wiki, internal docs) - playbooks and runbooks
- Postmortem template (Confluence, Notion template) - standardization
Conclusion
Incident response can't be improvised under pressure. 30 minutes invested in writing a playbook pays off at the first major outage - less chaos, shorter MTTR, better communication with customers and a team that knows what to do.
Start with automatic detection
ePulz.io solves the first minutes of incident response: detection + alerting via e-mail, Telegram and webhook to Slack / Discord / PagerDuty.