Incident response playbook for small and medium teams

In brief: When the production server goes down at 23:30 on a Friday, that is not the moment to invent a process from scratch. An incident response playbook is a document that defines roles, communication and procedures before you need them. Here is a minimal playbook suitable for a team of 5-20 people.

Roles during an incident

Clear roles eliminate chaos. Even a small team needs at least:

Incident Commander (IC) - manages the incident, makes decisions, escalates. Doesn't write code.
Technical lead - the person who knows the problem area. Writes the fix.
Communications lead - updates the status page, customers and management. In a small team this can be the same person as the IC.
Scribe - records the timeline: who did what, and when. Critical for the post-mortem.

In a team of up to 5 people, typically only two roles are split between different people, the IC and the Technical lead, and the IC takes on the remaining ones.

Severity levels

Define 3-4 severity levels in advance:

SEV1 (Critical) - main service completely unavailable. Respond immediately, page even at night.
SEV2 (Major) - significant degradation (some features broken, or some users affected). Response within 30 min during work hours, within 1 h outside them.
SEV3 (Minor) - small impact, workaround exists. Response by next work day.
SEV4 (Cosmetic) - no user-facing impact, goes into normal queue.
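
The levels above can be written down as data so that alerting scripts and people read the same numbers. The following is a minimal sketch, not a reference implementation: the names SeverityPolicy, POLICIES and response_deadline are hypothetical, "immediately" is modelled as a zero delay, and SEV3/SEV4 carry no clock-based deadline because the text only says "next work day" and "normal queue".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    # Maximum time to first response; None means no clock-based deadline.
    work_hours: Optional[timedelta]
    off_hours: Optional[timedelta]
    note: str = ""


# Values mirror the severity list in the text; the structure itself is an assumption.
POLICIES = {
    "SEV1": SeverityPolicy("Critical", timedelta(0), timedelta(0), "page even at night"),
    "SEV2": SeverityPolicy("Major", timedelta(minutes=30), timedelta(hours=1)),
    "SEV3": SeverityPolicy("Minor", None, None, "respond by next work day"),
    "SEV4": SeverityPolicy("Cosmetic", None, None, "normal queue"),
}


def response_deadline(severity: str, during_work_hours: bool) -> Optional[timedelta]:
    """Return the maximum time to first response, or None if there is no fixed deadline."""
    policy = POLICIES[severity]
    return policy.work_hours if during_work_hours else policy.off_hours
```

Keeping the table in one place means a change such as tightening the SEV2 response time is a one-line edit rather than a hunt through several scripts.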

SEV1 procedure: the first 15 minutes

00:00 - Detection. An alert comes from monitoring or from a customer. Someone on the on-call rotation confirms it is a real problem.
00:02 - IC activation. A dedicated communication channel is opened (Slack #incident-N or Discord). The Technical lead is called in.
00:05 - First status update. On the public status page: "Investigating reports of [problem]." An email goes to the internal people who need to know.
00:10 - Initial diagnostics. Check recent deploys, monitoring graphs, error logs. What changed in the last 30-60 min?
00:15 - Decision: rollback, hotfix, or workaround? If not immediately clear, IC escalates or calls more engineers.
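
The "what changed in the last 30-60 min?" question from the 00:10 step can be answered mechanically if deploys are logged with timestamps. The sketch below assumes such a log exists as a list of records; the Deploy type and the recent_deploys function are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deploy(NamedTuple):
    version: str
    deployed_at: datetime


def recent_deploys(
    deploys: List[Deploy],
    incident_start: datetime,
    window: timedelta = timedelta(minutes=60),
) -> List[Deploy]:
    """Return deploys that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    suspects = [d for d in deploys if earliest <= d.deployed_at <= incident_start]
    return sorted(suspects, key=lambda d: d.deployed_at, reverse=True)
```

The newest deploy in the result is usually the first candidate to examine, which ties in with the decision at 00:15.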

Communication: the "5 + 30" rule

Update status page at least every 30 minutes during SEV1, even if you have no new info. "Still investigating, ETA still unknown" is a better update than silence - customers at least know someone is working on the problem.

The first update must go out within 5 minutes of detection, regardless of whether you know the cause.
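
The "5 + 30" rule is easy to turn into a reminder check. This is a minimal sketch under the assumption that the detection time and the time of the last status update are tracked somewhere; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=5)   # first update after detection
UPDATE_INTERVAL = timedelta(minutes=30)      # every later update during SEV1


def next_update_due(detected_at: datetime, last_update_at: Optional[datetime]) -> datetime:
    """Return the latest moment by which the next status update should be posted."""
    if last_update_at is None:
        return detected_at + FIRST_UPDATE_WITHIN
    return last_update_at + UPDATE_INTERVAL


def is_update_overdue(
    now: datetime, detected_at: datetime, last_update_at: Optional[datetime] = None
) -> bool:
    """True if the deadline for the next status update has already passed."""
    return now > next_update_due(detected_at, last_update_at)
```

A bot in the incident channel could call is_update_overdue once a minute and nudge the Communications lead when it returns True.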

Rollback as default

For a SEV1 that started shortly after a deploy, rollback is the first choice, not a hotfix. A hotfix written under pressure at night tends to introduce more bugs. A rollback restores a known good state and buys time for calm diagnostics in the morning.

For rollback you need:

Deployment versioning (Docker tags, Git tags, deployment artifact)
Backward-compatible database migrations (if the new version drops a column, the old version will not work after a rollback)
Documented rollback procedure (commands, how long it takes, who can run it)
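
The first two requirements can be checked before anyone runs a rollback command. The sketch below assumes each release is recorded with its tag and a flag saying whether its schema changes are backward compatible; Release and rollback_target are hypothetical names, and how that flag gets set is left to your migration process.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Release:
    tag: str
    # True if the previous release can still run against this release's schema.
    schema_backward_compatible: bool


def rollback_target(history: List[Release]) -> Optional[Release]:
    """Return the release to roll back to, or None if a plain rollback is not safe.

    `history` is ordered oldest to newest; the last entry is the live release.
    """
    if len(history) < 2:
        return None  # nothing to roll back to
    current, previous = history[-1], history[-2]
    if not current.schema_backward_compatible:
        return None  # the old code would break on the new schema
    return previous
```

When the function returns None, the IC knows right away that the rollback path is closed and the decision shifts to a hotfix or a workaround.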

Post-mortem within 48 hours

After every SEV1/SEV2 incident, write a document with the following structure:

Summary - 2-3 sentences on what happened, how long it lasted and whom it affected
Timeline - exact times of detection, key actions, resolution
Root cause - why it happened (technical reason + procedural reason)
Impact - number of affected customers, business impact
What went well - what we did well (acknowledge the team's work)
What went wrong - where we lost time
Action items - specific tasks with owner and deadline (not "test better" - rather "add integration test for payment flow within 14 days, owner: X")
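
To keep every post-mortem in the same shape, the structure above can be generated as a skeleton. This is an illustrative sketch only: SECTIONS, ActionItem and render_postmortem are hypothetical names, and the plain-text output format is an assumption. The ActionItem type rejects items without an owner, in line with the "owner and deadline" requirement.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

SECTIONS = [
    "Summary", "Timeline", "Root cause", "Impact",
    "What went well", "What went wrong", "Action items",
]


@dataclass
class ActionItem:
    task: str
    owner: str
    due: date

    def __post_init__(self) -> None:
        # An action item without an owner tends never to get done.
        if not self.owner.strip():
            raise ValueError("action item needs an owner")


def render_postmortem(title: str, action_items: List[ActionItem]) -> str:
    """Render a plain-text post-mortem skeleton with the action items filled in."""
    lines = [f"Post-mortem: {title}", ""]
    for section in SECTIONS:
        lines.append(section)
        if section == "Action items":
            for item in action_items:
                lines.append(f"- {item.task} (owner: {item.owner}, due: {item.due.isoformat()})")
        else:
            lines.append("TODO")
        lines.append("")
    return "\n".join(lines)
```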

Blameless post-mortem rule: You're not looking for a culprit, you're looking for a systemic weakness. "John deployed a bad version" is the wrong conclusion - the right one is "the deploy process didn't include a canary phase that would catch the bug before full rollout."

On-call rotation

For a team with a 24/7 product:

Weekly rotations - shorter ones lose continuity of context, longer ones burn people out
Always primary + secondary on-call. Secondary takes over when primary doesn't respond within 5 min.
Compensation (extra payment, time-off) for night and weekend paging
Monthly "on-call review" - who was often woken up, what can be automated
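
The weekly rotation and the 5-minute escalation can be sketched in a few lines. The example assumes a simple round-robin over a fixed list of engineers, with the next person in the list acting as secondary; on_call_pair and who_to_page are hypothetical names, and paging tools such as those listed under Tools provide this behaviour out of the box.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional, Tuple

ESCALATE_AFTER = timedelta(minutes=5)


def on_call_pair(engineers: List[str], rotation_start: date, today: date) -> Tuple[str, str]:
    """Weekly round-robin: return (primary, secondary) for the week containing `today`."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    week = (today - rotation_start).days // 7
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary


def who_to_page(
    primary: str, secondary: str, paged_at: datetime, now: datetime, acknowledged: bool
) -> Optional[str]:
    """Return who should be paged right now, or None once the page is acknowledged."""
    if acknowledged:
        return None
    if now - paged_at >= ESCALATE_AFTER:
        return secondary  # primary did not respond within 5 minutes
    return primary
```

With this ordering, this week's secondary becomes next week's primary, so the person taking over already has some context from the previous week.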

Tools

Monitoring (ePulz.io, Datadog, New Relic) - detection + alerting
Paging (PagerDuty, Opsgenie, Better Stack) - escalation, rotation, SMS/voice
Status page (own, ePulz.io, StatusPage.io) - external communication
Runbook hosting (Notion, GitHub Wiki, internal docs) - playbooks and runbooks
Postmortem template (Confluence, Notion template) - standardization

Conclusion

Incident response can't be improvised under pressure. 30 minutes invested in writing a playbook pays off at the first major outage - less chaos, shorter MTTR, better communication with customers and a team that knows what to do.

Start with automatic detection

ePulz.io solves the first minutes of incident response: detection + alerting via e-mail, Telegram and webhook to Slack / Discord / PagerDuty.