
Incident response playbook for small and medium teams


In brief: When the production server goes down at 23:30 on a Friday, that is not the time to invent a process from scratch. An incident response playbook is a document that defines roles, communication and procedures before you need them. Here is a minimal playbook suitable for a team of 5-20 people.


Roles during an incident

Clear roles eliminate chaos. Even a small team needs at least:

  • Incident Commander (IC) - runs the incident, makes decisions, escalates. Does not write code.
  • Technical lead - the person who knows the affected area best. Writes the fix.
  • Communications lead - updates the status page, customers and management. In a small team this can be the same person as the IC.
  • Scribe - records the timeline: who did what, and when. Critical for the post-mortem.

In a team of up to 5 people the work typically splits into two roles: the IC and the technical lead, with the IC also covering communications and the scribe duties.

Severity levels

Define 3-4 severity levels in advance:

  • SEV1 (Critical) - main service completely unavailable. Respond immediately, page even at night.
  • SEV2 (Major) - significant degradation (some features or some users affected). Respond within 30 min during work hours, within 1 h outside them.
  • SEV3 (Minor) - small impact, a workaround exists. Respond by the next working day.
  • SEV4 (Cosmetic) - no user-facing impact, goes into the normal queue.
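
The levels above can be kept as data next to the playbook, so that tooling and people read the same numbers. The following is only a minimal sketch: the names `Severity`, `ResponsePolicy`, `POLICIES` and `response_target` are illustrative assumptions, and `None` stands for "no fixed time limit" (next working day or normal queue).

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Cosmetic"


@dataclass(frozen=True)
class ResponsePolicy:
    # Maximum time to first response; None means there is no fixed limit.
    work_hours: Optional[timedelta]
    off_hours: Optional[timedelta]
    note: str


# Hypothetical encoding of the four levels described in the text.
POLICIES = {
    Severity.SEV1: ResponsePolicy(timedelta(0), timedelta(0), "respond immediately, page even at night"),
    Severity.SEV2: ResponsePolicy(timedelta(minutes=30), timedelta(hours=1), "significant degradation"),
    Severity.SEV3: ResponsePolicy(None, None, "respond by the next working day"),
    Severity.SEV4: ResponsePolicy(None, None, "normal queue"),
}


def response_target(severity: Severity, during_work_hours: bool) -> Optional[timedelta]:
    """Return the maximum time to first response, or None if there is no fixed limit."""
    policy = POLICIES[severity]
    return policy.work_hours if during_work_hours else policy.off_hours
```

For example, `response_target(Severity.SEV2, during_work_hours=False)` gives a one-hour limit, while SEV3 and SEV4 return `None` and are handled through the normal planning process.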

SEV1 procedure: the first 15 minutes

  1. 00:00 - Detection. An alert comes from monitoring or from a customer. Someone on the on-call rotation confirms it is a real problem.
  2. 00:02 - IC activation. A dedicated communication channel is opened (Slack #incident-N or Discord). The technical lead is called in.
  3. 00:05 - First status update. On the public status page: "Investigating reports of [problem]." Email the internal people who need to know.
  4. 00:10 - Initial diagnostics. Check recent deploys, monitoring graphs and error logs. What changed in the last 30-60 min?
  5. 00:15 - Decision: rollback, hotfix or workaround? If the answer is not immediately clear, the IC escalates or calls in more engineers.
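
The diagnostic question in step 4 ("what changed in the last 30-60 min?") is easy to answer mechanically if deploys are recorded with a timestamp. Below is a rough sketch under that assumption; the `Deploy` record and the `recent_deploys` helper are hypothetical and not tied to any particular deployment tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Deploy:
    version: str
    deployed_at: datetime


def recent_deploys(
    deploys: Sequence[Deploy],
    incident_start: datetime,
    lookback: timedelta = timedelta(minutes=60),
) -> List[Deploy]:
    """Return deploys made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [d for d in deploys if window_start <= d.deployed_at <= incident_start]
    return sorted(hits, key=lambda d: d.deployed_at, reverse=True)
```

An empty result suggests the cause lies somewhere other than a recent release (infrastructure, traffic, an external dependency), which is useful input for the decision at 00:15.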

Communication: the "5 + 30" rule

Update the status page at least every 30 minutes during a SEV1, even if you have no new information. "Still investigating, ETA still unknown" is a better update than silence - customers at least know that someone is working on the problem.

The first update must go out within 5 minutes of detection, regardless of whether you know the cause.
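
The two deadlines of the rule can be written down as a small calculation, for example to drive a reminder for the communications lead. This is an illustrative sketch only; the function and constant names are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE = timedelta(minutes=5)      # first update after detection
UPDATE_INTERVAL = timedelta(minutes=30)  # every later update during a SEV1


def next_update_due(detected_at: datetime, last_update_at: Optional[datetime] = None) -> datetime:
    """Return the time by which the next status page update must be posted."""
    if last_update_at is None:
        return detected_at + FIRST_UPDATE
    return last_update_at + UPDATE_INTERVAL


def is_overdue(now: datetime, detected_at: datetime, last_update_at: Optional[datetime] = None) -> bool:
    """True if the next update should already have been posted."""
    return now > next_update_due(detected_at, last_update_at)
```

With detection at 23:30 and no update yet, the first update is due at 23:35; after an update at 23:34, the next one is due at 00:04.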

Rollback as default

For a SEV1 that started shortly after a deploy, rollback is the first choice, not a hotfix. A hotfix written under pressure at night tends to introduce more bugs. A rollback restores a known good state and buys time for calm diagnostics in the morning.

For rollback you need:

  • Deployment versioning (Docker tags, Git tags, deployment artifacts)
  • Backwards-compatible database migrations (if the new version drops a column, rolling the code back does not bring the column back, and the old version will fail)
  • A documented rollback procedure (the commands, how long it takes, who may run it)
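
With versioned deployments, choosing the rollback target can be reduced to a lookup. The sketch below assumes a release history ordered from oldest to newest, where the last entry is what is currently running and each release carries a `known_good` flag set once it has run in production without incident. Both the flag and the `Release` type are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Release:
    tag: str          # e.g. a Docker or Git tag
    known_good: bool  # assumption: set after the release has proven stable


def rollback_target(history: Sequence[Release]) -> Optional[Release]:
    """Return the newest known-good release before the one currently deployed.

    `history` is ordered oldest to newest; the last item is the current release.
    Returns None when there is nothing safe to roll back to.
    """
    for release in reversed(history[:-1]):
        if release.known_good:
            return release
    return None
```

This only picks the version. Whether the database schema still matches that version is a separate check, which is why backwards-compatible migrations are on the list.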

Post-mortem within 48 hours

After every SEV1/SEV2 incident, write a document with the following structure:

  1. Summary - 2-3 sentences on what happened, how long it lasted and who it affected
  2. Timeline - exact times of detection, key actions and resolution
  3. Root cause - why it happened (technical reason + procedural reason)
  4. Impact - number of affected customers, business impact
  5. What went well - what worked (give the team credit)
  6. What went wrong - where we lost time
  7. Action items - specific tasks with an owner and a deadline (not "test better", but rather "add integration test for payment flow within 14 days, owner: X")

Blameless post-mortem rule: You're not looking for a culprit, you're looking for a systemic weakness. "John deployed a bad version" is the wrong conclusion - the right one is "the deploy process didn't include a canary phase that would catch the bug before full rollout."
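
The requirement that every action item has an owner and a deadline can be checked before the post-mortem is closed. The following is a minimal sketch; the `ActionItem` type is hypothetical, and the word-count test for vagueness is a deliberately crude assumption, not an established rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ActionItem:
    description: str
    owner: str
    due: Optional[date]


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is not yet specific enough."""
    issues = []
    if not item.owner.strip():
        issues.append("missing owner")
    if item.due is None:
        issues.append("missing deadline")
    # Crude heuristic (assumption): very short descriptions such as
    # "test better" rarely describe a concrete task.
    if len(item.description.split()) < 4:
        issues.append("description too vague")
    return issues
```

An item like "add integration test for payment flow" with an owner and a due date passes. "test better" with no owner and no date fails on all three counts.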

On-call rotation

For a team with a 24/7 product:

  • Weekly rotations - a compromise between the risk of burnout and continuity of context
  • Always a primary and a secondary on-call. The secondary takes over when the primary does not respond within 5 min.
  • Compensation (extra pay, time off) for night and weekend pages
  • Monthly "on-call review" - who was woken up often, and what can be automated
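
A weekly primary/secondary rotation and the 5-minute escalation can be sketched in a few lines. Paging tools normally handle this for you; the code below only illustrates the logic, and the function names and the simple round-robin scheme are assumptions.

```python
from typing import Sequence, Tuple

ESCALATION_MINUTES = 5  # the secondary takes over after this long without a response


def on_call_for_week(team: Sequence[str], week_index: int) -> Tuple[str, str]:
    """Return (primary, secondary) for the given week using round-robin.

    The secondary is next in line, so they become primary the following week.
    """
    if len(team) < 2:
        raise ValueError("need at least two people for primary + secondary")
    primary = team[week_index % len(team)]
    secondary = team[(week_index + 1) % len(team)]
    return primary, secondary


def who_to_page(primary: str, secondary: str, minutes_since_page: float, acknowledged: bool) -> str:
    """Return who should be handling the page right now."""
    if not acknowledged and minutes_since_page >= ESCALATION_MINUTES:
        return secondary
    return primary
```

Making this week's secondary next week's primary means the incoming primary has already followed a week's worth of pages, which helps preserve context across handoffs.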

Tools

  • Monitoring (ePulz.io, Datadog, New Relic) - detection + alerting
  • Paging (PagerDuty, Opsgenie, Better Stack) - escalation, rotation, SMS/voice
  • Status page (self-hosted, ePulz.io, StatusPage.io) - external communication
  • Runbook hosting (Notion, GitHub Wiki, internal docs) - playbooks and runbooks
  • Postmortem template (Confluence, Notion template) - standardization

Conclusion

Incident response can't be improvised under pressure. Thirty minutes invested in writing a playbook pay off at the first major outage: less chaos, shorter MTTR, better communication with customers and a team that knows what to do.

Start with automatic detection

ePulz.io solves the first minutes of incident response: detection + alerting via e-mail, Telegram and webhook to Slack / Discord / PagerDuty.
