Incident response playbook for small and medium teams

When a production server dies at night, there is no time to invent a process. This is a minimal incident response playbook for small teams, covering roles, communication and step-by-step procedures.

Roles during an incident

Clearly assigned roles eliminate chaos. Even a small team needs at least:

Incident Commander (IC) - manages the incident, makes decisions, escalates. Doesn't write code.
Technical lead - the person who knows the problem area. Writes the fix.
Communications lead - updates the status page, customers and management. In a small team this can be the same person as the IC.
Scribe - records the timeline: who does what, and when. Critical input for the post-mortem.

In a team of up to 5 people, the IC and the Technical lead are usually kept as two separate people, and the IC covers the remaining roles.

Severity levels

Define 3-4 severity levels in advance:

SEV1 (Critical) - main service completely unavailable. Respond immediately, page even at night.
SEV2 (Major) - significant degradation (some features or some users affected). Response within 30 min during work hours, within 1 h outside them.
SEV3 (Minor) - small impact, a workaround exists. Response by the next work day.
SEV4 (Cosmetic) - no user-facing impact, goes into the normal queue.
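
The response targets above can be written down as data so that alerting scripts and people agree on them. The following is a minimal sketch, not a prescribed implementation: the names `Severity` and `response_target` are hypothetical, and modelling "respond immediately" as a zero-length interval is an assumption.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical"
    SEV2 = "Major"
    SEV3 = "Minor"
    SEV4 = "Cosmetic"


def response_target(severity: Severity, during_work_hours: bool) -> Optional[timedelta]:
    """Return the maximum time to first response, or None if it is not a fixed interval."""
    if severity is Severity.SEV1:
        # Assumption: "respond immediately" is modelled as a zero-length target.
        return timedelta(0)
    if severity is Severity.SEV2:
        return timedelta(minutes=30) if during_work_hours else timedelta(hours=1)
    # SEV3 ("by the next work day") and SEV4 ("normal queue") depend on the
    # team's calendar rather than a fixed interval, so no timedelta is returned.
    return None
```

Returning `None` for SEV3 and SEV4 keeps the sketch honest: those deadlines depend on the team's working calendar, which is not defined here.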

SEV1 procedure: the first 15 minutes

00:00 - Detection. An alert comes from monitoring or from a customer. Someone on the on-call rotation confirms it's a real problem.
00:02 - IC activation. The IC opens a dedicated communication channel (Slack #incident-N or Discord) and calls in the Technical lead.
00:05 - First status update. On the public status page: "Investigating reports of [problem]." Notify internal stakeholders by email.
00:10 - Initial diagnostics. Check recent deploys, monitoring graphs and error logs. What changed in the last 30-60 min?
00:15 - Decision: rollback, hotfix, or workaround? If it's not immediately clear, the IC escalates or calls in more engineers.
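
The question "what changed in the last 30-60 min?" from the 00:10 step is easier to answer if deploys are recorded somewhere queryable. Below is a rough sketch of that lookup, assuming a hypothetical in-memory list of `Deploy` records; in practice the data would come from your CI/CD system, whose API is not assumed here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Deploy:
    # Hypothetical record shape; adapt it to whatever your deploy log stores.
    service: str
    version: str
    finished_at: datetime


def recent_deploys(
    deploys: List[Deploy],
    incident_start: datetime,
    lookback: timedelta = timedelta(minutes=60),
) -> List[Deploy]:
    """Return deploys that finished within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [d for d in deploys if window_start <= d.finished_at <= incident_start]
    return sorted(hits, key=lambda d: d.finished_at, reverse=True)
```

The newest deploy is listed first because it is usually the first suspect.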

Communication: the "5 + 30" rule

Update the status page at least every 30 minutes during a SEV1, even if you have no new info. "Still investigating, ETA still unknown" is a better update than silence - customers at least know someone is working on the problem.

The first update must come within 5 minutes of detection, regardless of whether you yet know the cause.
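
The "5 + 30" rule reduces to a small piece of date arithmetic, which a bot in the incident channel could use to nag the Communications lead. This is a sketch under that assumption; `next_update_due` and `is_update_overdue` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE = timedelta(minutes=5)      # first update after detection
UPDATE_INTERVAL = timedelta(minutes=30)  # maximum gap between later updates


def next_update_due(detected_at: datetime, last_update_at: Optional[datetime] = None) -> datetime:
    """Return the time by which the next status page update must be posted."""
    if last_update_at is None:
        return detected_at + FIRST_UPDATE
    return last_update_at + UPDATE_INTERVAL


def is_update_overdue(
    now: datetime, detected_at: datetime, last_update_at: Optional[datetime] = None
) -> bool:
    """True when the deadline from next_update_due has already passed."""
    return now > next_update_due(detected_at, last_update_at)
```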

Rollback as default

For a SEV1 that started shortly after a deploy, rollback is the first choice, not a hotfix. A hotfix written under pressure at night is a source of more bugs. A rollback restores a known good state and gives you time for calm diagnostics in the morning.

A rollback requires:

Deployment versioning (Docker tags, Git tags, deployment artifact)
Database migrations that are backwards compatible (if the new version drops a column, rolling back to the old version will not bring that column or its data back)
A documented rollback procedure (commands, how long it takes, who can run it)
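
The "rollback as default" reasoning can be sketched as a small decision helper. Everything here is an assumption made for illustration: the function name, the three return values, and in particular the 60-minute threshold for "shortly after a deploy", which the playbook does not define.

```python
from datetime import datetime, timedelta
from typing import Optional


def recommend_first_action(
    incident_start: datetime,
    last_deploy_at: Optional[datetime],
    migrations_backward_compatible: bool,
    shortly_after: timedelta = timedelta(minutes=60),  # assumed threshold
) -> str:
    """Suggest a first move for a SEV1: 'rollback', 'escalate' or 'diagnose'."""
    if last_deploy_at is None:
        return "diagnose"
    elapsed = incident_start - last_deploy_at
    if not (timedelta(0) <= elapsed <= shortly_after):
        # No deploy close enough to blame; keep investigating.
        return "diagnose"
    if not migrations_backward_compatible:
        # The old version may not run against the new schema, so a rollback
        # is not automatically safe and needs a human decision.
        return "escalate"
    return "rollback"
```

The helper only suggests a first move; the decision stays with the IC.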

Post-mortem within 48 hours

After every SEV1/SEV2 incident, write a document with this structure:

Summary - 2-3 sentences on what happened, how long it lasted and who it affected
Timeline - exact times of detection, key actions and resolution
Root cause - why it happened (technical reason + procedural reason)
Impact - number of affected customers, business impact
What went well - what worked during the response (give the team credit)
What went wrong - where we lost time
Action items - specific tasks with an owner and deadline (not "test better" - rather "add an integration test for the payment flow within 14 days, owner: X")
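
The requirement that every action item has an owner and a deadline can be checked mechanically before a post-mortem is published. The sketch below assumes a hypothetical `ActionItem` record; the four-word minimum for a description is a crude, made-up heuristic for catching items like "test better", not an established rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

MIN_DESCRIPTION_WORDS = 4  # assumed heuristic for "specific enough"


@dataclass
class ActionItem:
    description: str
    owner: str
    due: Optional[date]


def action_item_problems(item: ActionItem) -> List[str]:
    """Return a list of reasons why the item is not yet actionable (empty if fine)."""
    problems = []
    if len(item.description.split()) < MIN_DESCRIPTION_WORDS:
        problems.append("description too short to be specific")
    if not item.owner.strip():
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing deadline")
    return problems
```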

Blameless post-mortem rule: You're not looking for a culprit, you're looking for a systemic weakness. "John deployed a bad version" is the wrong conclusion - the right one is "the deploy process didn't include a canary phase that would catch the bug before full rollout."

On-call rotation

For a team with a 24/7 product:

Weekly rotations are the sweet spot - shorter ones burn people out, longer ones overload the primary on-caller
Always have a primary + secondary on-call. The secondary takes over when the primary doesn't respond within 5 min.
Compensate for night and weekend paging (extra payment or time off)
Run a monthly "on-call review" - who was woken up most often, and what can be automated
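
A weekly primary + secondary rotation with the 5-minute takeover can be sketched in a few lines. This is an illustration only: paging tools such as those listed under Tools handle scheduling themselves, and the function names and the rule "next week's primary is this week's secondary" are assumptions.

```python
from datetime import date
from typing import List, Tuple

ESCALATE_AFTER_MIN = 5  # the secondary takes over after this many minutes


def on_call_for(day: date, engineers: List[str], rotation_start: date) -> Tuple[str, str]:
    """Return (primary, secondary) for the week containing `day`."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    week = (day - rotation_start).days // 7
    primary = engineers[week % len(engineers)]
    # Assumption: the secondary is whoever will be primary next week.
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary


def page_target(minutes_unacknowledged: float, primary: str, secondary: str) -> str:
    """Return who should be paged given how long the alert has gone unacknowledged."""
    if minutes_unacknowledged >= ESCALATE_AFTER_MIN:
        return secondary
    return primary
```

Using next week's primary as the secondary means each person's secondary week comes right before their primary week, so they are already familiar with current issues when they take over.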

Tools

Monitoring (ePulz.io, Datadog, New Relic) - detection + alerting
Paging (PagerDuty, Opsgenie, Better Stack) - escalation, rotation, SMS/voice
Status page (self-hosted, ePulz.io, Statuspage.io) - external communication
Runbook hosting (Notion, GitHub Wiki, internal docs) - playbooks and runbooks
Post-mortem template (Confluence, Notion template) - standardization

Conclusion

Incident response can't be improvised under pressure. 30 minutes invested in writing a playbook pays off at the first major outage - less chaos, shorter MTTR, better communication with customers and a team that knows what to do.

Start with automatic detection

ePulz.io solves the first minutes of incident response: detection + alerting via e-mail, Telegram and webhook to Slack / Discord / PagerDuty.
