2025-09-03 · Mika Okada
Runbooks That On-call Engineers Actually Open
Runbooks fail when they read like marketing copy for the service. We push for the first screen to answer three prompts: what broke, what safe checks to run, and who to ping if metrics disagree. Everything else moves down-page.
Command snippets should be copy-pasteable, with placeholders clearly marked. We saw better adoption when teams embedded direct links to the specific dashboards rather than a generic "open Grafana." VPN quirks matter: if a link only works on corp DNS, say so upfront.
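One way to keep placeholders honest is to mark them in a form a script can detect, so a half-filled command never reaches a terminal. The sketch below assumes an angle-bracket, upper-case convention such as <SERVICE>; that convention, the function name fill_placeholders, and the example values are illustrative choices, not something our rubric prescribes.

```python
import re

# Assumed convention: placeholders look like <NAMESPACE> or <SERVICE>.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def fill_placeholders(snippet, values):
    """Return the snippet with every <NAME> replaced by values[NAME].

    Raises ValueError if any placeholder has no value, so a command
    with a leftover <NAME> is never handed back for pasting.
    """
    missing = sorted({name for name in PLACEHOLDER.findall(snippet)
                      if name not in values})
    if missing:
        raise ValueError("unfilled placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], snippet)


# Hypothetical runbook snippet and values, for illustration only.
template = "kubectl -n <NAMESPACE> logs deploy/<SERVICE> --since=15m"
command = fill_placeholders(template, {"NAMESPACE": "payments", "SERVICE": "api"})
```

The same regular expression can double as a lint in review: any snippet in a published runbook that still matches it is a template, not a ready-to-run command, and should be labeled as such.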
We also encourage a "last verified" line with the verifying engineer's initials. Freshness signals matter at 3 a.m. During Incident Response Drills for SRE, participants rewrite each other's runbooks using this rubric, which surfaces gaps before a production incident does.
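A "last verified" line is easier to trust when something checks it. The following is a minimal sketch, assuming the line is written as "Last verified: YYYY-MM-DD" followed by two or three capital initials, and assuming a 90-day staleness threshold; both the format and the threshold are hypothetical and should be adapted to whatever a team actually uses.

```python
import re
from datetime import date

# Assumed line format: "Last verified: 2025-08-20 MO"
LAST_VERIFIED = re.compile(
    r"^Last verified:\s*(\d{4}-\d{2}-\d{2})\s+([A-Z]{2,3})\s*$",
    re.MULTILINE,
)


def verification_age(runbook_text, today):
    """Return (age_in_days, initials), or None if no such line exists."""
    match = LAST_VERIFIED.search(runbook_text)
    if match is None:
        return None
    verified_on = date.fromisoformat(match.group(1))
    return (today - verified_on).days, match.group(2)


def is_stale(runbook_text, today, max_age_days=90):
    """Treat a missing line as stale; the 90-day default is an assumption."""
    age = verification_age(runbook_text, today)
    return age is None or age[0] > max_age_days


# Hypothetical runbook text, for illustration only.
sample = "Payments API runbook\nLast verified: 2025-08-20 MO\n"
age_days, initials = verification_age(sample, date(2025, 9, 3))
```

Treating a missing line as stale is a deliberate choice in this sketch: a runbook nobody has signed off on should not look fresher than one verified long ago.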
The closing section should list known false positives. Nothing erodes trust faster than a runbook that cries wolf every week. Honest limitations keep the document alive.
#incident response #documentation