Ayhan Sipahi 2026-06-21

Blameless Postmortems: Stop Asking Who Failed, Ask What Failed

A blameless postmortem model that fixes the system instead of finding a culprit, with a copy-paste template and where individual accountability still applies.

The same outage keeps coming back because your incident review found a culprit instead of a cause. That is the failure mode to fix. An organization does not get safer by naming the person who typed the wrong command. It gets safer when the same wrong command can no longer take the system down. “Who did it” feels like an answer, so it closes the review. “What allowed it” stays open and forces a system change. The model that follows targets controls instead of people, with a copy-paste template, two diagrams, and a clear line for where individual accountability still applies.

The operating model below holds up across teams, but “blameless” is a culture change. The metrics that prove it landed take quarters, not a sprint. Where a claim is a durable principle, it is stated as one. Where it is still an open question for your context, that is flagged too.

Two Question Sets, Two Outcomes

The fastest way to see the difference between a blame culture and a blameless one is to look at the questions each asks in the room. The questions are not cosmetic. They steer where the analysis stops and what the action items look like.

A blame culture asks: who did it, who is responsible, whose fault was this? These questions terminate at a person. Once a name is on the page, the review feels complete, so it ends. The predictable result is that people learn to hide mistakes and near-misses, because reporting them is punished. Risk becomes invisible to leadership, and the system never learns what almost broke it.

A blameless culture asks a different set: how did this happen, why was it possible, which control was missing? These questions terminate at the system, not the person. Because the answer is a missing guardrail rather than a careless human, the action item is a guardrail you can build. People disclose more, because disclosure is rewarded with a fix instead of punished with a label. It is the same psychological safety that turns code review into knowledge sharing rather than nitpicking.

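The two question sets create two self-reinforcing loops. In the blame loop, naming a person teaches people to hide mistakes and near-misses, hidden risk produces surprise incidents, and each surprise triggers another hunt for a name. In the blameless loop, asking about the system invites disclosure, disclosure leads to a guardrail, and each shipped guardrail makes the next disclosure safer. One loop starves the organization of information; the other feeds it.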

The reframe that holds both loops together is short: we do not ask who failed; we ask what failed. Everything that follows is mechanics for living up to that sentence.

What the Postmortem Is For

A postmortem is not a record of who was on shift. It is a written record of an incident: its impact, the actions taken to recover, the contributing factors, and the follow-up work that prevents recurrence. The Google SRE book is direct about the purpose: writing a postmortem is not punishment, it is a learning opportunity for the entire organization. The output is learning that ships, not a verdict that closes.

This reframing has a practical test. If your postmortem produces a tracked system change with an owner and a due date, it produced learning. If it produces a sentence like “the engineer will be more careful,” it produced blame wearing the costume of a corrective action. The first survives a re-read three months later; the second is forgotten by the next standup.

Senior participation matters here more than it looks. When a director sits in a blameless review and asks “what control was missing” instead of “who approved this,” the room learns the rules are real. The opposite is just as fast: one leader hunting for a name in one review will undo months of stated policy.

Human Error Is the Symptom, Not the Cause

This is the principle the whole model rests on. Human error is never the root cause of an incident; it is the symptom of something deeper in the system. The error is what the system allowed, not the reason the system was fragile.

Sidney Dekker frames this as the Old View versus the New View. The Old View treats human error as the cause of trouble, so the fix is to fix the human: retrain them, add an approval gate, write a warning in the runbook. The New View treats human error as a symptom of trouble already present in the system, so the fix is to fix the system that made the error easy and its consequences large. The two views look at the same incident and reach opposite action items.

John Allspaw gives this an operational lever, the “second story.” Every incident has a first story: what happened on the surface. It also has a second story: why the action made sense to the person at the time, given what they knew and saw. The load-bearing insight is simple. The action made sense to the person when they took it, because if it had not made sense, they would not have taken it. So the productive question is never “why were they careless.” It is “what made the wrong path look like the right path from where they stood.” That question points straight at the system: the confusing dashboard, the missing confirmation, the two environments that looked identical.

This is also why counterfactual reasoning is a trap. “They should have noticed the alert” describes a world that did not exist. In the world that existed, the alert was one of forty firing, or it was routed to a channel nobody watched. Analyze the conditions that made the wrong action reasonable, not the omniscient version of events you have now.

The Five Layers of Healthy Incident Analysis

A blameless analysis moves from the surface of the incident down to the system gap that allowed it, through five layers: the event, the trigger, the contributing factors, the control failure, and the system gap. Each layer is a deeper question than the one above it. Most blame cultures stop at the second layer, on the human who made the first move. The work is to keep going.

Walk a concrete case through the layers. The event is that a deployment dropped a production table. The trigger is that an engineer ran a migration against the wrong environment. A blame culture writes “engineer ran the wrong migration” and stops. A blameless analysis treats that as the trigger, not the conclusion, and keeps descending. The contributing factors: the staging and production consoles were visually identical, and credentials were shared in a single terminal session. The control failure: there was no confirmation prompt and no separate-account boundary between environments. The system gap: nobody owned environment isolation, so the guardrail was never built. The fix lives at layers four and five. Retraining the engineer fixes none of it; the next engineer inherits the same trap.
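
A control at layer four can be small. As a minimal sketch of the missing confirmation prompt in this case, the guard below refuses to run a migration against a protected environment unless the operator retypes the environment name. The names `confirm_target`, `run_migration`, and `PROTECTED_ENVIRONMENTS`, and the label "production", are hypothetical and do not come from any real migration tool.

```python
from typing import Callable

# Assumption: the label your tooling uses for the environment worth protecting.
PROTECTED_ENVIRONMENTS = {"production"}


class ConfirmationError(Exception):
    """Raised when the operator does not confirm a protected target."""


def confirm_target(environment: str, ask: Callable[[str], str] = input) -> None:
    """Require the operator to retype the name of a protected environment."""
    if environment not in PROTECTED_ENVIRONMENTS:
        return  # unprotected environments need no confirmation
    typed = ask(f"You are about to change '{environment}'. Retype the name to continue: ")
    if typed.strip() != environment:
        raise ConfirmationError(f"Confirmation for '{environment}' did not match; aborting.")


def run_migration(environment: str, migrate: Callable[[], None],
                  ask: Callable[[str], str] = input) -> None:
    """Run `migrate` only after the target environment has been confirmed."""
    confirm_target(environment, ask)
    migrate()
```

A prompt like this covers only the confirmation half of the control failure. The separate-account boundary between environments is an infrastructure change that a few lines of code cannot stand in for.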

James Reason’s Swiss Cheese Model is the picture behind this descent. Defenses are slices of cheese, each with holes; the holes are weaknesses that move and resize over time. An incident happens only when holes in several slices line up so a failure passes all the way through. Reason separates active failures, the unsafe act at the sharp end, from latent conditions, the dormant systemic weaknesses like poor tooling, unclear ownership, or a culture that punishes disclosure. Latent conditions can sit harmless for a long time before an active failure threads through them. A blameless postmortem is a hunt for latent conditions, because those are the holes you can permanently close.

The Enemy: Over-Attribution

The thing that quietly defeats most incident reviews is over-attribution: collapsing a multi-factor system failure onto a single human decision. Over-attribution feels like rigor. It produces a clean narrative, a clear owner, and a fast close. It is also wrong about how complex systems fail, and the wrongness costs you the next outage.

This is where I take a side on the most contested tool in the field: the Five Whys. Atlassian’s own blameless-postmortem guidance recommends using the Five Whys to walk the causal chain until you reach a “true root cause.” That is the interesting tension, because a champion of blameless culture is endorsing the exact tool that blameless purists reject. Allspaw’s “The Infinite Hows” and Salesforce’s “How, Not Why” make the case against it, and on balance the case is right for software systems.

The Five Whys has three problems. It is linear: it walks a single chain backward, while real incidents have several contributing factors interacting at once. It is non-repeatable: hand the same incident to three facilitators and you get three different “root causes,” because each “why” picks one parent and discards the rest. And it is reductionist: it promises a single root, which a tightly coupled system rarely has. A peer-reviewed critique of root cause analysis in healthcare makes the same point: the singular “root cause” framing pushes investigators toward a satisfying single story and away from the system.

My recommendation is to prefer “how” over “why,” and to prefer “contributing factors” over “root cause.” Ask how the failure became possible, which surfaces several conditions and invites you to map them. “Why” walks you to one person; “how” walks you to a set of system conditions. If your incident tooling ships a single “root cause” field, treat it as a label for the primary contributing factor, not as a claim that one cause existed.

Charles Perrow’s “Normal Accidents” is the honest backstop for leadership here. In systems that are both interactively complex and tightly coupled, some accidents are normal: not the result of negligence, but an inherent property of the system. Interactive complexity is the trigger; tight coupling is what propagates a small fault into a large one. The leadership message is uncomfortable and true: you cannot blame your way to zero incidents in a complex system. You can only design the system to fail smaller and recover faster.

What Strong Organizations Do Differently

The shift is from optimizing people to optimizing the system, and from punishment to correction. Strong incident cultures share a few concrete habits.

They write action items against controls, not humans. Every item changes a default, an automation, a guardrail, or an interface; none of them are “be more careful.” They track those items like any other prioritized work, with an owner and a due date, and they close them. Atlassian’s framing is blunt and correct: converting every corrective action into a tracked work item with an owner and a deadline is what separates teams that actually improve from teams that keep hitting the same problem.
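
The rule for action items can be checked mechanically before a review closes. The sketch below is one hypothetical way to do it: it flags items with no owner, no due date, or wording that asks a person to try harder. The `ActionItem` shape and the phrase list in `PERSON_FIX_PATTERNS` are assumptions for illustration; a real team would tune the phrases to its own language.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumption: phrases that signal an item aimed at a person rather than a control.
PERSON_FIX_PATTERNS = [r"\bbe more careful\b", r"\bretrain", r"\bpay (more )?attention\b"]


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def lint_action_item(item: ActionItem) -> List[str]:
    """Return the problems found in one action item; an empty list means it passes."""
    problems = []
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    text = item.description.lower()
    if any(re.search(pattern, text) for pattern in PERSON_FIX_PATTERNS):
        problems.append("asks a person to try harder instead of changing a control")
    return problems
```

A phrase list catches only the obvious cases. Whether an item really changes a default, an automation, a guardrail, or an interface still needs a human reading it.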

They study success, not only failure. Erik Hollnagel’s Safety-II distinction is useful: Safety-I tries to minimize what goes wrong, while Safety-II studies why things usually go right. Things go right because people make sensible adjustments to real conditions, the “work-as-done” that differs from the “work-as-imagined” in the runbook. The implication for postmortems is to study normal operations too, because that is where your real resilience lives, and where your runbook is quietly out of date.

They watch learning velocity, with a caveat. I find it useful to track the share of postmortems that produce a shipped system change, as a homegrown signal of whether reviews are doing real work. Be honest that this is not a standardized industry metric; it is a local indicator, not a benchmark you can cite. The point is the direction: a postmortem that ships a change taught the system something; one that ships nothing taught it nothing.

A few metrics are worth trending, none worth gaming. Action-item completion rate and median time-to-close show whether the loop closes. Repeat-incident rate shows whether you are fixing causes or symptoms. Mean time to detect and mean time to recover show whether your defenses and recovery are improving, as a trend rather than a vanity number. The most counterintuitive one is the near-miss reporting rate: in a healthy blameless culture, it should go up, because people feel safe surfacing what almost happened.
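
Several of these metrics are simple arithmetic once postmortems are recorded in a consistent shape. The sketch below assumes a hypothetical record format (`Postmortem`, `TrackedAction`, and an `incident_class` label your team assigns) and one possible definition of repeat-incident rate: the share of postmortems whose class already appeared earlier in the list. Treat these definitions as local choices, not as standards.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import List, Optional


@dataclass
class TrackedAction:
    opened: date
    closed: Optional[date] = None  # None means the item is still open


@dataclass
class Postmortem:
    incident_class: str            # hypothetical label, e.g. "duplicate-write"
    actions: List[TrackedAction]
    shipped_system_change: bool


def completion_rate(postmortems: List[Postmortem]) -> Optional[float]:
    """Share of all action items that have been closed."""
    items = [a for p in postmortems for a in p.actions]
    if not items:
        return None
    return sum(a.closed is not None for a in items) / len(items)


def median_days_to_close(postmortems: List[Postmortem]) -> Optional[float]:
    """Median number of days from open to close, over closed items only."""
    days = [(a.closed - a.opened).days
            for p in postmortems for a in p.actions if a.closed is not None]
    return median(days) if days else None


def repeat_incident_rate(postmortems: List[Postmortem]) -> Optional[float]:
    """Share of postmortems whose incident class was already seen earlier in the list."""
    if not postmortems:
        return None
    seen, repeats = set(), 0
    for p in postmortems:
        if p.incident_class in seen:
            repeats += 1
        seen.add(p.incident_class)
    return repeats / len(postmortems)


def shipped_change_share(postmortems: List[Postmortem]) -> Optional[float]:
    """Share of postmortems that produced a shipped system change."""
    if not postmortems:
        return None
    return sum(p.shipped_system_change for p in postmortems) / len(postmortems)
```

Each function returns `None` when there is nothing to measure, so an empty quarter does not show up as a misleading zero. The numbers are meant to be read as trends across quarters.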

A Postmortem Template You Can Copy

A postmortem needs a fixed skeleton so the analysis cannot quietly skip the system. The sections below are the minimum. The last one is the most important, and it should never be empty.

# Postmortem: [short incident title]

## Summary
One or two sentences: what broke, who was affected, how long.

## What happened (timeline)
Factual event chain in order. What was known at each step, from
what the responders could see at the time. No interpretation,
no blame, no hindsight.

## Impact
Users affected, duration, scope, and any measurable effect.
Units and numbers, not adjectives.

## Detection gap
How long until we knew. Why not sooner? Which signal was missing,
late, or routed somewhere nobody watched.

## Control gap
Which guardrail failed or was absent. The control that should have
stopped this, or made it smaller, and did not.

## Contributing factors
The conditions that made the failure possible: process, tooling,
communication, ownership. List several; resist collapsing to one.

## Action items
System corrections only. Each item changes a control, default,
automation, or interface. Each has an owner and a due date.
No "be more careful" items.

| Action | Owner | Due | Status |
|--------|-------|-----|--------|
|        |       |     |        |

## How do we prevent recurrence?
The single most important section. State plainly what makes this
exact failure impossible or much smaller next time, and which
action item delivers it.

Note what the template does not have: a field for who caused the incident. That omission is deliberate. Responders appear in the timeline by role (“the on-call engineer,” “the deploying team”), never as a name to assign fault to. PagerDuty’s guidance abstracts the actor to “an unspecified responder” for exactly this reason; the goal is to keep attention on the action and its context, not the individual.
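
If postmortems are stored as markdown files in the shape of the template, the "never empty" rule can be enforced with a small check. This is a minimal sketch, assuming each section starts with a `## ` heading exactly as in the template; `parse_sections` and `missing_or_empty` are hypothetical helper names.

```python
from typing import Dict, List

# Section headings copied from the template, in order.
REQUIRED_SECTIONS = [
    "Summary",
    "What happened (timeline)",
    "Impact",
    "Detection gap",
    "Control gap",
    "Contributing factors",
    "Action items",
    "How do we prevent recurrence?",
]


def parse_sections(markdown: str) -> Dict[str, str]:
    """Map each '## ' heading to the stripped text beneath it."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def missing_or_empty(markdown: str) -> List[str]:
    """Return the required sections that are absent or have no content."""
    sections = parse_sections(markdown)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]
```

The check only proves that a section has some text in it. An action-items table with a header row and no rows would still pass, so the review itself has to confirm that the content is real.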

When to Override: Blameless Is Not Consequence-Free

Blameless does not mean no accountability, and selling it that way to leadership will get the whole model rejected the first time someone acts in bad faith. The honest frame is Dekker’s Just Culture: most errors are honest mistakes the system allowed, and a few are genuine negligence, recklessness, or malice that sit outside the blameless frame. Drawing that line clearly is what makes the model defensible to the people who sign off on it.

The decision is not “punish or forgive.” It is a triage with a test in the middle.

The substitution test is the hinge. It comes from James Reason’s work and was popularized in Dekker’s Just Culture. Would another competent person, with the same information and the same pressures, likely have made the same choice? If yes, the problem is the situation, not the person, and you stay in blameless analysis. If a reasonable peer clearly would not have, you have moved toward genuine recklessness, and the accountability path applies. Even then, the system learning still happens; an accountability case does not excuse the organization from fixing the conditions.

PagerDuty makes a related, honest point: perfect blamelessness is unrealistic, because human bias toward attribution does not vanish on command. They prefer “blame awareness,” naming the bias so the room can resist it, over a claim of purity nobody can fully keep. Treat blameless as a discipline you practice, not a state you achieve.

A Short Illustration

In one project, a background job silently double-charged a small set of records. The first review found the engineer who had merged the change, noted that the test coverage “should have caught it,” and closed with a note to add more tests. The same class of bug returned within a couple of release cycles, from a different engineer and a different code path.

The second time, the review went down the layers instead of stopping at a name. The contributing factors were that the job had no idempotency guarantee and no dry-run mode, and that two services could both write the same record with no shared lock. The action items changed the system: an idempotency key on the operation, a dry-run flag in the job, and an alert on duplicate writes. That class of bug did not come back. The difference was not a better engineer. It was that the second review treated the human as the symptom and went looking for the gap.
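
The three action items fit together in a few lines. The sketch below is an in-memory illustration, not the project's actual code: `ChargeJob`, its method names, and the `duplicate_attempts` counter (a stand-in for the duplicate-write alert) are all hypothetical. A real job would keep the keys in a durable store with a uniqueness constraint, so that two services writing the same record are both covered.

```python
from typing import Dict


class ChargeJob:
    """In-memory sketch of an idempotent charge operation with a dry-run mode."""

    def __init__(self) -> None:
        self._applied: Dict[str, int] = {}  # idempotency key -> amount charged
        self.duplicate_attempts = 0         # stand-in for a duplicate-write alert

    def charge(self, idempotency_key: str, amount_cents: int, dry_run: bool = False) -> str:
        already_applied = idempotency_key in self._applied
        if dry_run:
            # Report what would happen without recording anything.
            return "would-skip-duplicate" if already_applied else "would-charge"
        if already_applied:
            self.duplicate_attempts += 1    # this is where the alert would fire
            return "skipped-duplicate"
        self._applied[idempotency_key] = amount_cents
        return "charged"

    def total_charged(self) -> int:
        return sum(self._applied.values())
```

With the key in place, running the job twice for the same record charges once and raises a signal on the second attempt, whoever triggered it.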

Common Pitfalls

A handful of failure modes show up again and again, each with a concrete fix:

- Stopping the analysis at a human. The trigger is a person; the cause is a system gap. Keep descending past the name to the missing control.
- Action items that say “retrain” or “be more careful.” Require every item to change a control, default, automation, or interface. If it only asks a human to try harder, it is not an action item.
- Counterfactual reasoning. “They should have noticed” describes a world that did not exist. Analyze what made the wrong path look reasonable at the time.
- Hindsight bias in the timeline. Build the timeline from what was known then, not what you know now. The responders did not have your summary.
- Treating blameless as consequence-free. Pair it with an explicit Just Culture line for negligence and malice, so leadership trusts the model.
- Action items with no owner and no due date. An untracked corrective action is a wish. Track it like any other prioritized work, and close it.

Conclusion

The blameless default holds for the overwhelming majority of incidents: assume the system allowed the error, descend through the five layers, and write action items against controls rather than people. Reach for the Just Culture override only when a competent peer clearly would not have made the same choice, which is the boundary that keeps “blameless” honest and keeps leadership on board. Prefer “how” over “why,” prefer “contributing factors” over a singular “root cause,” and judge a postmortem by whether it shipped a system change.

If you do one thing after reading this, change the question your next incident review starts with. Replace “who did this” with “what control was missing,” and let the template carry the analysis from there. Whether this culture change shows up in your repeat-incident rate is the thing to measure over the next few quarters; the principle is durable, the proof is local.

References

Google SRE Book — Postmortem Culture: Learning from Failure (Ch. 15) - Canonical definition of the blameless postmortem; “not punishment, a learning opportunity”; the role of senior management.
Google SRE Workbook — Postmortem Culture - Updated, more practical companion chapter with templates and adoption tactics.
Etsy / John Allspaw — “Blameless PostMortems and a Just Culture” - The foundational industry essay on the second story, just culture, and local rationality.
John Allspaw — “The Infinite Hows (or, the Dangers of the Five Whys)” - The case against linear root-cause analysis; prefer “how” over “why.”
Sidney Dekker — Just Culture - Old View versus New View; balancing safety and accountability; the substitution test.
Just Culture (Dekker) — summary - Accessible secondary summary of the book’s core arguments.
Swiss Cheese Model — Wikipedia - James Reason’s layered-defense model; active failures versus latent conditions.
Normal Accidents — Wikipedia - Perrow’s interactive complexity and tight coupling; why some failures are “normal.”
Erik Hollnagel — Safety-I and Safety-II - Resilience engineering; studying why things go right; work-as-done versus work-as-imagined.
Atlassian — How to run a blameless postmortem - Practical template and action-items-as-tracked-work; also where the Five Whys endorsement comes from.
PagerDuty — The Blameless Postmortem - “What and how, not why and who”; abstracting to an unspecified responder; “blame awareness.”
Salesforce Engineering — “How, Not Why” - A vendor-neutral reinforcement of the Five Whys critique.
BMJ Quality & Safety — “The problem with root cause analysis” - A peer-reviewed critique that root cause analysis promotes a reductionist single-cause view.
dastergon/postmortem-templates (GitHub) - An open collection of real postmortem templates to adapt for your own skeleton.
