Postmortems that actually change behaviour

Two winters ago I sat in a postmortem for a settlement job that had double-processed a batch of refunds. By every standard the review was a good one: nobody got blamed, and the action items were sensible. I left satisfied, but later that afternoon I searched our incident archive for the same system and found a writeup from fourteen months earlier describing nearly the same failure. That document was thorough too, and it had five sensible action items. Three of them had never been started, including the one that would have prevented the incident we had just spent ninety minutes dissecting.

That old writeup changed my definition of what a postmortem is for, because we had clearly run an excellent meeting fourteen months earlier and learned nothing from it.

The ritual itself is worth defending. The blameless postmortem descends from the way aviation and medicine review their failures: assume the person closest to the failure acted reasonably on the information in front of them, and aim the investigation at why the system made that reasonable action dangerous. The rule is practical before it is kind: engineers tell the truth about what happened only when the truth is safe to tell, and the timeline you get from a frightened engineer is fiction. All of that holds, and I run reviews this way without exception.

What the standard telling skips is the order of operations: the safety cultures this practice came from spent years building trust before they asked anyone for an honest account of a bad day. Most engineering teams adopt the template first, declare blamelessness in the meeting invite, and wonder why the room stays guarded.

And even teams that get the honesty right stall where mine did: the meeting ends with a list, the list goes into a tracker, and nobody opens the tracker again. A postmortem succeeded only if the lesson stopped depending on anyone remembering it, and what follows is how I try to pass that test.

Define the product first

Before the meeting, I write down the exit question and bring it with me: what will someone on this team do differently after this, and where will that difference live? The writeup is not an answer to it: a document is the receipt for the learning, and a beautifully formatted receipt for a purchase that never happened is the most common postmortem outcome I know. If the best answer in the room is that people will be more careful, the postmortem has not happened yet. Nobody changes how they work by deciding to be more careful, because careful is what everyone already believed they were being.

Earn blamelessness between incidents

You earn the honesty a postmortem needs long before the incident. When an engineer tells you a migration is two weeks late, or mentions in standup that they nearly dropped a production table on Friday, your reaction in that small moment is the team's real safety policy, whatever the process document says. React with curiosity to the small failures and people bring you the truth in the big review; react with visible disappointment and the next timeline you receive gets edited for your comfort. You cannot announce trust into existence during the worst week of the quarter.

Blameless means no one is punished for the failure, but it does not mean no one is responsible for the fix: the person is safe, and the commitment still carries a name. Teams that drop the second half drift into a politeness where every incident is everyone's fault and therefore no one's, and engineers can tell when a ritual is being performed for the process rather than for them. I have written about psychological safety as observable behaviour in the general case; this is what it looks like in the room where production just broke.

Fund one fix, decline the rest out loud

The standard hygiene advice says every action item needs an owner and a deadline. True, and not enough. The dead items in our fourteen-month-old writeup had owners, but they never had room: they were filed as side work the roadmap was never asked to make space for, so the roadmap won, silently, the way it always does.

So I cap the output: one fix, sometimes two. Funded means a named owner and a place in the same capacity conversation as feature work, this sprint or next rather than "when things calm down." Every other candidate gets declined in the meeting, out loud: we are choosing not to do this one. Saying that sentence feels wrong the first few times, but it is still more honest than the silent version, where item four dies quietly at ninety days and the engineers watching conclude the ritual is decorative. An explicit no is a decision someone can revisit later with new evidence; a drifted item is just proof the meeting did not matter. Findings too big for one team's capacity are a budgeting conversation, which belongs with reliability as a product feature rather than dying as item six on a team-level list.

Build the lesson into the work

For the fix itself, I rank the options on a durability ladder, ordered by what it takes for the lesson to disappear.

People's heads. The lesson disappears on its own, within weeks, faster if anyone changes teams.
A checklist or a runbook. The lesson survives until the first real deadline crunch, which is exactly the moment it was written for.
The path of the work. The lint rule, the CI gate, the safer default, the permission nobody has anymore, the risky script deleted outright. The lesson survives until someone deliberately removes it, and removing it forces them to confront why it exists.

Aim as far down the ladder as the fix allows, and prefer subtraction where you can get it: deleting the footgun beats adding a warning label to it. The rung to distrust is any fix whose mechanism is human vigilance: "We will remember to check next time" is a wish. Teams do not refuse to learn; they forget under deadline pressure, reliably, every time, and a fix that requires them not to forget has failure built into its design.

The refund postmortem left us one fix on each of the bottom two rungs, which is why I trust the ladder. We wrote a pre-deploy checklist for settlement changes, and we added a CI check that blocked any settlement job missing an idempotency key. The checklist was followed carefully for a quarter and then died during a crunch nobody even remembers as a crunch, but the CI check is still there. New engineers hit it, ask why, and learn about the incident from the failure message, which links to the original writeup. The checklist needed someone to remember it; the gate never did.
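For readers who want to picture a gate like that, here is a minimal sketch in Python. It is not our actual check. The JobSpec shape, the "settlement" kind label, and the writeup URL are all hypothetical, and a real version would read job definitions from wherever your pipeline keeps them. The part worth copying is that the failure message carries the reason the gate exists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical link target: point this at wherever the original writeup lives.
WRITEUP_URL = "https://wiki.example.com/postmortems/refund-double-processing"


@dataclass
class JobSpec:
    """Illustrative job definition; a real pipeline would parse its own config."""
    name: str
    kind: str                              # e.g. "settlement" or "report"
    idempotency_key: Optional[str] = None  # None or "" counts as missing


def missing_idempotency_keys(jobs: List[JobSpec]) -> List[str]:
    """Return the names of settlement jobs that declare no idempotency key."""
    return [
        job.name
        for job in jobs
        if job.kind == "settlement" and not job.idempotency_key
    ]


def gate(jobs: List[JobSpec]) -> Tuple[int, str]:
    """Return (exit_code, message) suitable for a CI step."""
    offenders = missing_idempotency_keys(jobs)
    if not offenders:
        return 0, "ok: every settlement job declares an idempotency key"
    message = (
        "blocked: settlement jobs without an idempotency key: "
        + ", ".join(offenders)
        + "\nWhy this gate exists: "
        + WRITEUP_URL
    )
    return 1, message


# Example run with made-up jobs: one compliant, one offender, one out of scope.
example_jobs = [
    JobSpec("refund-batch", "settlement", idempotency_key="refund:{batch_id}"),
    JobSpec("payout-sweep", "settlement"),
    JobSpec("daily-report", "report"),
]
exit_code, text = gate(example_jobs)
print(text)
```

In a pipeline the step would exit with the returned code, so a non-zero result blocks the merge. Non-settlement jobs pass untouched, which keeps the gate narrow enough that nobody is tempted to route around it.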

Scale the review to surprise

Most incident processes scale the review to the severity: a sev-1 gets the full room, a sev-3 gets a thread in chat. Severity is the right dial for the response, the paging and the customer comms, but the wrong dial for the review, because severity measures customer impact, and reviews exist to fix your model of the system. The incident that deserves the deep review is the one that surprised you. A sev-1 with a boring, well-understood cause needs follow-through rather than investigation. When nobody can explain a sev-3, your map of the system is wrong somewhere, and the next sev-1 grows in exactly that blind spot.

We learned this the expensive way: a retry storm once briefly doubled our traffic to a partner bank's sandbox. Customer impact was near zero, a sev-3 by every definition, and nobody could fully explain why the backoff had not kicked in. Because our process keyed the review to severity, that question lived and died in a chat thread. Four months later the same backoff bug took down production payments for an afternoon. The hardest moment of the second postmortem was scrolling up and reading our own four-month-old shrug.

Look back at almost any incident and you find the warnings were visible for months. Reading those signals forward, before the incident, is its own discipline; reviewing by surprise is the same habit pointed backward.

Close the loop, then prune

Before the room empties, I book a fifteen-minute reconvene thirty days out, because a follow-up scheduled "once things settle" is sitting on the heads rung of the ladder. The reconvene asks three questions:

Did the funded fix ship?
Did the behaviour stick?
Did we add anything in the panic that we should now remove?

Across quarters, the metric I trust is recurrence: did this class of failure happen again, and if it did, did it at least fail differently? Action-item completion rate measures activity: a team can complete eighty percent of its items, all from the shallow end of the list, and keep having the same incident.
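The gap between the two numbers is easy to see in a small sketch. This one assumes each incident has been tagged by hand with a failure class, which is a labelling convention I am inventing for the example. The record shapes and the sample data are hypothetical too. Whether a repeat "failed differently" remains a judgment call that code like this does not make.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Incident:
    failure_class: str  # hypothetical hand-assigned label, e.g. "double-processing"
    quarter: str


@dataclass
class ActionItem:
    title: str
    done: bool


def completion_rate(items: List[ActionItem]) -> float:
    """Share of action items marked done: a measure of activity."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.done) / len(items)


def recurring_classes(incidents: List[Incident]) -> Dict[str, int]:
    """Failure classes that appear more than once, with their counts."""
    counts = Counter(incident.failure_class for incident in incidents)
    return {cls: n for cls, n in counts.items() if n > 1}


# Made-up data: four of five items done, yet the same class of failure came back.
items = [
    ActionItem("update runbook", True),
    ActionItem("add dashboard panel", True),
    ActionItem("write checklist", True),
    ActionItem("send summary email", True),
    ActionItem("add idempotency gate", False),
]
incidents = [
    Incident("double-processing", "Q1"),
    Incident("retry-storm", "Q2"),
    Incident("double-processing", "Q4"),
]

print(f"completion rate: {completion_rate(items):.0%}")
print(f"recurring classes: {recurring_classes(incidents)}")
```

In this sample the completion rate looks healthy while the recurrence table shows the same class of incident twice, and the one unfinished item is the only one that lived in the path of the work.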

The pruning question matters more the longer a team lives: every incident tends to add a control, and almost no team has a ceremony for removing one. Let that compound for two years and deploys take a week because half the gates guard against incidents nobody remembers, and the team starts routing around its own process, which is how the next incident starts. Once a year I list every control a postmortem created and ask which ones we would not reinstate today. Deleting a control nobody would reinstate is also a behaviour change, and it is the one that proves the ritual serves the team rather than the other way around. The scheduled cousin of all this, the retro that fires on a calendar instead of on a failure, has its own essay.

When the actor was an agent, review the gate

A growing share of incidents now have a generated change or an autonomous action somewhere in the causal chain, and I watch reviews handle them badly in two predictable directions. The first is fix-the-prompt: the review concludes that the instructions given to the tool were imprecise, because the prompt is the lever engineers feel they control, and the systemic questions never get asked. The second is quieter and worse: we stop blaming the engineer who typed the change and start blaming the engineer who approved it. "You should have caught the agent's mistake" is the new "you fat-fingered it," and it produces the same hiding it always did.

When the actor was an agent, the system under review is the gate the human stood at:

Did the approver have the time, context, and authority the gate's design assumed?
Was the blast radius what the autonomy decision believed it was?
Would anything other than luck have caught this before production?

Ratchet up vigilance requirements instead and engineers do what people always do when asked for more attention than anyone has: they rubber-stamp faster, or they abandon the tool. Either way the organization learns nothing about its gates.

The review of this kind I am least proud of spent twenty minutes on the approving engineer's diff-reading habits and two on the fact that the change had gone to every market at once. We were interrogating one person's attention while the gate's blast radius sat in plain sight, unexamined.

Containing and rolling back a misbehaving agentic system is its own subject, and so is detecting the failures nothing flagged. The postmortem's share of the agent era is the gate.

When a postmortem is the wrong tool

If the failure was a deliberate bet (a feature that might flop, a migration that might not pay off), then the best review was written before the work started: kill criteria, agreed in advance, that say what failure will look like while walking away is still cheap. No retro competes with that on its own ground. Postmortems earn their keep on the other failures, the ones nobody could have pre-committed around, where the system did something your model said it would not. The before-the-fact mirror of this essay is rehearsing failures you have not had yet, and the two practices feed each other.

So the meeting and the writeup both matter, and so does where the writeup lives so the team can find it in fourteen months, which is a question about institutional memory. But none of those is the product. When the meeting is forgotten and the document is three reorgs deep in a wiki, what about this team still behaves differently?

Of everything that came out of our refund incident, the only fix still working today is the one that no longer needs us to remember it.

Starter kit

A run sheet to tick through, from the all-clear to the annual prune

Before the meeting

Review depth matches how surprised you were (severity sized the response and the comms)
Exit question written down: what will someone do differently, and where will that difference live?

In the room

Funded fix (one, maybe two): ___________ Owner: ___________ Sprint: ___________
Ladder rung the fix sits on (aim for the path of the work): ___________
Every other candidate declined out loud, by name: ___________
Agent in the causal chain? Gate questions asked (approver's time and context, blast radius, detection without luck)

Before the room empties

Thirty-day reconvene on the calendar: date ___________

Thirty days later

Funded fix shipped?
Behaviour stuck?
Anything added in the panic now removed: ___________

Once a year

Postmortem-born controls listed, and the ones you would not reinstate deleted. Last prune: ___________

Run this when: the incident is closed and the review invite is about to go out.

A postmortem succeeded only if the lesson stopped depending on anyone remembering it.